key: cord-0874080-8ep6w2of
authors: Ahmed, Imran; Ahmad, Misbah; Rodrigues, Joel J.P.C.; Jeon, Gwanggil; Din, Sadia
title: A Deep Learning-Based Social Distance Monitoring Framework for COVID-19
date: 2020-11-01
journal: Sustain Cities Soc
DOI: 10.1016/j.scs.2020.102571
sha: 17d3b28205fd66478772b796c3dc7e2b20a28425
doc_id: 874080
cord_uid: 8ep6w2of

The ongoing COVID-19 coronavirus outbreak has caused a global disaster with its deadly spread. In the absence of effective remedial agents and with a shortage of immunizations against the virus, population vulnerability increases. Since no vaccines are currently available, social distancing is thought to be an adequate precaution (norm) against the spread of the pandemic virus: the risk of transmission can be minimized by avoiding physical contact among people. The purpose of this work is therefore to provide a deep learning platform for social distance tracking from an overhead perspective. The framework uses the YOLOv3 object detection paradigm to identify humans in video sequences. A transfer learning methodology is also implemented to increase the accuracy of the model: the pre-trained detector is connected to an extra layer trained on an overhead human data set. The detection model identifies people and returns bounding box information. Using the Euclidean distance, the pairwise distances between the centroids of the detected bounding boxes are determined. To estimate social distance violations between people, an approximation of physical distance to pixels is used, and a violation threshold is established to evaluate whether or not a distance value breaches the minimum social distance. In addition, a tracking algorithm follows individuals across video sequences, so that a person who violates/crosses the social distance threshold is also tracked.
Experiments are carried out on different video sequences to test the efficiency of the model. The findings indicate that the developed framework successfully distinguishes individuals who walk too close and breach/violate the social distance, and that the transfer learning approach boosts the overall efficiency of the model. The detection model achieves an accuracy of 92% without transfer learning and 98% with it; the tracking accuracy of the model is 95%.

Reviewer comment: The purpose of this work is to provide a deep learning platform for social distance tracking. The framework uses the YOLOv3 object recognition paradigm to identify humans in video sequences. The transfer learning methodology is implemented to increase the accuracy of the model. The detection algorithm uses a pre-trained algorithm. To estimate social distance violations between people, the authors used an approximation of physical distance. Overall, the conceptual methodology of the paper is nice, and the readability of the paper is excellent for end users to understand; researchers can take this paper as a base paper to undertake advanced research in this area.

Response: Dear Reviewer, we really admire your deep understanding of the area presented in the manuscript. We are thankful for appreciating our work.

Reviewer #4: This paper presents a deep learning-based social distance monitoring framework using an overhead perspective. A pre-trained YOLOv3 model with transfer learning is provided. The presented results are interesting and can be accepted after revision. The following issues must be addressed before the publication of the manuscript.

Response: Thank you for providing us this opportunity to further revise our manuscript. We appreciate the very positive and constructive comments from the Reviewer.

Major concerns:

1.(a) Since all information is obtained by camera, how did the authors calculate the tracking accuracy?

Response: Dear Reviewer, the camera is used only for data recording.
The tracking algorithm is entirely separate (not built into the camera). We used the centroid tracking algorithm, which mainly uses the centroid information of the bounding boxes to track multiple people. For details, kindly refer to the methodology section of the paper.

(b) What does "one hundred percent accurate" mean?

Response: Respected Reviewer, we have not made such a claim in the paper; kindly refer to Fig. 13, Fig. 14, and the Performance Evaluation section in the revised manuscript.

(c) What is the reference/real value? And how is the reference/real value obtained?

Response: Dear Reviewer, we used manually annotated values against the values automatically predicted by the algorithm in the paper.

2. For the overhead view perspective, the sizes of the detected bounding boxes for people located in the center and at the boundaries are significantly different. How is the accuracy kept at the same level for the whole detection zone?

Response: Dear Reviewer, we really appreciate your gravity of knowledge. You are right that the shape of a person differs significantly from an overhead perspective, which is where the pre-trained algorithm fails; that is why transfer learning is adopted in this work and the model is trained on the overhead data set.

3. Why does the training loss fluctuate over time (Figure 10), while the training accuracy keeps growing?

Response: Dear Reviewer, we really appreciate your knowledge in this regard. As the number of epochs increases, the model's training accuracy improves and, correspondingly, the training reduces the loss function.

4. As it is peppered with some grammatical errors and typos, the manuscript should be completely reviewed.

Response: Dear Reviewer, the revised manuscript has been updated, and grammatical errors and typos have been removed.

Minor comments:

5. Page 2, Abstract: Please avoid using the same word over and over again, such as "check" and "therefore".

Response: The suggestion has been incorporated; kindly refer to the highlighted section in the revised manuscript.

6. Page 3, Line 15-19: "identity" should be "identify".
Response: Thanks for highlighting our mistake. The suggestion has been incorporated.

7. Page 3, Line 22-25: I suggest a rewrite of the last sentence to make it clear.

Response: Dear Reviewer, the suggestion has been incorporated.

8. Page 3, Line 37: It should be "a pandemic disease".

Response: Dear Reviewer, thanks for highlighting the mistake; the change has been made.

9. Page 4, Figure 1: A comma symbol is found in the title of Figure 1(b).

Response: Dear Reviewer, thanks for highlighting the mistake; the change has been made.

10. Page 5, Figure 2: The size of the "half person" is slightly smaller than the "whole person", and they are not aligned.

Response: Dear Reviewer, thanks for highlighting the mistake; the change has been made.

12. Page 10, Line 11-13: The first sentence of Section 3 is grammatically incorrect.

Response: Dear Reviewer, thanks for highlighting the mistake; the change has been made.

13. Page 11, Line 8-9: What does COCO represent? Please give some detailed information.

Response: Dear Reviewer, thanks for highlighting the mistake; the change has been made.

14. Page 11, Figure 5: How did the authors obtain the detected bounding boxes (only people)?

Response: Dear Reviewer, the pre-trained model is trained for objects of different classes. In this work, we only consider the human class and trained the model for the human class; that is why the model automatically detects only people's bounding boxes.

15. Page 11, Figure 5: If two people are close, which set will the information be added into? "Yes or No" signs should be labeled.

Response: Dear Reviewer, Fig. 5 is updated in the revised manuscript.

Figure 7: The font size of "a" and "b" is different. Also, the label form should be consistent throughout the whole manuscript. "(a)" and "(b)" are used for Figure 4, but other figures are labeled as "a, b, c, d…". Please correct them.

Response: Dear Reviewer, thank you so much for highlighting the mistake; we have corrected it in the manuscript.
Preprint submitted to Journal of LaTeX Templates, October 14, 2020.

People who have no symptoms but are infected with the virus also play a part in the virus's spread [3]. Therefore, it is necessary to maintain at least 6 feet of distance from others, even if people do not have any symptoms.

Figure 1: Latest numbers of confirmed cases and deaths reported by the WHO due to the pandemic [3]. (a) Region-wise number of confirmed cases (October 7, 2020); (b) region-wise number of deaths (October 7, 2020).

Social distancing refers to measures that curb the virus's spread by minimizing physical contact between humans, such as avoiding masses at public places (e.g., shopping malls, parks, schools, universities, airports, workplaces), evading crowd gatherings, and maintaining an adequate distance between people [4], [5]. Social distancing is essential, particularly for those people who are at higher risk of serious illness from COVID-19.
By decreasing the risk of virus transmission from an infected person to a healthy one, the virus's spread and the disease's severity can be significantly reduced [6] (Figure 2). If social distancing is implemented at the initial stages, it can play a pivotal role in overcoming the virus's spread and preventing the pandemic disease's peak, as illustrated in Figure 3 [7]. It can be observed that social distancing can decrease the number of infected patients and reduce the burden on healthcare organizations. It also lowers mortality rates by ensuring that the number of infected cases (patients) does not surpass public healthcare capacity [8].

In the past decades, computer vision, machine learning, and deep learning have shown promising results in several daily life problems. Recent improvements in deep learning have made object detection tasks [9] more effective. Researchers [10], [11], [12] often utilize these methods to measure social distancing among people across the moving frames, as seen in Figure 4 [12] (the examples include Deepsort used to monitor social distancing on the Oxford Town Center data set). In this work, we used an overhead view to provide an effective framework for social distance monitoring. Some scholars, e.g., [13], [14], [15], [16], [17], [18], [19], and [20], use an overhead perspective for human detection and tracking.
The overhead perspective offers a better field of view and overcomes the issues of occlusion. The main contributions of this work are:

• To present a deep learning-based social distance monitoring framework using an overhead view perspective.

• To deploy pre-trained YOLOv3 for human detection and compute the centroid information of the bounding boxes. In addition, a transfer learning method is applied to enhance the performance of the model: the additional training is performed with an overhead data set, and the newly trained layer is appended to the pre-trained model.

• To track the social distance between individuals, using the Euclidean distance to approximate the distance between each pair of centroids of the detected bounding boxes. In addition, a social distance violation threshold is specified using a pixel-to-distance estimation.

• To utilize a centroid tracking algorithm to keep track of a person who violates the social distance threshold.

• To assess the performance of pre-trained YOLOv3 by evaluating it on an overhead data set. The output of the detection framework is assessed with and without transfer learning. Furthermore, the model's performance is also compared with other deep learning models.

The rest of the paper is structured as follows. The related work is presented in Section 2. The deep learning-based social distance monitoring framework is presented in Section 3. The overhead view data set used for training and testing during experimentation is briefly discussed in Section 4.
The detailed analysis of the output results and the performance evaluation of the model with and without transfer learning are also illustrated in this section. The conclusion of the given work, with potential future plans, is provided in Section 5.

After the rise of the COVID-19 pandemic in late December 2019, social distancing was deemed the most reliable practice to prevent contagious virus transmission, and it was adopted as standard practice on January 23, 2020 [23]. Within one month, the number of cases rose exceptionally, with two thousand to four thousand new confirmed cases reported per day in the first week of February 2020. Later, there was a sign of relief when, for the first time, five successive days up to March 23, 2020 passed with no new confirmed cases [24]. This is because of the social distancing practice initiated in China and later adopted worldwide to control COVID-19. Kylie et al. [25] investigated the relationship between a region's economic situation and the strictness of its social distancing. The study revealed that moderate stages of exercise could be allowed for evading [28]. Until now, researchers have done considerable work on detection [29], [30], & [31]; some provide a smart healthcare system for the pandemic using the Internet of Medical Things [32], & [33]. Prem et al. [34] studied the impact of social distancing on the spread of the COVID-19 outbreak. The studies concluded that the early and immediate practice of social distancing could gradually reduce the peak of the virus attack, although, while social distancing is crucial for flattening the infection curve, it is an economically unpleasant step. In another study, a drone camera and the YOLOv3 algorithm help identify social distance and monitor people wearing masks in public from a side or frontal view. Pouw et al. [36] suggested an efficient graph-based monitoring framework for physical distancing and crowd management. [37] performed human detection in a crowded situation; the model is designed for individuals who do not obey a social distance restriction, i.e., 6 feet of space between them. The authors used a mobile robot with an RGB-D camera and a 2-D lidar for collision-free navigation in mass gatherings. From the literature, we conclude that researchers have done a considerable amount of work on monitoring social distance in public environments, but most of the work is focused on the frontal or side view camera perspective.
Therefore, in this work, we present an overhead-view social distance monitoring framework that offers a better field of view and overcomes the issues of occlusion, thereby playing a key role in social distance monitoring by computing the distance between people.

Researchers typically use a frontal or side perspective for social distance monitoring, as discussed in Section 2. In this work, a deep learning-based social distance monitoring framework using an overhead perspective is introduced. The flow diagram of the framework is shown in Figure 5. The recorded overhead data set is split into training and testing sets. A deep learning-based detection paradigm is used to detect individuals in the sequences. A variety of object detection models are available, such as [38], [39], [40], [41], [42], and [43]; due to its best performance results for generic object detection, YOLOv3 [22] is used in this work. The model uses a single-stage network architecture to estimate the bounding boxes and class probabilities. The model was originally trained on the COCO (Common Objects in Context) data set [44].

After detection, the bounding box information, mainly the centroid information, is used to compute the distance between each pair of bounding box centroids. We used the Euclidean distance and calculated the distance between each pair of detected bounding boxes of people. After computing the centroid distances, a predefined threshold is used to check whether the distance between any two bounding box centroids is less than the configured number of pixels. If two people are close to each other and the distance value violates the minimum social distance threshold, the bounding box information is stored in a violation set, as seen in Figure 5, and the color of the bounding box is updated/changed to red. A centroid tracking algorithm is adopted so that people who violate/breach the social distancing threshold are tracked.
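The pairwise distance check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the 75-pixel threshold are illustrative choices, and centroids are assumed to be (x, y) pixel coordinates.

```python
import math

def find_violations(centroids, min_distance_px=75.0):
    """Return the indices of people whose pairwise Euclidean distance
    falls below the configured pixel threshold (the 'violation set')."""
    violations = set()
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) < min_distance_px:
                violations.update((i, j))  # both people are flagged (drawn red)
    return violations

# Example: persons 0 and 1 are 50 px apart (a violation); person 2 is far away.
print(find_violations([(100, 100), (140, 130), (400, 400)]))
```

In the paper's pipeline, the indices returned here would be used to recolor the corresponding bounding boxes red and to increment the on-screen violation counter.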
At the output, the model displays information about the total number of social distancing violations, along with the detected people's bounding boxes and centroids.

In this work, YOLOv3 is used for human detection as it improves predictive accuracy, particularly for small-scale objects. Its main advantage is an adjusted network structure for multi-scale object detection. Furthermore, for object classification, it uses independent logistic classifiers rather than softmax. The model's overall architecture is presented in Figure 6; it can be seen that feature learning is performed using the convolutional layers, also called residual blocks. The blocks are made up of many convolutional layers and skip connections. The model's unique characteristic is that it performs detection at three separate scales, as depicted in Figure 6. Convolutional layers with a given stride are used to downsample the feature map and transfer invariant-sized features [22]. Three feature maps, as shown in Figure 6, are utilized for object detection.

The architecture shown in Figure 6 is trained using an overhead data set. For that purpose, a transfer learning approach is adopted, which enhances the efficiency of the model. With transfer learning, the model is additionally trained without dropping the valuable information of the existing model. Further, the layer additionally trained on the overhead data set is appended to the existing architecture. In this way, the model takes advantage of both the pre-trained and the newly trained layers. The architecture shown in Figure.
6 uses a single-stage network on the entire input image to predict the bounding boxes and class probabilities of detected objects. For feature extraction, the architecture utilizes convolutional layers, and for class prediction, fully connected layers are used. During human identification, as seen in Figure 6, the input frame is divided into S regions, also called grid cells. These cells are related to bounding box estimation and class probabilities. The model predicts the probability of whether the center of a person's bounding box lies in the grid cell or not. In Equation 1, Pr(p) indicates whether the person is present in the detected bounding box or not; the value of Pr(p) is 1 for yes and 0 for no. IoU(pred, actual) determines the Intersection over Union of the actual and predicted bounding boxes, defined in [22], where the ground-truth box (actual), manually labeled in the training data set, is represented by BoxT, the predicted bounding box is denoted by BoxP, and area presents the area of intersection. An acceptable area is predicted and decided for each detected person in the input frame. A confidence value is applied after prediction to achieve the optimal bounding box. For each predicted bounding box, x, y, w, h are estimated, where the bounding box coordinates are defined by x, y, and the width and height by w, h. The model produces the predicted bounding box values shown in Figure 7 and Equation 3 [22]. A threshold value is defined that retains high confidence values and discards low confidence values. Using non-maximal suppression, the final location parameters are derived for the detected bounding box. Finally, the loss function is calculated for the detected bounding box [22].
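The IoU definition referenced above (Equation 2) can be written out as a small worked example. This is the standard corner-coordinate formulation, assuming boxes given as (x1, y1, x2, y2), and is a sketch rather than the authors' implementation.

```python
def iou(box_t, box_p):
    """Intersection over Union of a ground-truth box (BoxT) and a
    predicted box (BoxP), each given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(box_t[0], box_p[0]), max(box_t[1], box_p[1])
    ix2, iy2 = min(box_t[2], box_p[2]), min(box_t[3], box_p[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    return inter / float(area_t + area_p - inter)

# Two 10x10 boxes overlapping in a 5x5 patch: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

During non-maximal suppression, a score like this is compared against the confidence threshold mentioned above to keep the best box per person.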
The given loss function is the sum of three components: regression, classification, and confidence. At each grid cell, if an object is detected, the classification loss is computed as the squared error of the conditional class probabilities [22]. The regression loss for the bounding box of the detected object, i.e., a person, is then added; it is defined in [22]. Finally, the confidence loss, given in Equation 6, is calculated as in [22], where the confidence score is defined as C*, for the j-th bounding box in grid cell i, and 1^{obj}_{ij} is equal to 1 if the j-th bounding box in cell i is responsible for detecting the object, and 0 otherwise. If the object is not detected, the confidence loss is given as in Equation 7 [22], where 1^{noobj}_{ij} is defined as the complement of 1^{obj}_{ij}, C* is the bounding box's confidence score in cell i, and λ_noobj is used to weight down the loss when detecting background. Since, in most cases, detected bounding boxes do not contain any objects, a class imbalance problem arises: the model is more frequently trained to detect background rather than objects. To solve this, the loss is weighted down by the factor λ_noobj (default: 0.5).

After detecting people in the video frames, in the next step, the centroid of each detected person's bounding box (shown as green boxes) is used for distance calculation, as shown in Figure 8(b). The detected bounding box coordinates (x, y) are used to compute the bounding box's centroid. Figure.
8(c) demonstrates accepting a set of bounding box coordinates and computing the centroid. After computing the centroid, a unique ID is assigned to each detected bounding box. In the next step, we measure the distance between each pair of detected centroids using the Euclidean distance, and this is repeated for every subsequent frame in the video stream.

The detailed descriptions of the various experiments carried out in this work are presented in this section. For social distance monitoring, an indoor data set recorded at the Institute of Management Sciences, Hayatabad, Peshawar, Pakistan is used [16] & [46], containing video sequences captured from the overhead view. The data collection is divided into 70% training and 30% testing. There is no restriction on the mobility of persons throughout the scene: people in the scene move freely, and their visual appearance is affected by radial distance and camera position. From the example frames, it can be observed that the humans' visual appearance is not identical, and people's heights, poses, and scales vary in the data set. For implementation, we used OpenCV. The experimental results are divided into two subsections: first, the pre-trained model's testing results are discussed; in the second subsection, the results of the detection model after applying transfer learning and training on the overhead data set are explained. For comparison, the model is tested using the same video sequences. The performance evaluation of the model is also made in this section.

In Figure 9, the testing results of the social distance framework using the pre-trained model [22] are visualized.
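The centroid tracking step described above can be sketched as follows. This is a hedged, minimal version assuming nearest-centroid matching: the class name is hypothetical, and a production tracker (such as the one used in the paper) would also handle people entering and leaving the scene, which this sketch omits.

```python
import math

class CentroidTracker:
    """Minimal centroid tracker: assign each detection a unique ID, then on
    later frames match each detection to the nearest known centroid."""

    def __init__(self):
        self.next_id = 0
        self.objects = {}  # id -> (x, y) centroid

    def update(self, centroids):
        if not self.objects:
            # First frame: register every detected centroid with a new ID.
            for c in centroids:
                self.objects[self.next_id] = c
                self.next_id += 1
        else:
            # Later frames: move each ID to its nearest new centroid.
            # (No handling of appearing/disappearing people in this sketch.)
            for c in centroids:
                nearest = min(self.objects,
                              key=lambda i: math.dist(self.objects[i], c))
                self.objects[nearest] = c
        return self.objects

tracker = CentroidTracker()
tracker.update([(10, 10), (200, 200)])         # IDs 0 and 1 assigned
print(tracker.update([(12, 11), (198, 205)]))  # same people, slightly moved
```

Because IDs persist across frames, a person flagged by the violation check can keep being followed through the video, which is the tracking behavior the paper describes.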
The testing results are evaluated using different video sequences. The people in the video sequences move freely in the scenes; it can be seen from the sample frames that the individuals' visual appearance is not identical to the frontal or side view (Figure 9). A person's size also varies at different locations, as shown in Figure 9. Since the model only considers the human (person) class, only objects with an appearance like a human are detected by the pre-trained model. The pre-trained model delivers good results and detects person bounding boxes of various sizes, shown with green rectangles in Figure 9. It can be seen that after person detection, the distance between each pair of detected bounding boxes is measured to check whether a person in the scene violates the social distance or not.

The model is then tested on the same test video sequences, as discussed in the above subsection. The experimental findings reveal that transfer learning significantly improves the detection results, as seen in Figure 12. In Figure 12(e), a violation is detected; however, the number of people present in the scene is small compared to Figure 12(b), where all people maintain social distance and therefore not a single violation is observed. The same behavior can be found in the other frames of Figure 12, through (g). In Figure.
13, it can be observed that when the model is additionally trained on the overhead view data set, the overall performance of the detection model improves. The tracking accuracy is also given in Figure 14. We also compared the newly trained YOLOv3 with other deep learning models. The true detection and false detection rates of the different deep learning models are depicted in Table 1. From the results, it can be seen that transfer learning improves the results significantly for the overhead view data set. The false detection rates of the different deep learning models are very small, about 0.4% to 0.7% without any training, which reveals the effectiveness of deep learning models. Different pre-trained object detection models were tested on the overhead data set; although these models were trained on different frontal data sets, they still show good results, achieving an accuracy of 90%.

In this work, a deep learning-based social distance monitoring framework is presented using an overhead perspective. The pre-trained YOLOv3 paradigm is used for human detection. As a person's appearance, visibility, scale, size, shape, and pose vary significantly from an overhead view, the transfer learning method is adopted to improve the pre-trained model's performance.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 J o u r n a l P r e -p r o o f no symptoms but are infected with the virus also play a part in the virus spread [3] . Therefore, it is necessary to maintain at least 6 feet distance from others, even if people do not have any symptoms. Social distancing associates with the measures that overcome the virus' spread, by minimizing the physical contacts of humans, such as the masses at public places (e.g., shopping malls, parks, schools, universities, airports, workplaces), evading crowd gatherings, and maintaining an adequate distance between people [4] , [5] . Social distancing is essential, particularly for those people who are at higher risk of serious illness from COVID-19. By decreasing the risk of virus transmission from an infected person to a healthy, the virus' spread and dis- 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 In the past decades, computer vision, machine learning, and deep learning have shown promising results in several daily life problems. Recent improvement in deep learning allows object detection tasks [9] more effective. Researchers [10] , [11] , [12] , often utilize these methods to measure social distancing among people across the moving frames, as seen in Figure. Deepsort to monitor social distancing on Oxford Town Center, and (f) [12] . In this work, we used an overhead view to provide an effective framework for social distance monitoring. Some scholars, e.g. 
[13] , [14] , [15] , [16] , [17] , [18] , [19] , and [20] use an overhead perspective for human detection and tracking. The overhead perspective offers a better field of view and overcomes the issues 5 • To present a deep learning-based social distance monitoring framework using an overhead view perspective. • To deploy pre-trained YOLOv3 for human detection and computing their bounding box centroid information. In addition, a transfer learning method is applied to enhance the performance of the model. The additional training is performed with overhead data set, and the newly trained layer is appended to the pre-trained model. • In order to track the social distance between individuals, the Euclidean 6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 J o u r n a l P r e -p r o o f distance is used to approximate the distance between each pair of the centroid of the bounding box detected. In addition, a social distance violation threshold is specified using a pixel to distance estimation. • Utilizing a centroid tracking algorithm to keep track of the person who violates the social distance threshold. • To assess the performance of pre-trained YOLOv3 by evaluating it on an overhead data set. The output of the detection framework is assessed with and without the transition of learning. Furthermore, the model performance is also compared with other deep learning models. The rest of the work discussed in the paper is structured as follows. The related work is presented in Section.2. A deep learning-based social distance monitoring framework has been presented in Section.3. The overhead view data set used for training and testing during experimentation is briefly discussed in Section.4. The detailed analysis of output results and performance evaluation of the model with and without transfer learning is also illustrated in this Section. 
The conclusion of the work, with potential future directions, is provided in Section 5.

After the rise of the COVID-19 pandemic in late December 2019, social distancing was deemed the most reliable practice to prevent transmission of the contagious virus and was adopted as a standard practice on January 23, 2020 [23]. Within one month, the number of cases rose exceptionally, with two to four thousand new confirmed cases reported per day in the first week of February 2020. Later, there was a sign of relief for the first time, with five successive days up to March 23, 2020 recording no new confirmed cases [24]. This is attributed to the social distancing practice initiated in China and later adopted worldwide to control COVID-19. Kylie et al. [25] investigated the relationship between a region's economic situation and the strictness of its social distancing measures; the study revealed that moderate levels of activity could be allowed while still evading the spread. Researchers have since done considerable work on detection [29], [30], [31], and some provide smart healthcare systems for the pandemic using the Internet of Medical Things [32], [33]. Prem et al. [34] studied the impact of social distancing on the spread of the COVID-19 outbreak and concluded that early and immediate practice of social distancing could gradually reduce the peak of the virus attack. Although social distancing is crucial for flattening the infection curve, it is an economically costly step.
In related work, a drone camera and the YOLOv3 algorithm help identify social distance and monitor people wearing masks in public from a side or frontal view. Pouw et al. [36] suggested an efficient graph-based monitoring framework for physical distancing and crowd management. The authors of [37] performed human detection in a crowded situation; their model is designed for individuals who do not obey a social distance restriction, i.e., 6 feet of space between them, and uses a mobile robot with an RGB-D camera and a 2-D lidar for collision-free navigation in mass gatherings. From the literature, we conclude that researchers have done a considerable amount of work on monitoring social distance in public environments, but most of it focuses on a frontal or side view camera perspective. Therefore, in this work, we present an overhead view social distance monitoring framework that offers a better field of view and overcomes the issue of occlusion, thereby playing a key role in computing the distance between people.

Researchers typically use a frontal or side perspective for social distance monitoring, as discussed in Section 2. In this work, a deep learning-based social distance monitoring framework using an overhead perspective is introduced. The flow diagram of the framework is shown in Figure 5. The recorded overhead data set is split into training and testing sets, and a deep learning-based detection paradigm is used to detect individuals in the sequences. A variety of object detection models are available, such as [38], [39], [40], [41], [42], and [43]; due to its strong performance on generic object detection, YOLOv3 [22] is used in this work.
The model uses a single-stage network architecture to estimate bounding boxes and class probabilities, and was originally trained on the COCO (Common Objects in Context) data set [44]. For overhead view person detection, transfer learning is implemented to enhance the detection model's efficiency, and a new layer trained on overhead data is added to the existing architecture.

After detection, the bounding box information, mainly the centroid, is used to compute the distance between each pair of detected bounding box centroids using the Euclidean distance. A predefined threshold is then used to check whether the distance between any two bounding box centroids is less than the configured number of pixels. If two people are close to each other and the distance value violates the minimum social distance threshold, the bounding box information is stored in a violation set, as seen in Figure 5, and the color of the bounding box is changed to red. A centroid tracking algorithm is adopted so that people who breach the social distancing threshold are also tracked. At the output, the model displays the total number of social distancing violations along with the detected people's bounding boxes and centroids.

In this work, YOLOv3 is used for human detection as it improves predictive accuracy, particularly for small-scale objects.
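The pairwise-distance check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the 75-pixel threshold is a placeholder, not the paper's calibrated pixel-to-distance setting.

```python
import math
from itertools import combinations

# Placeholder threshold: the paper derives its value from a
# physical-distance-to-pixel approximation for the overhead camera.
MIN_DISTANCE_PX = 75

def find_violations(centroids):
    """Return the indices of detections whose pairwise Euclidean
    distance falls below the social-distance threshold."""
    violations = set()
    for (i, c1), (j, c2) in combinations(enumerate(centroids), 2):
        if math.dist(c1, c2) < MIN_DISTANCE_PX:
            violations.update((i, j))
    return violations

print(find_violations([(0, 0), (50, 0), (400, 400)]))  # {0, 1}
```

Detections whose index lands in the violation set would then be redrawn with red bounding boxes, as the framework does in Figure 5.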
Its main advantage is a network structure adjusted for multi-scale object detection. Furthermore, for object classification it uses multiple independent logistic classifiers rather than a softmax. The model's overall architecture is presented in Figure 6: feature learning is performed using convolutional layers, also called residual blocks, which are made up of many convolutional layers and skip connections. The model's distinctive characteristic is that it performs detection at three separate scales, as depicted in Figure 6. Convolutional layers with a given stride are used to downsample the feature map and pass forward invariant-sized features [22], and three feature maps, as shown in Figure 6, are utilized for object detection.

The architecture shown in Figure 6 is trained on an overhead data set. For that purpose, a transfer learning approach is adopted, which enhances the efficiency of the model: the model is additionally trained without discarding the valuable information in the existing model, and the layer trained on the overhead data set is appended to the existing architecture. In this way, the model takes advantage of both the pre-trained and newly trained layers.

The architecture uses a single-stage network over the entire input image to predict the bounding box and class probability of detected objects. For feature extraction the architecture utilizes convolutional layers, and for class prediction fully connected layers are used. During human detection, as seen in Figure 6, the input frame is divided into an S × S grid of cells, which are associated with bounding box estimation and class probabilities.
Each cell predicts the probability that the center of a person's bounding box lies within it. In Equation 1, Pr(p) indicates whether a person is present in the detected bounding box; its value is 1 for yes and 0 for no. IoU(pred, actual) measures the Intersection over Union of the actual and predicted bounding boxes, defined as [22]:

IoU(pred, actual) = area(BoxT ∩ BoxP) / area(BoxT ∪ BoxP)    (2)

where the ground truth box (actual), manually labeled in the training data set, is represented by BoxT, and the predicted bounding box is denoted BoxP; the numerator is the area of their intersection. An acceptable region is predicted for each detected person in the input frame, and a confidence value is applied after prediction to obtain the optimal bounding box. For each predicted bounding box, x, y, w, h are estimated, where x, y define the bounding box coordinates and w, h its width and height. The model produces the predicted bounding box values shown in Figure 7 and Equation 3 [22]. A threshold value is defined that keeps high confidence values and discards low ones, and non-maximal suppression then derives the final location parameters for the detected bounding box. Finally, a loss function is calculated for the detected bounding box [22]; it is the sum of three terms: regression, classification, and confidence losses.
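The IoU of Equation 2 can be computed directly from two boxes given in (x1, y1, x2, y2) corner format. A minimal sketch, independent of the paper's implementation:

```python
def iou(box_t, box_p):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format.

    box_t: ground-truth box (BoxT), box_p: predicted box (BoxP)."""
    # Corners of the intersection rectangle.
    xa, ya = max(box_t[0], box_p[0]), max(box_t[1], box_p[1])
    xb, yb = min(box_t[2], box_p[2]), min(box_t[3], box_p[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    # Union = sum of areas minus the doubly counted intersection.
    return inter / (area_t + area_p - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```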
At each grid cell, if the object is detected, the classification loss is computed as the squared error of the conditional class probabilities [22]. When an object, i.e., a person, is detected, the confidence loss is calculated as given in Equation 6 [22]:

L_conf = Σ_i Σ_j 1_ij^obj (C_i − C*_i)²    (6)

where C* is the confidence score for the j-th bounding box in grid cell i, and 1_ij^obj equals 1 if the j-th bounding box in cell i is responsible for detecting the object, and 0 otherwise. If the object is not detected, the confidence loss is given as [22]:

L_noobj = λ_noobj Σ_i Σ_j 1_ij^noobj (C_i − C*_i)²    (7)

In Equation 7, 1_ij^noobj is the complement of 1_ij^obj, C* is the bounding box confidence score in cell i, and λ_noobj weights down the loss when detecting background. Because most detected bounding boxes contain no object, a class imbalance arises and the model would otherwise be trained to detect background more often than objects; to counter this, the loss is weighted down by the factor λ_noobj (default: 0.5).

After detecting people in the video frames, the centroid of each detected person's bounding box (shown as green boxes) is used for distance calculation, as shown in Figure 8(b). The detected bounding box coordinates (x, y) are used to compute the box's centroid; Figure 8(c) demonstrates accepting a set of bounding box coordinates and computing the centroid. After computing the centroid, a unique ID is assigned to each detected bounding box.
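The centroid computation and ID assignment described above can be sketched as follows; the helper names are illustrative, not the authors' code, and boxes are assumed to be in (x, y, w, h) top-left format.

```python
def centroid(x, y, w, h):
    """Centre point of a box given by its top-left corner and size."""
    return x + w // 2, y + h // 2

def assign_ids(boxes, start=0):
    """Map a fresh sequential integer ID to each detection's centroid,
    so violators can later be followed by the centroid tracker."""
    return {start + k: centroid(*box) for k, box in enumerate(boxes)}

print(assign_ids([(0, 0, 10, 10), (20, 20, 10, 10)]))
# {0: (5, 5), 1: (25, 25)}
```

A full centroid tracker would additionally match IDs across frames by pairing each new centroid with the nearest existing one; the sketch above covers only the per-frame registration step.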
In the next step, we measure the distance between each pair of detected centroids using the Euclidean distance, recomputed for every subsequent frame in the video stream.

The detailed descriptions of the various experiments carried out in this work are presented in this section. For social distance monitoring, an indoor data set recorded at the Institute of Management Sciences, Hayatabad, Peshawar, Pakistan is used [16], [46], containing video sequences captured from an overhead view. The data is divided into 70% training and 30% testing. There is no restriction on the mobility of persons throughout the scene: people move freely, and their visual appearance is affected by radial distance and camera position. From the example frames it can be observed that the humans' visual appearance is not identical; people's heights, poses, and scales vary across the data set. For implementation, we used OpenCV. The experimental results are divided into two subsections: first, the pre-trained model's testing results are discussed; second, the results of the detection model after applying transfer learning and training on the overhead data set are explained. For comparison, the model is tested on the same video sequences. The performance evaluation of the model is also presented in this section.

In Figure 9, the testing results of the social distance framework using the pre-trained model [22] are visualized. The testing results are evaluated using different video sequences.
The people in the video sequences move freely in the scenes; it can be seen from the sample frames that an individual's visual appearance differs from that in a frontal or side view (Figure 9). A person's size also varies across scene locations, as shown in Figure 9. Since the model only considers the human (person) class, only objects with a human-like appearance are detected by the pre-trained model. The pre-trained model delivers good results and detects person bounding boxes of various sizes, shown as green rectangles in Figure 9. After person detection, the distance between each detected bounding box is measured to check whether any person in the scene violates the social distance. In Figure 9(e) and (h), two people at the center of the scene are marked with red bounding boxes as they breach the social distancing threshold. Some missed detections also occur; these are manually labeled with a yellow cross in the sample frames. A person is effectively detected at several scene locations, but in some cases the person's appearance changes and the model misses the detection. The likely reason is that the pre-trained model was not trained on overhead views, where an individual's changing appearance can mislead the model.

The model is then tested on the same test video sequences discussed in the preceding subsection. The experimental findings reveal that transfer learning significantly improves the detection results, as seen in the sample frames,
and the social distance between people is also computed, as shown in the sample frames of Figure 12. In Figure 12(e) a violation is detected, although the number of people present in the scene is small compared to Figure 12(b), where all people maintain social distance and not a single violation is observed. In Figure 12(d), (e), and (f), violations are recorded by the automated system due to close interactions between people; the same behavior can be found in the other panels. In Figure 12(d), (e), and (f), multiple people walking and entering the scene are detected and monitored. The framework effectively detects breaches of social distance between people and marks the bounding box as a red rectangle when people are too close to each other.

Figure 12: Results of social distance monitoring using transfer learning. The detection performance of the model is improved after transfer learning. In the sample frames, people in green rectangles maintain social distancing, while those in red rectangles breach the social distance.

Different quantitative metrics are used in this work to evaluate the performance of the framework for social distance monitoring using a deep learning model and an overhead perspective. To assess the efficiency of the detection model, Precision, Recall, and Accuracy are used. Furthermore, the findings are compared with other deep learning models.
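These metrics follow directly from the true/false positive and negative counts of the detector. A minimal sketch with illustrative counts (not the paper's actual confusion matrix):

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Precision, recall and accuracy from detection counts.

    tn defaults to 0 because true negatives are usually not counted
    in object detection (there is no fixed set of 'non-detections')."""
    precision = tp / (tp + fp)          # how many detections were correct
    recall = tp / (tp + fn)             # how many people were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Illustrative counts only.
p, r, a = detection_metrics(tp=90, fp=10, fn=10)
print(round(p, 2), round(r, 2), round(a, 2))  # 0.9 0.9 0.82
```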
The Precision and Recall estimates are shown in Figure 13. When the model is additionally trained on the overhead view data set, the overall performance of the detection model improves. The tracking accuracy is given in Figure 14. We also compared the newly trained YOLOv3 with other deep learning models: the true and false detection rates of the different models are reported in Table 1. From the results, it can be seen that transfer learning improves the results significantly for the overhead view data set. The false detection rates of the different deep learning models are very small, about 0.4% to 0.7% without any additional training, which reveals the effectiveness of deep learning models. Different pre-trained object detection models are tested on the overhead data set; although they were trained on frontal data sets, they still show good results, achieving an accuracy of 90%.

In this work, a deep learning-based social distance monitoring framework using an overhead perspective is presented. The pre-trained YOLOv3 paradigm is used for human detection. As a person's appearance, visibility, scale, size, shape, and pose vary significantly in an overhead view, a transfer learning method is adopted to improve the pre-trained model's performance.
The model is trained on an overhead data set, and the newly trained layer is appended to the existing model. To the best of our knowledge, this work is the first attempt to utilize transfer learning in a deep learning-based detection paradigm for overhead perspective social distance monitoring. The detection model yields bounding box information, including centroid coordinates. Using the Euclidean distance, the pairwise centroid distances between detected bounding boxes are measured. To check social distance violations between people, an approximation of physical distance to pixels is used and a threshold is defined; a violation threshold checks whether the distance value falls below the minimum social distance. Furthermore, a centroid tracking algorithm is used for tracking people in the scene. Experimental results indicate that the framework efficiently identifies people walking too close together and violating social distancing, and that the transfer learning methodology increases the detection model's performance.
Declaration of Interest Statement

Title: A Deep Learning-Based Social Distance Monitoring framework for COVID-19

This work is partially supported by FCT/MCTES through national funds and, when applicable, co-funded EU funds under the project UIDB/50008/2020, and by the Brazilian National Council for Scientific and Technological Development.