title: Monitoring social distancing through human detection for preventing/reducing COVID spread
authors: Ansari, Mohd. Aquib; Singh, Dushyant Kumar
date: 2021-04-14
journal: Int J Inf Technol
DOI: 10.1007/s41870-021-00658-2

COVID-19 is a severe epidemic that has put the world in a global crisis. Over 42 million people are infected, and 1.14 million deaths are reported worldwide as of Oct 23, 2020. A deeper understanding of the epidemic suggests that a single person's negligence can cause widespread harm that would be difficult to negate. Since no vaccine has yet been developed, social distancing must be practiced to contain the spread of COVID-19. Therefore, we aim to develop a framework that tracks humans to monitor whether social distancing is being practiced. To accomplish this objective, an algorithm is developed using an object detection method. Here, a CNN-based object detector is explored to detect human presence. The object detector's output is used to calculate the distance between each pair of detected humans. The social distancing algorithm then red-marks the persons who come closer to each other than the permissible limit. Experimental results show that CNN-based object detectors combined with our proposed social distancing algorithm exhibit promising outcomes for monitoring social distancing in public areas.

COVID-19 (coronavirus) is an infectious disease declared an epidemic by the World Health Organization (WHO). It was first reported in Wuhan, China, in late 2019. As of Oct 23, 2020, 217 countries and regions around the world are affected by COVID-19, with approximately 42 million confirmed cases and 1.14 million deaths reported. Figure 1 illustrates the total number of cases and the total number of deaths from Jan 22, 2020, to Oct 23, 2020 [1]. According to the World Health Organization [2], a person can become infected with COVID-19 by coming in contact with virus-infected persons. To date, no medicine or vaccine has been developed to quell this deadly virus. Therefore, there is a need to look for alternative control measures to prevent its spread. As it is well said that prevention is better than cure, WHO has suggested several safety measures to minimize the transmission of the coronavirus. In the present scenario, social distancing [3, 4] has proved to be one of the most effective alternative measures to stop the spread. Social distancing can also be referred to as "physical distancing," which means maintaining a distance between yourself and the people around you. Social distancing helps to lessen physical contact or interaction between possibly COVID-19-infected persons and healthy individuals. According to WHO's standard prescription, everyone should keep a distance of at least 6 feet from each other to follow social distancing. This is a prominent way to break the chain of contagion, and all the affected countries have therefore adopted social distancing. Monitoring social distancing in real-time scenarios is a challenging task. It can be done in two ways: manually and automatically. The manual method requires many physical eyes to watch whether every individual is strictly following the social distancing norms. This is an arduous process, as no one can keep watch continuously, 24 × 7. Automated surveillance systems [5, 6] replace many physical eyes with CCTV cameras.
CCTV cameras produce video footage, and an automated surveillance system inspects this footage. The system raises alerts when any suspicious event occurs, and in view of these alerts, security personnel can take relevant actions. The automated monitoring system therefore overcomes several limitations of the manual monitoring method.

This research aims to limit the impact of the coronavirus epidemic with minimal economic harm. In this paper, we propose an effective automatic surveillance system that locates each person and monitors them for the social distancing parameter. This application is suitable for both indoor and outdoor surveillance scenarios and can be used in various places like railway stations, airports, megastores, malls, streets, etc. The proposed approach can be seen as a combination of two main tasks: (i) human detection and tracking, and (ii) monitoring of social distancing among humans. In the first task, this research addresses the problem of human detection and tracking [6, 9, 12] in the surveillance video. Human detection is a two-stage process that involves the localization of an object in the first stage and classification of the localized object in the second stage. This paper presents a human detection technique based on visual-specific learning through deep neural networks in the video feed. The second task focuses on calculating the distance among humans in public areas using our proposed algorithm. A decision is made on whether social distancing is followed; if not, the persons who do not follow the social distancing criteria are highlighted with a red rectangle. On seeing this, security personnel can take action so that the social distancing rules are followed strictly.

This paper is structured into five sections. Section one describes the motivation and introductory knowledge of social distancing. Section two provides a broad study of traditional and recent approaches to human detection. Section three focuses on deep learning based human detection models. The experimentation and its detailed analysis are presented in section four. At last, the conclusion followed by the future scope is described in section five.

In 2001, a very popular approach for object detection was proposed by Viola and Jones [10]. They used Haar features for feature extraction and cascade classifiers with the AdaBoost learning algorithm for classification; this method is 15 times faster than traditional approaches. Fu-Chun Hsu et al. [11] proposed a hybrid approach to detect the head and shoulders by fusing motion and visual characteristics. The authors found that the Histogram of Oriented Optical Flow (HOOF) descriptor is a better choice for segmenting the moving object in video sequences and can handle cluttered and occluded environments efficiently. Vijay and Shashikant [13] proposed real-time pedestrian detection for advanced driver assistance. Their system detects pedestrians using Edgelet features to improve accuracy and a classifier based on the k-means clustering algorithm to reduce system complexity. Suman Kumar Choudhury et al. [14] proposed an advanced pedestrian detection system incorporating background subtraction to extract moving objects, Silhouette Orientation Histograms and Golden-Ratio-Based Partitioning to extract meaningful information from the moving objects, and HIKSVM for object classification.
This system can deal with occlusion efficiently and achieved accuracy of up to 98.36%. Seemanthini and Manjunath [15] deployed a human detection technique for an action recognition system. Singh et al. [17] proposed a human detection framework for extensive surveillance in the city through CCTV cameras; they used the background subtraction technique to segment moving objects, the HOG descriptor to extract features, and SVM for object classification.

Earlier object detection frameworks implemented the sliding window concept [18] for object localization within an image. According to this approach, an image is divided into blocks or regions of a particular size, and these blocks are then categorized into their respective classes. Various handcrafted feature extraction techniques like HOG [8], SIFT [19], LBP [29], etc. are used to evaluate the attributes or features. Furthermore, these attributes are used to build the classifier that locates the object on the image's grid. However, this grid-based paradigm requires a high computational cost and sometimes yields high false-positive rates. Therefore, an effective object classification and localization framework is needed to detect several objects of diverse scales within an image while reducing the computational cost and false-positive rate. Recently, significant advances have been observed in object detection using deep convolutional neural networks (CNN) [20, 22-24]. Convolutional neural networks are a class of deep, feed-forward artificial neural networks that perform accurately in computer vision tasks such as image classification and detection. A CNN is capable of extracting robust features with the help of the convolution process, and its strong attribute representation capability has played a vast role in object detection [7, 21]. Aichun, Tian, and Qiao [16] proposed a deep hierarchical model for multiple human upper body detection. This model employs a candidate-region convolutional neural network (CR-CNN) with multiple convolutional features to accommodate the local as well as contextual information from the image and has achieved accuracy of up to 86%.

The research presented in the literature illustrates that object detection plays a vital role in computer vision due to its many practical use cases, e.g., face detection, pedestrian detection, activity recognition, medical imaging, etc. This paper extends the role of object detection to reduce the rapid spread of COVID-19. Therefore, we aim to develop an application for analyzing social distancing among persons using an efficient object detector.

3 Proposed state of the art framework for monitoring social distancing

The overall scenario of monitoring social distancing in public, as proposed, is presented in Fig. 2. CCTV cameras available at any public place can be used for surveillance, i.e., monitoring social distancing. Video streams/frame sequences received from these cameras are fed to the object detection and tracking module for locating human presence in the scene. Parameters like the 'centroid' of each object/person location and the 'distance' among many such centroids are evaluated for measuring the degree of social distancing practiced. An alert is generated by changing the color of the bounding box of the humans detected from green to red. The color of the bounding box is green as long as there is a permissible distance between any two persons; as this distance decreases below the permissible limit, the color of the bounding boxes changes to red, which indicates a social distancing violation.

Sliding-window-based region proposal is a simple and straightforward approach to designing an object detector. According to this approach, the image or frame is divided into blocks or regions of a particular size, and these blocks are categorized into their respective classes. The categorization of blocks can be done by different machine learning and deep learning paradigms. It is also possible that a region contains only part of an object, which introduces many bounding boxes around the same object. To deal with this problem, the Non-Maximum Suppression (NMS) [26] algorithm is used to locate the object correctly within an image; it suppresses the weaker, overlapping bounding boxes and keeps only the best one. This paper exercises a deep learning based technique to detect the presence of humans with the help of the sliding-window-based region proposal algorithm. The proposed technique is quite helpful for object detection and localization, and is described in Sect. 3.1. Furthermore, these techniques are employed in the social distancing algorithm to see whether people are following the distancing criterion; the social distancing algorithm is described in Sect. 3.2.
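To make the sliding-window region proposal and NMS steps concrete, a minimal Python sketch is given below. The window size and step size follow the settings reported later in the experiments, while the overlap threshold and the absence of an image pyramid are simplifying assumptions; this is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def sliding_windows(image, win_size=(64, 128), step=(10, 10)):
        """Yield (x, y, crop) for every window position over the image."""
        win_w, win_h = win_size
        step_x, step_y = step
        h, w = image.shape[:2]
        for y in range(0, h - win_h + 1, step_y):
            for x in range(0, w - win_w + 1, step_x):
                yield x, y, image[y:y + win_h, x:x + win_w]

    def non_max_suppression(boxes, scores, overlap_thresh=0.3):
        """Keep the highest-scoring boxes and drop boxes that overlap them too much.
        boxes: list of [x1, y1, x2, y2]; scores: classifier confidences."""
        if len(boxes) == 0:
            return []
        boxes = np.asarray(boxes, dtype=float)
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1 + 1) * (y2 - y1 + 1)
        order = np.argsort(scores)[::-1]  # best score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # overlap of the best remaining box with all the others
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
            overlap = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][overlap <= overlap_thresh]
        return keep

In the full detector, each crop yielded by sliding_windows would be scored by the CNN classifier described in Sect. 3.1, and the accumulated boxes and scores would then be filtered by non_max_suppression to obtain one box per person.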
Convolutional neural networks (CNN) [7, 25] have drawn much attention from the research community and can be successfully embedded in a broader image classification paradigm. A CNN takes an image as input, assigns significance to different objects within the image based on trainable weights and biases, and effectively differentiates each object. This paper introduces two CNN based sequential models to detect the presence of an individual within an image. A general overview of these proposed models is shown in Table 1. These models consist of convolutional layers, pooling layers, a flatten layer, fully connected layers 1 and 2, and an output layer. The only difference between the two models is that Model 1 consists of two convolutional layers with two pooling layers, while Model 2 consists of three convolutional layers with three pooling layers. Due to this variation, Model 1 produces approximately 10,402,993 trainable parameters, whereas Model 2 produces approximately 2,861,297 trainable parameters.

Figure 3 shows the graphical structure of Model 2, which takes a color image of size 128 × 64 × 3 as input and produces its predicted value as output. It has three convolutional layers, three pooling layers, two fully connected layers, and one output layer. The first convolutional layer involves 32 convolution filters, each of size 3 × 3, while the second and third convolutional layers involve 48 and 64 convolution filters, respectively. The convolutional layers use a stride of (1, 1). The pooling layers use a (2, 2) pool size to reduce the size of the image. Two fully connected (FC) layers of size 512 and 128, respectively, are used to train the network. The output layer has a single neuron that returns a True or False value. The 'ReLU' activation function is used in the convolutional layers and fully connected layers, while the 'Sigmoid' function is used in the output layer, yielding an output that can be interpreted as a probability. A dropout rate of 30% is used in the first FC layer to overcome the overfitting problem.
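A Keras-style sketch of Model 2, reconstructed from the description above, is shown below. The layer sizes reproduce the reported figure of approximately 2,861,297 trainable parameters; the optimizer and loss in the compile call are illustrative assumptions, since the paper tunes these per model.

    from tensorflow.keras import Sequential, layers

    # Model 2: three convolution + pooling stages, two fully connected layers,
    # and a single sigmoid output neuron (human / not human).
    model2 = Sequential([
        layers.Conv2D(32, (3, 3), strides=(1, 1), activation='relu',
                      input_shape=(128, 64, 3)),     # 128 x 64 x 3 color input
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(48, (3, 3), strides=(1, 1), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, (3, 3), strides=(1, 1), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),                          # 30% dropout on the first FC layer
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])
    # Optimizer and loss are placeholders; the paper tunes them per model.
    model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model2.summary()  # about 2.86 million trainable parameters

Model 1 would follow the same pattern with only two convolution and pooling stages, which is why its flattened feature vector, and hence its first fully connected layer, is much larger.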
The social distancing monitoring algorithm is the second phase of our proposed framework. It carries out two main functions.

Function 1 finds the locations of the objects in an image. It uses the human detection technique and provides the human locations in the form of coordinate values X_A (left), Y_A (top), X_B (right), and Y_B (bottom). From these coordinate values, the centroids of the different objects are identified. The centroid of an object is evaluated as shown in Eqs. 1 and 2:

X = (X_A + X_B) / 2    (1)
Y = (Y_A + Y_B) / 2    (2)

where X_A, Y_A, X_B, and Y_B are the coordinate values (left, top, right, bottom) of an object, and X and Y are its centroid coordinates. These parameters are then passed to the next function to measure social distancing.

Function 2 finds the distance between two objects using the Euclidean distance [27], which decides the closeness between them, as shown in Eq. 3:

D = sqrt((X_1 - X_2)^2 + (Y_1 - Y_2)^2)    (3)

where (X_1, Y_1) and (X_2, Y_2) are the centroid coordinates of the two objects. The decision is made by comparing this distance with a pre-defined threshold value. If the Euclidean distance is less than the threshold, it is assumed that the two objects are not obeying the social distancing criterion, i.e., they have not kept enough distance between them. Breaching this criterion could allow the coronavirus to spread, so an alert is generated for security personnel by drawing a red rectangle around the objects; an observer can then take appropriate action or ask the persons concerned to maintain social distance.
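A minimal Python sketch of these two functions is given below. It assumes the detections arrive as (X_A, Y_A, X_B, Y_B) bounding boxes and uses an illustrative pixel threshold (min_distance) that would have to be calibrated to the camera view; it is a sketch of the algorithm described above, not the authors' exact code.

    import math
    import cv2

    def centroid(box):
        """Eqs. 1 and 2: centroid of an (X_A, Y_A, X_B, Y_B) bounding box."""
        xa, ya, xb, yb = box
        return (xa + xb) / 2.0, (ya + yb) / 2.0

    def monitor_social_distancing(frame, boxes, min_distance=100.0):
        """Draw green boxes around compliant persons and red boxes around violators.
        min_distance is a placeholder pixel threshold, to be calibrated per camera."""
        centroids = [centroid(b) for b in boxes]
        violators = set()
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                (x1, y1), (x2, y2) = centroids[i], centroids[j]
                # Eq. 3: Euclidean distance between the two centroids
                distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
                if distance < min_distance:
                    violators.update([i, j])
        for idx, (xa, ya, xb, yb) in enumerate(boxes):
            color = (0, 0, 255) if idx in violators else (0, 255, 0)  # BGR: red / green
            cv2.rectangle(frame, (int(xa), int(ya)), (int(xb), int(yb)), color, 2)
        return frame, violators

In the complete pipeline, boxes would be the NMS-filtered detections produced by the CNN detector, and the returned frame would be displayed or logged so that security personnel can act on the highlighted violations.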
In this paper, CNN based techniques have been developed to detect the presence of humans, and social distancing monitoring is then performed using these techniques. All the experiments have been performed in Python on a 64-bit system with an Intel Core i3-5005 CPU @ 2.00 GHz and on Google Colab. We used the INRIA image dataset [28] for training purposes. It consists of a total of 6562 images, of which 4146 images are negative and 2416 images are positive. We split the image dataset into training and testing sets, in which 2316 positive and 4046 negative images are used for training, and 100 positive and 100 negative images are used for testing. This dataset contains static images and incorporates variations in humans at 64 × 128 resolution. In testing with real-time video sequences for the sliding-window-based modules, the minimum window size is (64, 128), the step size is (10, 10), and the downscale factor is 1.25. This processes approximately 567 windows, each of size 64 × 128, for an image of size 264 × 400 × 3.

The proposed technique adopts the CNN architecture for human detection: it uses the sliding window concept for region proposal and a ConvNet for classification. As part of the experimentation for deriving an optimized model, two different models, namely Model 1 and Model 2, have been proposed. These models are hyper-tuned with different parameters such as batch size, dropout rate, activation function, optimizer, and number of epochs. Table 2 illustrates the hyper-parameter tuning for different variants of these models, and Table 3 presents the outcomes of training and testing them over these hyper-parameters. Model 1 is hyper-tuned with a batch size of 8, a 30% dropout rate, the 'ReLU' activation function for the convolutional and FC layers, the 'Sigmoid' activation for the output layer, the 'Adam' optimizer, and 120 epochs; it yields 97% testing accuracy. The structure of Model 2 is hyper-tuned in the same way as Model 1, except that the two models have different structures, dropout rates, and optimizer settings. Model 2 yields 98.50% testing accuracy.

On performing the experiments, it is observed that training and testing the CNN based models is highly expensive on our system (i3 processor with 4 GB RAM). Conversely, they run smoothly on the Google Colab platform (in a GPU environment) with a much lower time cost. Table 4 shows the overall training and testing time comparison between our system and Google Colab; here, an image of size 264 × 400 × 3 is used to evaluate the testing time of the proposed model. Figure 4 illustrates the accuracy and loss curves over 120 epochs for Model 1 and Model 2. On analyzing both models, we find that Model 2 provides more encouraging results than Model 1, offering higher accuracy and a lower loss value. Table 5 shows a comparative analysis of our proposed models with existing human detectors. Both models provide excellent results, but Model 2 achieves the highest accuracy among all and proves to be the most efficient human detection technique.

Appropriate positioning and placement of the camera for receiving the video stream in a physical real-time system is a challenging task. In the context of the experiments, it is seen that if the camera is placed near the objects/humans, the objects appear bigger, and if the camera is placed away from the objects, the object size reduces in the captured images. This creates a problem in acquiring relevant features for object/human detection. Therefore, the camera location is adjusted based on practical calibration with our algorithm in view. Figure 5 exhibits some resulting images of social distancing monitoring, showing the raw detections before applying NMS and the final detection results after applying NMS.

This article suggests deep learning based human detection techniques to monitor social distancing in a real-time environment. These techniques have been developed with the help of a deep convolutional network that uses the sliding window concept for region proposal. Further, they are used with the social distancing algorithm to measure the distancing criterion among people; this evaluated criterion decides whether two people are following social distancing norms or not. Extensive experiments were performed with CNN based object detectors. It is found that CNN based object detection models are better in accuracy than others, although they sometimes produce false-positive instances when dealing with real-time video sequences. In the future, modern object detectors like RCNN, Faster RCNN, SSD, RFCN, YOLO, etc. may be deployed with a self-created dataset to increase detection accuracy and reduce false-positive instances. Additionally, a single viewpoint obtained from a single camera cannot reflect the results effectively, so the proposed algorithm may be extended to multiple camera views in the future to obtain more accurate results.
References

COVID-19 coronavirus pandemic
World Health Organization (2020) Coronavirus disease (COVID-19) advice for the public
Considerations relating to social distancing measures in response to the COVID-19 epidemic. European Centre for Disease Prevention and Control
Review of Coronavirus Disease-2019 (COVID-19)
Tracking Movements of Humans in a Real-Time Surveillance Scene
Pedestrian orientation estimation using CNN and depth camera
A survey of deep learning-based object detection
Human action recognition in video
An adaptive hybrid GMM for multiple human detection in crowd scenario
Robust real-time object detection using a boosted cascade of simple features
Human head detection using histograms of oriented optical flow in low quality videos with occlusion
Real-time moving human detection using HOG and Fourier descriptor based on CUDA implementation
Vision based pedestrian detection for advanced driver assistance
Improved pedestrian detection using motion segmentation and silhouette orientation
Human detection and tracking using HOG for action recognition
Multiple human upper bodies detection via candidate-region convolutional neural network
Human crowd detection for city wide surveillance
Human detection and tracking for video surveillance: A cognitive science approach
SIFT and tensor based object detection and classification in videos using deep neural networks
Classification of mammogram images using multiscale all convolutional neural network (MA-CNN)
Recent progresses on object detection: a brief review
Tracking people by detection using CNN features
Review of deep learning techniques for object detection and classification
Image annotation using deep learning: A review
Deep cascade learning
Daedalus: breaking non-maximum suppression in object detection via adversarial examples
An enhanced CBIR using HSV quantization, discrete wavelet transform and edge histogram descriptor
A hybrid approach for image retrieval using visual descriptors