key: cord-0538563-c3pn7904
authors: Darapaneni, Narayana; Kumar, Shrawan; Krishnan, Selvarangan; Hemalatha, K; Rajagopal, Arunkumar; Nagendra,; Paduri, Anwesh Reddy
title: Implementing a Real-Time, YOLOv5 based Social Distancing Measuring System for Covid-19
date: 2022-04-07
journal: nan
DOI: nan
sha: 46ab5c2aeb3b59e80406c5632581e828fb5e2fbe
doc_id: 538563
cord_uid: c3pn7904

The purpose of this work is to provide a YOLOv5 deep learning-based social distance monitoring framework using an overhead view perspective. In addition, we have developed a custom model, YOLOv5 modified CSP (Cross Stage Partial Network), and assessed its performance on the COCO and Visdrone datasets with and without transfer learning. Our findings show that the developed model successfully identifies individuals who violate social distancing. On the COCO dataset, the modified bottleneck CSP model without transfer learning reaches an accuracy of 81.7% after 300 epochs of training, whereas the default YOLOv5 model with transfer learning attains 80.1% for the same number of epochs. This shows an improvement in accuracy by our modified bottleneck CSP model. On the Visdrone dataset, the default YOLOv5s model with transfer learning achieves an accuracy of up to 56.5% for certain classes, and about 40% for people and pedestrians, after 30 epochs. The modified bottleneck CSP performs slightly better than the default model, with an accuracy of up to 58.1% for certain classes and about 40.4% for people and pedestrians.

The coronavirus disease was first reported in December 2019 in China. The virus soon caused a global outbreak, and the World Health Organization (WHO) declared the situation a pandemic [1]. Data published by WHO on 4 November 2021 confirm 247.96 million infections and 5,020,204 deaths globally. On 8 July 2020, the WHO announced: "There is emerging evidence that COVID-19 is an airborne disease that can be spread by tiny particles suspended in the air after people talk or breathe, especially in crowded, closed environments or poorly ventilated settings." According to WHO, individuals must keep a distance of at least 6 feet from one another to maintain adequate social distancing.

The main challenges are attaining a high level of accuracy, coping with varied lighting conditions and occlusion, and achieving real-time performance. The key goals of this work are as follows:

• To present a YOLOv5 deep learning-based social distance monitoring tool using an overhead view perspective.
• To deploy pre-trained YOLOv5 for person detection and for computing the bounding box centroids. In addition, a transfer learning method is applied to improve the performance of the model trained on the COCO and Visdrone datasets.
• To assess the performance of a custom model, "YOLOv5 modified CSP (Cross Stage Partial Network)", on the COCO and Visdrone datasets without transfer learning.
• To track the social distance between individuals by using the Euclidean distance to approximate the distance between each pair of detected bounding box centroids. In addition, a social distance violation threshold is specified using a pixel-to-distance estimation.
• To assess the performance of the social distancing model on an overhead dataset.
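The centroid-distance goal above can be illustrated with a minimal sketch. It assumes boxes are given as (x1, y1, x2, y2) pixel coordinates and that a single scale factor converts pixels to feet; the function names, the `pixels_per_foot` parameter, and the sample boxes are hypothetical and are not taken from our implementation.

```python
import math
from itertools import combinations


def centroid(box):
    """Return the (x, y) centre of a box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def find_violations(boxes, pixels_per_foot, min_feet=6.0):
    """Return index pairs of boxes whose centroids are closer than min_feet.

    pixels_per_foot is an assumed, scene-specific calibration constant.
    """
    threshold_px = min_feet * pixels_per_foot
    centres = [centroid(b) for b in boxes]
    violations = []
    for i, j in combinations(range(len(centres)), 2):
        # Euclidean distance between the two centroids, in pixels.
        if math.dist(centres[i], centres[j]) < threshold_px:
            violations.append((i, j))
    return violations


# Three hypothetical detections: the first two stand close together.
boxes = [(0, 0, 20, 40), (30, 0, 50, 40), (300, 0, 320, 40)]
violating_pairs = find_violations(boxes, pixels_per_foot=10.0)
print(violating_pairs)
```

A single scale factor is only a rough approximation; it is most plausible for an overhead camera, where perspective distortion across the frame is small.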

The rest of this paper is structured as follows. Related work is presented in Section 2. The salient features of the datasets used for training and validation are presented in Section 3. A deep learning-based social distance monitoring framework is presented in Section 4. A detailed analysis of the results, and a performance evaluation of the model with and without transfer learning, is given in Section 5. The future scope and challenges are given in Section 6. The conclusion of this work, with potential future plans, is provided in Section 7.

Convolutional Neural Networks have played a crucial role in complex object classification and feature extraction, including human detection. Over the past few decades, convolutional neural networks (CNN), region-based CNN (RCNN), and faster region-based CNN (Faster RCNN) have used region proposal techniques to produce an objectness score prior to classification, and then drawn bounding boxes around the objects of interest for visualization and statistical analysis. With the development of GPUs, faster CPUs, and extended memory capacities, researchers were able to build CNN systems with higher accuracy and faster detection than conventional models. Although these methods are effective, they still suffer from limited detection speed and long training times, and there is room to improve their accuracy. Whereas all of these CNN-based models treat detection as a classification problem, YOLO (You Only Look Once) takes a different approach and uses a regression-based method to spatially separate the bounding boxes and predict their class probabilities. In YOLO, the framework splits the image into multiple portions, each associated with bounding boxes and class probability scores that determine whether the portion is considered an object. YOLO provides significant improvements in speed at the cost of some detection accuracy, and its detector module exhibits powerful generalization capabilities because it reasons over the whole image.

After an object is detected, classification methods can be used to identify a human on the basis of motion-based features, shape, or texture. In shape-based methods, shape-related information about moving regions is used to detect the human. Because of limitations in standard template matching schemes, this method performs poorly on its own; its performance is improved by incorporating a part-based template matching approach. Dalal et al. [51] suggested texture-based schemes such as histograms of oriented gradients, which use higher-dimensional features along with a support vector machine to detect humans.

Human identification in images or video sequences is an important subbranch of computer vision and object detection. Although many researchers have worked on human action recognition and human detection, this work is mostly either limited to indoor applications or suffers from accuracy issues under challenging outdoor conditions, which include but are not limited to lighting. Other research works employ manual tuning to classify or identify people's activities; however, limited functionality has been an issue.

Recent research shows that gait and face recognition techniques can be used for further identification of humans in surveillance video. However, tracking and detecting people, especially in a crowd, is difficult at times because of full or partial occlusion. Leibe et al. [17] proposed a solution based on trajectory estimation, while Andriluka et al. [50] proposed tracklet-based detectors for detecting partially occluded people.

With many social applications, crowd counting has emerged as a key area of research. Eshel et al. [13] focused on person counting and crowd detection by suggesting multiple height homographies for head-top detection, and overcame the occlusion problem in video surveillance applications. Chen et al. [52] developed an application for electronic advertising using the concept of crowd counting. For a similar application, Chih-Wen et al. [53] proposed a vision-based people counting model. Following in these footsteps, Yao et al. [54] captured inputs from stationary cameras and carried out background subtraction in order to train a model of the foreground shape and appearance of the crowd in videos.

Rahim et al. [55] proposed a framework that uses the YOLOv4 model for real-time object detection. They also added a social distance measuring approach to their YOLOv4 framework to specify the risk factor based on the calculated distance and on safety distance violations. Their setup uses a single motionless time-of-flight (ToF) camera to capture the video sequence under various lighting conditions.

Mahdi Rezaei et al. [56] proposed a framework that uses a YOLOv4-based Deep Neural Network (DNN) model to automate human detection in crowded places, in both indoor and outdoor environments, using common CCTV cameras. They combined the DNN model with an adapted inverse perspective mapping (IPM) technique and the SORT tracking algorithm to detect people and monitor social distancing.

Imran Ahmed et al. [57] proposed a deep learning platform based on the YOLOv3 model that detects humans and tracks their social distance from an overhead perspective. The detection algorithm uses a pre-trained model that is connected to an extra layer trained on an overhead dataset. The detection model detects humans using bounding box information. The Euclidean distances between the bounding box centroids are used as pairwise distances between humans in order to evaluate social distance violations.

Sergio Saponara et al. [58] proposed a deep learning model using YOLOv2 to detect and track people in indoor and outdoor scenarios. The proposed approach relies on images acquired through thermal cameras to establish a complete AI system for people tracking, social distance classification, and body temperature monitoring.

A few other research works propose new loss functions in order to address the problem of crowded detections. For example, Occlusion-aware R-CNN suggests an aggregation loss that forces proposals to lie as close as possible to their corresponding objects and reduces the internal distances between proposals related to the same object. Repulsion Loss adds an extra penalty to proposals that overlap several ground truths. Furthermore, advanced NMS (non-maximum suppression) strategies have been put forward to reduce crowdedness issues to some degree, but they still use IoU (intersection over union) as the metric for the difference between detected objects, which limits their performance in recognizing highly overlapped instances in crowded scenes.
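The IoU limitation described above can be seen in a minimal sketch of standard greedy NMS. The boxes, scores, and the 0.5 threshold are hypothetical values chosen for illustration; this is not the NMS code used inside any particular detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box whose IoU with an
    already kept box exceeds iou_threshold. Returns the kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Boxes 0 and 1 stand for two different people standing very close together.
boxes = [(0, 0, 10, 20), (2, 0, 12, 20), (50, 0, 60, 20)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

Here the second person is suppressed even though the detection is genuine, because IoU alone cannot distinguish a duplicate box from a heavily overlapped neighbour.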

The AISKYEYE team collected the VisDrone2019 dataset at the Lab of Machine Learning and Data Mining, Tianjin University, China.

The Visdrone dataset contains 288 video clips made up of 261,908 frames and 10,209 images. 

The images and videos have been captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrian, vehicles, bicycles, etc.), and density (sparse and crowded scenes). 

In object detection tasks, we focus on ten object categories of interest including pedestrian, person, car, van, bicycle, awning-tricycle, bus, truck, motor, and tricycle.

COCO, short for Common Objects in Context, is a large image recognition/classification, object detection, segmentation, and captioning dataset.

The COCO Dataset has 121,408 images 

The COCO Dataset has 883,331 object annotations 

The COCO Dataset has 80 classes 

The COCO Dataset median image ratio is 640 x 480

Our team has collected and curated over a hundred videos in order to test the social distancing output indicators produced with YOLOv5.

This study aims to implement the first working YOLOv5-based social distancing model using YOLOv5s, YOLOv5s6, and YOLOv5s6 modified bottleneck CSP trained on the COCO dataset. In addition, we have modified the existing YOLOv5 architecture and produced a new architecture named YOLOv5s6 modified backbone CSP.

We have also trained and validated the models, and evaluated their detection performance, on two datasets, namely COCO and Visdrone.

Architecture of YOLOv5:

Anchor boxes: a set of predefined bounding boxes with a certain height and width. These boxes are created to capture the scale and aspect ratio of the object classes to be detected, and are chosen based on the object sizes in the training dataset.

Backbone (feature extraction): a deep neural network (DNN) composed of convolution layers. The backbone is used to extract the essential features, so the selection of the backbone is critical to the performance of object detection. Usually, pre-trained neural networks are used to train the backbone.

In the architecture, SPP is Spatial Pyramid Pooling, Conv is a convolution layer, Concat is the concatenate function, and PANet is a Path Aggregation Network.

For multi-object detection, YOLO works well especially when each object is associated with one grid cell. Anchor boxes help when objects overlap, because they make it possible to detect multiple objects in one grid cell.
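The role of anchor boxes in a shared grid cell can be sketched as follows. The image size, grid size, anchor shapes, and object boxes are all hypothetical, and the shape-only IoU matching rule is chosen purely for illustration; it is not necessarily the matching rule that YOLOv5 applies during training.

```python
def grid_cell(cx, cy, img_size, grid_size):
    """Map a box centre in pixels to its (row, col) cell on a square grid."""
    cell = img_size / grid_size
    return (int(cy // cell), int(cx // cell))


def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by width and height only (centres aligned)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(wh, anchors):
    """Index of the anchor whose shape matches the box best."""
    return max(range(len(anchors)), key=lambda i: shape_iou(wh, anchors[i]))


# Hypothetical anchors: index 0 is tall and narrow, index 1 is short and wide.
anchors = [(30, 90), (90, 30)]

# Two overlapping objects given as (cx, cy, width, height) in pixels.
objects = {"person": (100, 100, 28, 85), "bicycle": (110, 105, 80, 35)}

assignments = {
    name: (grid_cell(cx, cy, img_size=640, grid_size=20), best_anchor((w, h), anchors))
    for name, (cx, cy, w, h) in objects.items()
}
print(assignments)
```

Both centres fall into the same grid cell, but the two objects are matched to different anchors, so the cell can predict both of them.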

To the best of our knowledge, no YOLOv5 social distancing algorithm has been successfully implemented so far, and none uses YOLOv5s, YOLOv5s6, and YOLOv5s6 modified bottleneck CSP on the COCO dataset. This is a major gap that we wanted to address in this paper. To do so, we took the YOLOv5-based model and incorporated a detailed social distancing algorithm that captures three categories: "Low Risk", "Medium Risk", and "High Risk".
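A three-level risk assignment of this kind can be sketched as below. The two pixel thresholds and the nearest-neighbour rule are assumptions made for illustration only; they are not the parameter values or the exact rule used in our model.

```python
import math

# Hypothetical thresholds in pixels, chosen only for this example.
HIGH_RISK_PX = 50.0
MEDIUM_RISK_PX = 100.0


def risk_level(distance_px, high_px=HIGH_RISK_PX, medium_px=MEDIUM_RISK_PX):
    """Map a centroid distance in pixels to one of the three risk labels."""
    if distance_px < high_px:
        return "High Risk"
    if distance_px < medium_px:
        return "Medium Risk"
    return "Low Risk"


def person_risks(centroids):
    """Label each person by the distance to their nearest neighbour."""
    risks = []
    for i, c in enumerate(centroids):
        others = [math.dist(c, o) for j, o in enumerate(centroids) if j != i]
        # A person alone in the frame has no neighbour and is low risk.
        nearest = min(others) if others else math.inf
        risks.append(risk_level(nearest))
    return risks


centroids = [(0, 0), (30, 0), (110, 0), (400, 0)]
labels = person_risks(centroids)
print(labels)
```

In a deployed system the pixel thresholds would be derived from a pixel-to-distance calibration of the camera rather than fixed constants.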

There is an existing implementation using YOLOv5 on GitHub; however, that model is incomplete and does not support additional architectures, which is why it has not been published. We used it as a reference and built a YOLOv5 social distancing model that works with the multiple architectures used in the YOLOv5 model as well as with our custom architecture (modified backbone CSP). By doing this, we have successfully implemented the first working social distancing model based on YOLOv5 and on other supporting architectures based on YOLOv5.

The benchmark comparisons reported in the previous sections also apply to this section, since the two sections cover the same material.

Our key objective is to implement a working YOLOv5-based social distancing model using the YOLOv5s, YOLOv5s6, and YOLOv5s6 modified bottleneck CSP architectures. To our knowledge, there is no existing working social distancing model based on these YOLOv5 architectures. The screenshots below show sample output of the model, which indicates the risk category based on the centroid distance calculations.

Parameters for High, Medium, and Low risk: 

On the Visdrone dataset, the default YOLOv5s model with transfer learning achieves an accuracy of up to 56.5% for certain classes, and about 40% for people and pedestrians, after 30 epochs. The modified bottleneck CSP performs slightly better than the default model, with an accuracy of up to 58.5% for certain classes and about 42.1% for people and pedestrians. This indicates that the custom model performs better than the default model on the Visdrone dataset.

All of the benchmarking tests and comparisons were conducted on the same hardware and software. 

Fig. 11. Sample outputs of our custom model trained and evaluated on the Visdrone dataset.

We have learned to train a model successfully with and without predefined weights, and on different datasets, and were therefore able to evaluate performance both within a model and across models, for object detection as well as for social distancing. Social distancing is important for managing and reducing the continued spread of COVID-19, which can cause the healthcare system to collapse under a high number of patients. Our project can therefore offer the public a smart solution that monitors people and reminds them to maintain their distance in public areas. Because of infrastructure limitations, we were not able to train beyond a certain number of epochs; further training would likely have yielded better accuracy scores. In terms of data collection, we could partner with establishments that see large footfall, such as supermarkets, malls, theatres, hospitals, and banks, run the social distancing model on a live feed, and raise an alarm when needed. In the future, additional backend processes will be included to allow advanced statistical analysis, which authorities, facilities, or building owners can use to monitor the level of compliance among people or visitors.

This research presented an intelligent surveillance system for social distancing classification. The proposed technique achieved promising results for people detection, with detector accuracy and precision comparable to other YOLOv5 models. We have tested the model and concluded that it performs well on side-angle, top-angle, overhead, and drone-angle footage; it may not work for other camera angles. The application is not intended to measure social distancing from a drone angle, however, because the objects look smaller and the prediction labels overlap significantly, which makes the output hard to read.

As of now, the model is trained on separate datasets one at a time. A more effective mechanism would be to consolidate different datasets, so that the model is trained on a wider variety of images and achieves better predictions on any dataset without re-training. Genuine concerns about privacy and individual rights may also be raised; these can be addressed with additional measures such as obtaining prior consent in such working environments, hiding people's identities in general, and maintaining transparency about fair use within a limited set of stakeholders. The proposed approach can be implemented in a distributed video surveillance system, a drone surveillance system, and other similar surveillance systems. It is a suitable solution for authorities to visualize people's compliance with social distancing at a confidence of more than 80%, even with limited training. With a significant amount of training, we expect the accuracy scores to exceed 90%.

WHO Coronavirus (COVID-19) Dashboard | WHO Coronavirus (COVID-19) Dashboard With Vaccination Data

CrowdHuman: A Benchmark for Detecting Human in a Crowd

Fast R-CNN

Rich feature hierarchies for accurate object detection and semantic segmentation

CityPersons: A Diverse Dataset for Pedestrian Detection

Considerations relating to social distancing measures in response to COVID-19 - second update

Faster R-CNN: Towards real-time object detection with region proposal networks

An implementation of Faster R-CNN with study for region sampling

You only look once: Unified, real-time object detection

Deep residual learning for image recognition

Faster R-CNN: Towards real-time object detection with region proposal networks

Convolutional neural network for person and car detection using YOLO framework

Homography based multiple camera detection and tracking of people in a dense crowd

Cascade R-CNN: Delving into high quality object detection

Automatic recognition and analysis of human faces and facial expressions: A survey

Pedestrian detection in crowded scenes

YOLOv3: An incremental improvement

Inception-v4, inception-resnet and the impact of residual connections on learning

FCOS: Fully convolutional one-stage object detection

Learning data augmentation strategies for object detection

M2Det: A single-shot object detector based on multi-level feature pyramid network

Medium: YOLOv3: A Huge Improvement

Object detection and distance measurement

Automatic recognition and analysis of human faces and facial expressions: A survey

Rich feature hierarchies for accurate object detection and semantic segmentation

Very deep convolutional networks for large-scale image recognition

Automatic recognition and analysis of human faces and facial expressions: A survey

Megapixels.cc: Origins, ethics, and privacy implications of publicly available face recognition image datasets

Pedestrian detection via body part semantic and contextual information with DNN

Gated feedback refinement network for dense image labeling

Handling occlusions with frankenclassifiers

Focal Loss for Dense Object Detection

Encoder-decoder with atrous separable convolution for semantic image segmentation

Multi-label Learning of Part Detectors for Heavily Occluded Pedestrian Detection

Mask-Guided Attention Network for Occluded Pedestrian Detection

Object as Distribution

Facial expression recognition and recommendations using deep neural network with transfer learning

A Convnet for Non-maximum Suppression

Mask-Guided Attention Network for Occluded Pedestrian Detection

Multi-label Learning of Part Detectors for Heavily Occluded Pedestrian Detection

Parallel feature pyramid network for object detection

SSD: Single shot multibox detector. ECCV

SSD: Single Shot MultiBox Detector

Bi-box Regression for Pedestrian Detection and Occlusion Estimation

Mask-Guided Attention Networks for Occluded Pedestrian Detection

Proposal free network for instance-level object segmentation

Activity & emotion detection of recognized kids in CCTV video for day care using SlowFast & CNN

Rich feature hierarchies for accurate object detection and semantic segmentation

People-tracking-by-detection and people-detection-by-tracking

Histograms of oriented gradients for human detection

An online people counting system for electronic advertising machines

Automatic face detection and recognition for attendance maintenance

Fast human detection from joint appearance and foreground feature subset covariances

Monitoring social distancing under various low light conditions with deep learning and a single motionless time of flight camera

DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic

A deep learning-based social distance monitoring framework for COVID-19

Implementing a real-time, AI-based, people detection and social distancing measuring system for Covid-19