key: cord-0150503-d04fl28i authors: Aznan, Nik Khadijah Nik; Brennan, John; Bell, Daniel; Jonczyk, Jennine; Watson, Paul title: On the Complexity of Object Detection on Real-world Public Transportation Images for Social Distancing Measurement date: 2022-02-14 journal: nan DOI: nan sha: 54e5557790a08d4ba2d7488591113bb5a1502e25 doc_id: 150503 cord_uid: d04fl28i Social distancing in public spaces has become an essential aspect in helping to reduce the impact of the COVID-19 pandemic. Exploiting recent advances in machine learning, there have been many studies in the literature implementing social distancing via object detection through the use of surveillance cameras in public spaces. However, to date, there has been no study of social distance measurement on public transport. The public transport setting has some unique challenges, including some low-resolution images and camera locations that can lead to the partial occlusion of passengers, which make it challenging to perform accurate detection. Thus, in this paper, we investigate the challenges of performing accurate social distance measurement on public transportation. We benchmark several state-of-the-art object detection algorithms using real-world footage taken from the London Underground and bus network. The work highlights the complexity of performing social distancing measurement on images from current public transportation onboard cameras. Further, exploiting domain knowledge of expected passenger behaviour, we attempt to improve the quality of the detections using various strategies and show improvement over using vanilla object detection alone. There have been many studies on monitoring social distancing to provide safety measures in public spaces in recent years, primarily because of the COVID-19 pandemic. Social distancing refers to a distance maintained between one person to another to minimise physical contact [1] . One of the ways the virus spreads is through close contact between people; therefore, social distancing can reduce or slow down the spread [2] . By taking advantage of surveillance camera footage, social distancing monitoring systems have been developed that can be used to to either enforce social distancing in public spaces [3] or provide advice and guidelines on how effectively social distancing works [4] - [9] . However, no study shows the efficacy of social distancing measurement on public transportation. Public transportation is an essential facility for daily transportation. According to the United Kingdom Department of Transport, in the year 2019/2020, there were 4524 million passenger journeys on local bus services in Great Britain, 1745 million passenger journeys on the national rail network, and 1337 million passenger journeys on London underground [10] . These numbers showed that there were many journeys with a high-density of passengers and so automatic social distancing measurement would have been helpful. Whilst, social distancing on public transportation is critical as the virus spreads more easily in the enclosed spaces, other factors such as good ventilation on public transportation will also help with transmission mitigation and are highly recommended [11] . However, social distancing is crucial when it is hard to achieve a good ventilation. Convolutional Neural Networks (CNN) are a state-of-the-art machine learning approach that has been shown as a practical solution in object detection tasks within computer vision [12] . Object detection architectures such as region-based CNN (R-CNN) like Faster R-CNN , Mask R-CNN, and YOLO v3 (You Only Look Once version 3) provide solutions for detecting objects and their locations in an image [13] . These large architectures are trained on massive benchmark datasets for multiple days to achieve their predictive performance. Despite advancements in computer vision, object detection tasks still face challenges when dealing with real-world images, especially where the images suffer from degradation and occlusion [13] . In order to accurately monitor social distancing on public transportation, ideally object detection should be able to detect the presence and position of every passenger on board by using the surveillance CCTV system on the vehicle. However, based on our observation of these images, the quality of the images is poor, with very low-resolution and low frame-per-second when compared to the benchmark dataset used to train object detection algorithms such as MS COCO that has high quality and clean images [14] . Additionally, the camera's physical position in the vehicle often means it is far from the location of the passenger seats and there are fittings such as hand rails and seats that prevent clear line of sight. This results in social distancing monitoring on public transportation being complex when compared to monitoring using CCTV in streets or malls. There are also possible false positives on the footage caused by either image degradation or the capture of partial images of people through the window and therefore outside the vehicle. In this study, we present three different object detection algorithms, Faster R-CNN, Mask R-CNN and YOLO v3, to investigate the performance of these algorithms in detecting passengers from the low-resolution onboard CCTV cameras from both busses and underground trains. We then explore using our domain knowledge on the expected behaviour of passengers to improve the accuracy. We also use densitybased clustering of the detected passengers to examine the possibility of automatic social distancing measurement on public transportation. This paper highlights the complexity of achieving social distancing measurement from onboard low-resolution CCTV. To our knowledge, this study is the first to address such an issue on public transportation datasets. Using state-ofthe-art computer vision algorithms such as Mask R-CNN, we believe performing social distance measurement in reallife will require some adjustment in terms of the hardware specification and the physical location of the surveillance camera. In summary, the main contributions of this study are: • A study of social distancing measurement on public transportation based on the existing, low-resolution onboard surveillance cameras in the vehicles. • We show that domain knowledge based on expected passenger behaviour on public transportation can improve the results by certain condition. • We highlight the complexity of detecting passengers and social distancing automatically via object detection algorithms on real world CCTV footage and suggest how this could be improved. Recently, taking advantage of the performance of CNN in detecting objects and people, there has been great interest in developing methods for the detection of social distancing which we review in this section. For example, YOLO v3 has been used as the object detection architecture for detecting pedestrians in [4] - [9] , primarily because it works better in a real-time setting [7] . Other versions of YOLO architecture have also been used for object detection, such as YOLO v2 [1] and YOLO v4 [15] . Other work in the literature has focused on comparing different object detection algorithms at detecting pedestrians, for example, the authors in [16] compare Faster R-CNN and YOLO v4, the authors in [7] compare Faster R-CNN, Single Shot Detector (SSD) and YOLO v3 and the authors in [17] used YOLO v3 after comparing to RetinaNet and Mask R-CNN. In order to measure the distance between people, the authors in [4] , [5] , [16] transformed the camera view into a bird's eye view to get the location of the detected person from the top-down perspective. The centroid of the detected person is used to measure the Euclidean distance to any other person in the image. The Euclidean distance is also used in [6] , [8] , [15] , [18] , [19] as well without any camera transformation for social distance measurement. In [9] , the position of the detected pedestrian is estimated based on the bottom-centre point of the bounding box, rather than the centroid, before computing the distance between every pedestrian pair. Instead of tracking individual location, the work in [17] measures the pedestrian density in a given image, via the use of a heatmap and temporal density distribution. The author in [20] proposed a model, entitled DeepSO-CIAL, based on a modified YOLO v3 to detect, track and estimate the distance between people. They applied the Simple Online and Real-time (SORT) tracking technique that uses Kalman filter and Hungarian optimisation to track and assign a unique ID to each detected people. To then detect the distance between people, they used inverse perspective geometric mapping. The authors in [3] proposed a multimodal Convolutional Variational Autoencoder (CVAE) fusion framework where the input data is from a combination of visual, infrared, audio and accelerometer sensors. The authors argue this multimodality allows for a greater performance level than offered by considering only visuals. Most of the work in social distancing measurement has been trained and evaluated on CCTV footage of pedestrians walking for example, from Oxford Town Centre [7] , [16] , [20] and New York [17] . Other studies have used data from low light environments [15] and three different videos from town centers and beaches [8] . While earlier work focuses on CCTV footage that captures people mostly walking in a wide spacious area, in this paper we address the complexity of social distancing measurement on public transportation, especially those due to low-resolution images, non-optimal camera placement and physical limitations. This paper analyses the complexity of applying the social distancing measurement via onboard low-resolution CCTV, especially public transportation. Our analysis is based on footage from a London underground line and from Go North East busses. The majority of the London Underground lines have highresolution CCTV cameras. However, this study investigates the complexity of detecting passengers not only in low-resolution images that can be found on a specific line on London underground and busses but also because of the occlusion caused by the camera's physical location. We randomly choose five clips from the Go North East bus journey and five from a London underground train to visualise the outcome of this analysis. For privacy reasons, all the images in this paper have been post-processed with a Gaussian blur after the experiments to hide any passengers and their belongings. This section discusses three different state-of-the-art object detection approaches on images from low-resolution onboard surveillance cameras on public transportation to assess how well the pre-trained networks detect the passengers. We also present a post-processing technique we used to investigate if we can improve the quality of the detections. Object Detection is one of the most used applications within Computer Vision and has been widely used in object (a) V01 tracking, autonomous vehicle, medical imaging and many more application. The goal of object detection is to locate each object in an image [21] . In this paper, we show three different object detection algorithms to see if the algorithms can detect the passengers using the onboard CCTV camera footage from the bus and underground train. The Faster R-CNN is a region-based CNN that introduces another convolutional network (on top of the CNN for the classification task) as Region Proposal Network (RPN) to generate the pixel coordinates of the potential objects in an image [22] . A region-based network is a network that has a region proposal module that generates a variety of bounding boxes with different sizes and aspect ratios as the proposed object regions. This network will return object classes with the score and the bounding boxes for each object as the pixel coordinates [23] . 2) Mask R-CNN: Mask R-CNN used the RPN from Faster R-CNN, but instead of just the bounding box, it indicates the binary mask of each object and the bounding box. The binary mask gives 1 to the pixel that belongs to the objects and 0 otherwise. Mask R-CNN is considered an instance segmentation because of the mask segmentation, given the pixel by pixel location and the bounding box of every object. The granularity can be beneficial in applications such as autonomous vehicles and pose estimation, where precision is essential [24] . 3) YOLOv3: Unlike region-based object detection CNN use two networks, one for object localisation and another for feature extraction, You Only Look Once (YOLO) introduces single-stage architecture using the regression-based method [25] . YOLO v3 has a deep end-to-end CNN that deals with feature extraction and object localisation, making this unified architecture very quick to train [26] . We show examples of these three object detection approaches as in Figure 1 . As the performance of Mask R-CNN is better than Faster R-CNN or YOLOv3, in this paper, we use Mask R-CNN running on Detectron2 framework [27] as our passenger detection. There are some possible false positives in the footage that caused by the degradation of images or the camera angle, especially on the bus that captures partial images from outside the vehicle. In this paper, we explore the way to eliminate any false positive detection by using bounding box analysis. We are interested in reducing the false positives via the usage of domain knowledge on the expected behaviour of the passengers on public transportation by eliminating any bounding boxes from the prediction that only appear for a few frames. The assumption is that, once people are on the bus, their bounding boxes will frequently appear within the same region for at least a few minutes. If a "person" is detected for less than a certain number of frames, we can assume it is a misidentification, for example, a pedestrian or cyclist outside of the bus. Our bounding box analysis is based on a few assumptions: • Most of the time, all passengers are roughly in the same place (they are sitting in the same seat) throughout the journey. • There is no more than one false positive in a frame. • We vary the frame window between 4 frames to 300 frames (1 second to 60 seconds given 4fps of the video). • We vary the centre of the bounding box within ±10 and ±15 pixels. Based on the preliminary experiments on comparing three different object detection models, we use Mask R-CNN running on the Detectron2 framework for passenger detection in this paper. We compare the headcount from the Mask R-CNN to the ground truth to see how accurately the model detects passengers on public transportation. We randomly choose five different camera footages from the bus journey and five from the underground train, and manually count the number of passengers of each journey as the ground truth headcount. We also randomly select a frame range where the headcount is constant to monitor the performance of the Mask R-CNN. To study whether the model under or over predicts, we plot the relative change in percentage as in Figure 2 . We measure the relative changes of the headcount for every frame based on : where y i is the ground truth number and x i is the prediction number. The percentage above zero indicates under prediction, while the percentage less than zero indicates over prediction (False Positive). Table I shows the headcount of the prediction using Mask R-CNN to the ground truth on the frames where the ground truth headcounts are constant (i.e. between stops) to show the performance of Mask R-CNN on detecting passengers. We also calculate the mean absolute relative changes to see average relative changes across all frames on the footage. Figure 2 and Table I show that the predictions often differ greatly from the ground truth headcount. The bus footage plots indicate that false positives happen in all the footage. Some of the examples of object detection via Mask R-CNN can be seen in Figure 3 . The current literature in Section II focuses on pedestrian detection, where the pedestrian in the footage is mostly standing up, which shows more visible features than people sitting down. Figure 4 shows an example of a passenger being easily detected when standing up compared to sitting down. Object detection works very well when the model can learn features that discriminate between different classes:-in our case, between a person and a non-person. However, we have to work with low-resolution CCTV images, making it hard for the object detection network (Mask R-CNN) to identify the features within the images. Apart from that, there are many passengers that are occluded because of the CCTV's physical location, for example, in Figure 5 . In Figure 5a , we can see that a passenger is blocking another passenger, and the Mask R-CNN is blending this passenger's head to another passenger's lower body as a detection. Similarly, two passengers in Figure 5b have been detected as a passenger. There are pedestrians outside of the bus in Figure 5c and Figure 5d that have been detected, contributing to the Figure 5c has not been detected, perhaps because of the unusual angle of the passenger and the hardcoded metadata causing some occlusion of the passenger. In addition to the low-resolution camera, the physical location of the surveillance camera on the vehicle can also be an issue. For example; • Having a camera far away from the passenger seats' location results in some people only appearing as a few pixels in the frame. • Other obstacles may occlude a clear line of sight to the passengers, such as seats, handrails, or other passengers in front of the others. • As the CCTV also captures partial images from outside of the vehicle, false positives may occur, especially if any pedestrians or cyclists are captured within the frame. • The inconsistent brightness on the vehicle will either create shadows that can be mistaken as a person or may result in low illumination in the images, again degrading the network's performance. Based on Figure 2 in Section V-A, all the bus footages (V06, V07, V08, V09, V10) have a significant number of false-positive detections. From the observation, the location of the CCTV cameras and the positions and the large windows contribute to the false positive detections on the bus. As the CCTV captures partial images from outside of the bus, false positives may occur, as when pedestrians or cyclists are captured within the frame. In addition, the inconsistent brightness on the bus will either create shadows that can be mistaken as a passenger, again degrading the performance of the object detection network. Therefore, we ran the bounding box analysis to investigate if our assumptions are correct and improve the relative change of passenger detection. As shown in Figure 6 , this bounding box analysis method manages to remove some of the false positives. However, some of the true positives have been removed as well. The mean absolute relative change for these five videos now are; 76.1%, 77.4%, 23.2%, 51.5%, and 40.4%. One of the observations that contribute to this is some of the passengers are sitting where they are obscured from a clear line of sight of the CCTV camera. Thus, the only time they were detected was when they walked to their seat or walked out from the bus. By eliminating the bounding box that appears less frequent around the same region, the detection of these passengers will have eliminated as well, resulting in them not being counted. (a) A part of the passenger has been detected. (b) A passenger is stretching out the leg. (c) A false positive of a seat next to a detected passenger. (d) The lower body of the passenger is not detected. Using bounding boxes from the detected passengers, we can obtain the centroid of the boxes and measure the Euclidean distances of any adjacent passengers. We use this information and the safe distance threshold to measure the social distancing between passengers. We used Density-Based Spatial Clustering of Applications with Noise (DBSCAN) as density-based clustering where it clusters based on the distance between the nearest point [28] . Figure 7 shows the example of a social distancing measurement on public transportation. The green bounding box indicates safe social distancing, amber indicates not safe, and the red bounding box indicates dangerous social distancing. As the social distancing relies on passenger detection, the passenger that has not been detected did not contribute to the overall clustering algorithm, causing inaccurate measurement of the social distancing. It is also similar to any false positives that add inaccurate bounding boxes that will affect the accuracy of the social distancing measurement. It is also important to note that this social distancing measure is purely based on the passenger detection based on the footage. There is not enough information to conclude whether the passengers are perhaps from the same household. The occlusion or the wrong pixel detection during passenger detection causes the wrong size of the bounding box, leading to inaccurate social distancing measurement as the clustering is based on the centre of the bounding box. Some of the inaccurate social distancing measurement examples can be seen in Figure 8 that shows a part of a passenger has been detected. However, the size of the bounding box is smaller than the actual ground truth, making the passenger's centroid a bit far from the next passenger. Contrary, as the passenger in Figure 8b stretched out the legs, the bounding box became bigger and pushed the centroid closer to the other passenger. A false positive is detected next to a detected passenger in Figure 8c , resulting in a false danger social distancing measurement. The seats on the vehicle can sometimes block some part of passengers resulting in a smaller bounding box than the actual ground truth. The two passengers sitting next to each other in Figure 5d are measured as at a safe distance because of the big difference in the bounding box size. With the recent progress in machine learning for computer vision, there is much literature investigating the best way to tackle the issue of social distancing through the use of surveillance cameras in public spaces, primarily because of the COVID-19 pandemic. However, there is no related study on social distancing on public transportation up to date. CNNs have been shown capable in various object detection tasks within computer vision, especially when the image quality is good. In this work, we highlight the complexity of applying social distancing measurement using these methods on onboard low-resolution CCTV from public transportation. Using domain knowledge, we analyse the detection output to see if it is possible to improve the detection, especially in the headcount. Finally, we measure the social distancing between passengers using density-based clustering. In order to accurately monitor social distancing on public transportation, object detection should be able to detect every passenger on board by using the surveillance CCTV system on the vehicle. The physical location of the camera on public transportation, typically distant from the passenger seat location causes some passengers to appear as only a few pixels in the image. Additionally, the nature of the public transportation with obstacles such as seats, handrails and other passengers may occlude clear line of sight to the passengers. In addition, inconsistent image illumination and partial scenes captured from outside of the vehicles can be a potential issue degrading model predictive performance. The current onboard surveillance cameras on public transportation hinders satisfactory object detection and, consequently, affects the social distancing measurement. Other training approaches may be required to help models account for the challenges of these real-world images. Additionally, better quality hardware and improved physical placement of the camera should be put in place to improve automatic social distance measurement. Implementing a real-time, ai-based, people detection and social distancing measuring system for covid-19 Effects of social distancing on the spreading of covid-19 inferred from mobile phone data A multimodal deep fusion graph framework to detect social distancing violations and fcgs in pandemic surveillance Deep learning and computer vision-based social distancing detection system Deep learning-based bird eye view social distancing monitoring using surveillance video for curbing the covid-19 spread Social distancing analyzer using computer vision and deep learning Monitoring covid-19 social distancing with person detection and tracking via fine-tuned yolo v3 and deepsort techniques A proposal to ensure social distancing with deep learning-based object detection Social distancing detection with deep learning model Rail statistics How can airborne transmission of covid-19 indoors be minimised? Object detection in 20 years: A survey Deep learning for generic object detection: A survey Microsoft coco: Common objects in context Monitoring social distancing under various low light conditions with deep learning and a single motionless time of flight camera A vision-based social distancing and critical density detection system for covid-19 Referencefree video-to-real distance approximation-based urban social distancing analytics amid covid-19 pandemic A safety measuring tool to maintain social distancing on covid-19 using deep learning approach A novel social distancing analysis in urban public space: A new online spatio-temporal trajectory approach Deepsocial: Social distancing monitoring and infection risk assessment in covid-19 pandemic Applications of object detection system Faster r-cnn: Towards real-time object detection with region proposal networks Rich feature hierarchies for accurate object detection and semantic segmentation Mask r-cnn You only look once: Unified, real-time object detection Yolov3: An incremental improvement A density-based algorithm for discovering clusters in large spatial databases with noise We thank the UK Research and Innovation (UKRI), Department of Transport (DfT), Transport for London (TfL) and Go North East.