key: cord-0079078-37tm93tc authors: Kundu, Srimanta; Maulik, Ujjwal title: Passenger Surveillance Using Deep Learning in Post-COVID-19 Intelligent Transportation System date: 2022-05-26 journal: Trans Indian Natl DOI: 10.1007/s41403-022-00338-y sha: 848de47f1f81470576d49987dc423f5f7a33a1ec doc_id: 79078 cord_uid: 37tm93tc

The Intelligent Transport System will need to be renovated in many aspects in a post-pandemic situation such as COVID-19. The passenger count inside a car will be restricted based on the vehicle capacity and the COVID-19 hot-spot zone. Traffic rules will be impacted to align with a similar contagious outbreak. The on-road 'Yellow-Vulture' cameras need to incorporate such surveillance rules to monitor related anomalies for preventing contamination. To maintain safe distance, an automatic surveillance system will be preferred by the Government very soon. Moreover, facial mask usage during the journey has become an essential habit to stop the spread of the infection. In this article, we propose a deep-learning based framework that employs an augmented image data set to provide proper surveillance in the transport system and maintain the health protocols. Fast and accurate detection of the number of passengers inside a car and their face masks from the traffic inspection camera feed has been demonstrated. We have exploited the advantages of the popular Transfer Learning approach with novel variations of images while performing the training. To the best of our knowledge, this is the first attempt to watch over in-vehicle social-distancing in post-pandemic circumstances through deep-learning based image analysis. The superiority of the proposed framework has been established over several state-of-the-art techniques using different numerical metrics and visual comparisons, along with the support of a statistical hypothesis test. Our technique has achieved [Formula: see text] testing accuracy in various adverse conditions. Zero-shot evaluation has been explored for the Real-Time-Medical-Mask-Detection data set Wang et al. (Real-Time-Medical-Mask-Detection, 2020a https://github.com/TheSSJ2612/Real-Time-Medical-Mask-Detection/, Accessed 14 Nov 2020), where we have attained 96.4% accuracy, which manifests the generalization of the network.

The current pandemic situation will impact every industry in the world, and the automotive industry will not be an exception. All the domains in this industry, such as car design, manufacturing, marketing strategy, maintenance, ride safety, etc., will need to address the upcoming restrictions. The vehicle monitoring system will have a domain shift too from its existing version. In general, the Government and the Traffic Management Authority install surveillance systems to increase on-road safety and security. Those systems inspect pollution, vehicle speed, anomalies in driving behavior and other traffic rule violations. 'Yellow-Vulture' cameras have become very popular on UK roads, capturing several offenses inside the car, such as smoking, drinking, etc. However, in the near future, some additional constraints related to COVID-19, such as social-distancing, mask-wearing, etc., need to be compulsorily considered. Therefore, it is essential to address issues related to contagious diseases in the Intelligent Transportation System (ITS) as well. In the last few years, several research works have been reported to address issues related to the surveillance system in the vehicular field. Baran et al.
(2016) showed a smart camera-based system to detect anomalies in ITS. Mehboob et al. (2019) have presented an intelligent traffic management system. Zhang et al. (2019) proposed an efficient vehicle detection technique for traffic surveillance data in real-time. Manzoor et al. (2019) recognized vehicles on the basis of generation, make and model, which can be used for surveillance. In Gu et al. (2018), the authors proposed a unique network model for multi-UAV surveillance. Drone-based traffic management has also become popular in the last few years (Ali 2019; Ong and Kochenderfer 2017). However, in most cases, video processing is commonly used in traffic management systems (Xia et al. 2016; Hsieh et al. 2006). Pramanik et al. (2021) have developed a video surveillance system for road safety with a pre-event detection feature. Different vehicle detection and counting techniques have been applied by researchers over the last decade (Tang et al. 2017; Balid et al. 2017). Detection of several driving anomalies has been deployed for traffic management in many articles (Yuan et al. 2016; Hua et al. 2018). Seatbelt violation has been captured efficiently by Elihos et al. (2018). In the recent pandemic, researchers have also put their effort into identifying and enforcing social-distancing norms (Punn et al. 2020; Yang et al. 2020; Saponara et al. 2021). In Hörcher et al. (2021), the authors have shown different possibilities of implementing social distancing in public transport.

The spread of COVID through transport is becoming a huge challenge to the world now. ITS should impose such relevant norms to break the chain of COVID transmission. In this paper, we have mainly focused on preventing the spread of such contagious diseases through enforcing rules in ITS. The key intention here is to identify the passengers' mask-wearing status inside a vehicle. We have deployed a Deep Learning (DL) based approach to deal with the mask status of any passenger seated inside the vehicle. At the same time, the passenger count is another important criterion to maintain during the pandemic time; this indirectly takes care of the social-distancing aspect. Though face detection is a well-known problem in computer vision, detection of tiny degraded faces inside a vehicle for passenger counting is not a widely explored problem. A near real-time integrated process flow for passenger counting and mask detection from image frames has never been explored, to the best of our knowledge. In this work, we not only design a robust framework for counting the number of passengers inside a vehicle but also design a face mask detection algorithm using transfer learning with very high accuracy. Our main contributions are summarized below:

• A robust data set with huge variations and real-life use cases has been prepared. Special attention has been given to side-face, hand-covered and low-resolution images, which will be essential for traffic surveillance. The data set is publicly available at https://github.com/srimantacse/MaskSurveillance.
• An efficient deep learning model to identify a face mask over the human face, with the adoption of very tiny face detection, has been implemented. We have shown that the proposed model outperforms the other state-of-the-art techniques through extensive comparison as well as statistical hypothesis testing.
• As per the authors' knowledge, this is the first work to find out the mask-wearing status in inside-vehicle surveillance.
This sort of traffic monitoring will be very essential in the current COVID-19 related pandemic context.

The rest of this paper is organized as follows. A literature survey of mask detection is presented in Sect. "Related Works of Mask Detection". In Sect. "Some Basics of Deep Learning", the different deep learning concepts used in this article are briefly reviewed. The data set preparation and the surveillance camera setup model are presented in Sect. "Modeling Image Data Set". Thereafter, the experimental setup and the methodology are discussed in Sect. "Proposed Framework". We summarize the results under different scenarios in Sect. "Results and Evaluations". Finally, we discuss the overall achievement in short and conclude the paper in Sect. "Conclusions and Discussion".

With the prevalence of the COVID-19 pandemic and the implementation of social distancing, road transportation rules have been modified and imposed with new criteria. In this article, we have outlined a different aspect of surveillance considering the post-COVID-19 circumstances. Here, we have focused on two important ideas in the vehicular inspection process. First, we have put our attention on the safe-distance norm inside the car. Governments of different countries have enforced regulations to control the COVID-19 spread through transport. In several countries, the passenger count is being restricted, where four-wheelers can only have three occupants, including the driver, to maintain social distancing (Online 2000a, b). For the next few years, this kind of rule is likely to remain in place to control the contagious spread. In this article, we have incorporated a robust face detection technology for counting the number of passengers within a car. In this regard, video frames or snippets will be taken from the road-side installed surveillance camera. Second, the use of a face mask is another mandatory criterion to prevent the spread via the transport system. Therefore, the identification of facial masks has become an important topic of research in very recent times.

Researchers have started putting effort into this particular area to detect a mask over the human face, which is a very critical preventive step (Rahmani and Mirmahaleh 2020). A feature-based technique that uses images from a mobile camera has been developed by Chen et al. (2021). They have focused on feature extraction for specific types of masks, and the K-Nearest Neighbor algorithm has been utilized for the detection purpose. Ejaz et al. (2019) have utilized the well-known PCA technique for the recognition of a mask over the human face. In Nieto-Rodríguez et al. (2015), the authors have experimented with identifying the presence of a mandatory medical mask in the operating theater. Although their application areas were quite novel, the performance of the proposed algorithms was not satisfactory due to the use of traditional machine learning based methods; recently developed DL approaches provide better performance in the mentioned tasks in terms of accuracy. Loey et al. (2021) have recently presented a Support Vector Machine (SVM) based approach over the ResNet architecture for face mask detection. Although the accuracy of the model is high, it has not been validated on a noisy image set. Moreover, the ResNet feature computation is a bit complex due to the higher number of parameters. In Wang et al. (2020b), the authors have achieved 95% recognition accuracy using the face-eye-based multi-granularity model.
A single-shot detector (SSD) has been experimented with in Nagrath et al. (2020) for identifying faces from images, and afterward the MobileNet-V2 baseline architecture has been used for the classification task. However, the tiny-face, side-face and hand-covered use cases were not considered in that paper. Thus, the robustness of that framework is low and it is difficult to use in real-world conditions. At the same time, several other CNN frameworks with different backbone architectures can be deployed to serve the same purpose. For example, AlexNet (Krizhevsky et al. 2012) can be used as the backbone of a low-complexity process. As VGG (Simonyan and Zisserman 2014) is known to be a robust feature extractor, VGG based models such as VGGFace-ResNet (Parkhi et al. 2015) have also been developed for this purpose. Specialized models for face classification such as FaceNet (Schroff et al. 2015) can be modified to detect face masks in images. VGGNet (Simonyan and Zisserman 2014) and AlexNet (Krizhevsky et al. 2012) share the drawback of a huge number of parameters, 138M and 62M, respectively. In this respect, Inception-V3 is a comparatively lighter network to train and to adapt in any Transfer Learning (TL) framework. Table 1 outlines a comparative sketch of the state-of-the-art approaches in this domain.

For this task, we have initially prepared a robust data set with significant variations for mask prediction over a human face image. Different essential conditions such as noise-induced images, day-night mode, angle of faces, the direction of heads, etc. have been considered to enrich the overall database. We have also put effort into dealing with several other use cases, such as side-face, hand-covered and tiny faces with varying illumination, imaging height, resolution, etc. A separate on-road image set has been accumulated for the real-time evaluation of the model. Concurrently, a network model has been proposed and optimized with different variations of these image data inputs during the training, which ultimately provides more accurate outcomes. The Transfer Learning (TL) process has been exploited in our framework. We have explored the popular Deep Neural Network (DNN) Inception-V3 in the experiment. Using the TL mechanism, the model has been re-trained through the proposed network with our data set. The results have demonstrated the efficiency of the framework. We have compared our approach with recent state-of-the-art DL techniques, such as SSDMNV2, ResNet-SVM, AlexNet, VGGFace-ResNet and FaceNet, using several numerical metrics, viz. Accuracy, Precision, Recall, F1-Score, Specificity and Jaccard Score (JS). Visual comparisons and other related results also prove the superiority of the proposed approach. Along with these, we have showcased the efficiency of the proposed method through zero-shot evaluation with the Real-Time-Medical-Mask-Detection (RTMDD) data set experimented in Nagrath et al. (2020); this supports that the method is not bottlenecked by the over-fitting problem. The research benefits of the proposed framework mainly include real-time traffic surveillance. Preventing the spread of COVID-19-like contagious diseases through the transport system is one important motive of this research.

Deep Learning (DL) is an enhanced and flexible extension of Machine Learning (ML) that improves learning algorithms' structure and makes them easy to use.
Being a subset of ML, DL is immensely used with very large data sets and offers a simple framework in which to deploy complex models (Schmidhuber 2015; LeCun et al. 2015; Ghosh et al. 2019). DL, along with Convolutional Neural Networks (CNNs), has been applied tremendously in several computer vision research areas. The major advantage of DL is its potential to exploit huge amounts of data during learning. As an obvious impact, the accuracy increases by a considerable margin.

Transfer Learning (TL) (Pan and Yang 2009) is a popular concept in ML which has been successfully applied in different application areas (Bird et al. 2020; Pan et al. 2010). It utilizes the knowledge acquired while resolving one problem and applies it to a different but related domain. In CNN based classification problems, usually the output layer (i.e., the final layer used in the network) needs to be replaced with a new layer as per the problem domain. The main idea behind this approach is to achieve higher accuracy with less data. It allows any DL-based framework to generate a robust feature set from the original input, and with a small connected network we can achieve better accuracy. The baseline performance might improve due to this knowledge transfer, along with a faster model development time. Better generalization can also be attained with this approach. The same notions have been utilized in our setup to enrich our proposed baseline network. We have achieved higher accuracy with comparatively less data (compared to ImageNet (Deng et al. 2009)) in our experiment.

ImageNet (Deng et al. 2009) is an image database categorized as per the WordNet hierarchy. Typically, the training data set comprised 1,000,000 images, with 50,000 images for a validation set and 150,000 for a test set. This data set has been used by researchers in several articles and machine vision projects (Russakovsky et al. 2015; Deng et al. 2014).

Inception-V3 is a popular deep convolutional architecture provided by Google. The model includes a combination of symmetric and asymmetric logical blocks (Szegedy et al. 2016a, 2015). Inception-V3 has been trained using a set of 1,000 classes from the ImageNet data set. There is a total of 42 deep layers (Tsang 2018) in the architecture, and it is significantly more efficient than VGGNet (Simonyan and Zisserman 2014). In addition to the Inception-V2 architecture, it uses the RMSProp optimizer, factorized 7x7 convolutions and BatchNorm in the auxiliary classifiers. The overall architecture is shown in Fig. 1.

In post-COVID-19 road transportation, a mask should be a mandatory item that has to be worn at all times, mainly while going outside. However, some people deliberately cover their faces with a hand; such cases should be properly identified and penalized. We have focused mainly on two aspects while preparing the data set. First, we planned to prepare a substantial data set for generic mask detection on the human face. Second, we have focused on the on-road inside-car image set, which will be very important for the surveillance purpose. Several perspectives are required for accurate mask detection. The result of human face and mask identification strongly depends on the image resolution and dots per inch (dpi). These two qualities will be affected if the distance from the camera increases. Even the position and size of the face is another vital point in this context.
Therefore, while designing the data set, we have intentionally introduced a blurring effect to increase data variation. Side-face images with masked and non-masked variations have been added as well, so that faces are predicted correctly irrespective of face direction. Different angles of the face need to be considered here. We have contemplated all of these in the data set for mask detection. To construct a substantial database, we took the available data set of Bhandary (2020) as a baseline, which has 690 masked and 686 non-masked faces. In the mentioned data set, all the face images were taken from the front side, and the mask has been superimposed on the non-masked images to prepare the masked ones. The data set has been enriched by adding another 1906 images to the existing one. The major contribution is that different side-face images have been added and used during the training. At the same time, some images with a hand placed on top of the mouth have been additionally included, which is obviously a restricted use case in the post-COVID-19 situation; a total of 650 such images are there in the data set. Another important aspect is that, in the testing phase, poor image quality of an individual face may hamper the testing accuracy. The chance of getting noisy images is higher in the case of gantry camera input while capturing a vehicle in motion. Keeping this limitation in mind, we have introduced blurred images in the data set. A total of 400 masked and 130 non-masked images have been converted to blurred versions with a kernel size of (20 × 20). Table 2 shows the category-wise distribution of the blurred image set. Considering the feasible angles of head position in the real scenario, we have augmented those images with two rotation angles, 10° and 20°, to increase the variation. The total number of images in the data set is now 3282. Figure 2 shows the comparative statistics of images in the baseline data set (Bhandary 2020) and our enhanced data set. Figure 3 shows some sample images of the enriched data set in different conditions.

In our experiment, we have mostly considered low-height surveillance cameras, such as a gantry setup. The tentative height of this camera should be around 6.5 ft above ground level. It should not be elevated so high that it prevents a proper view of the vehicle interior. As shown in Fig. 4, the region of interest should be focused on the side windows and the windshield of a vehicle from a certain height. We have collected such images to replicate the real gantry camera scenario for the test simulation. Figure 5 shows such images, which have been used in our experiment. This kind of mask surveillance will be part of the traffic 'Yellow-Vulture' shortly. The height of the surveillance camera has to be set experimentally based on the dimensions of several vehicles. Mostly, for a Light Motor Vehicle (LMV), the optimum height range is 4 ft to 7 ft. In our experiment, we have mostly considered LMVs, and the height from the ground level has been considered to be 6.5 ft. For the windshield images (a maximum distance of 25 ft to a minimum distance of 6.5 ft from the camera), the angle range is +75° to +45°. For the side-window images, a short distance range (a maximum distance of 10 ft to a minimum distance of 4.5 ft from the camera) will be considered, where the angle range lies approximately between +55° and +35°. Figure 4 shows this sample calculation for the front window screen only.
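The quoted angle ranges are consistent with measuring the viewing angle from the vertical at the camera mount, i.e., angle = arctan(horizontal distance / camera height). The short Python check below is a reconstruction under that assumption, not a formula stated in the paper:

```python
import math

CAMERA_HEIGHT_FT = 6.5  # assumed gantry mounting height from the text

def view_angle_deg(distance_ft, height_ft=CAMERA_HEIGHT_FT):
    """Angle between the camera's vertical axis and the line of sight (assumed geometry)."""
    return math.degrees(math.atan(distance_ft / height_ft))

# Windshield range: 25 ft down to 6.5 ft -> roughly +75° to +45°.
print(round(view_angle_deg(25.0)), round(view_angle_deg(6.5)))   # 75 45
# Side-window range: 10 ft down to 4.5 ft -> roughly +57° to +35°,
# close to the quoted +55° to +35° range.
print(round(view_angle_deg(10.0)), round(view_angle_deg(4.5)))   # 57 35
```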
On the other hand, for SUVs or big buses, this height should be higher. To cover all the use cases, a multi-camera setup installed at different heights would need to be considered. An inside-vehicle camera is another solution for this surveillance; however, this would increase the cost and create an extra burden of individual installation and the necessary changes.

We have deployed our framework on the Google Colab system, where the training of the proposed DNN network and the testing of several use cases have been performed separately. We have tuned different DL parameters of the proposed network and used them after proper optimization. The Google Colab environment chosen for our experiment has a baseline operating system (OS) Linux-4.19.104 (Ubuntu-18.04) with 12.7 GB RAM. We have deployed our experiment using the TensorFlow and Keras environments along with the required APIs.

We have utilized the pre-trained Inception-V3 weights at the beginning and appended fully connected network components with average pooling followed by flatten. Subsequently, 3 sets of consecutive dense and activation layers have been used with intermediate dropout. The pool size in the average pooling step was kept as (5 × 5). The three activation layers used in the network are relu, sigmoid and softmax, consecutively. The initial dropout percentage has been kept at 50%; subsequently, it is reduced to 40%. Figure 6 shows the overall training architecture of the Deep Neural Network (DNN) used in the experiment. 70% of the images from the newly built data set have been used for training and the remaining images have been kept for testing. Again, 10% of the training data has been used as the validation set during the training. The Adam optimizer has been used during the training, and the learning rate has been tested over the range 0.000001 to 0.01 at 9 discrete levels [0.000001, 0.000005, 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01] to select the optimal value. The experimentally found optimum learning rate, where the proposed model has achieved the best result, is 0.0001. The learning rate has been uniformly distributed in the decay parameter of the optimizer over the epochs, which has been set to 30. To note, all the other algorithms have also been run for the same number of iterations and the results compared. Binary cross-entropy has been utilized as the loss function for this binary classification task. The two dropout layers have been used for better generalization in the network, which reduces the chance of over-fitting.

In the proposed approach, we have used the enhanced data set, as mentioned in Sect. "Modeling Image Data Set", to train the deep model. The tiny face detector model (Hu and Ramanan 2017) has been adopted to find the face images in the given input. In our experiment, the TL concept has been deployed to build the deep network for mask detection. The baseline Inception-V3 model has been built on the ImageNet data set. This data set has a few variations of mask (normal-mask, gas-mask, ski-mask and oxygen-mask) images as well. Therefore, our mask-identifying target problem domain, D_t, is a subset of Inception-V3's problem domain, D_s (D_t ⊂ D_s). Subsequently, the learning task T_t is also a subset of the corresponding source domain's learning task T_s (T_t ⊂ T_s). From a thousand-class problem, we have boiled down to a two-class problem using this TL framework.
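As an illustration, the following Keras sketch mirrors the configuration described above: pre-trained Inception-V3 weights, (5 × 5) average pooling, flatten, three dense layers with relu, sigmoid and softmax activations, dropout of 50% and then 40%, the Adam optimizer with a 0.0001 learning rate and binary cross-entropy loss. The dense-layer widths are not stated in the text, so the values 128 and 64 are illustrative assumptions, and freezing the Inception backbone is likewise an assumption rather than a detail taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Minimal sketch of the transfer-learning head described above.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # assumption: reuse ImageNet features, train only the new head

model = models.Sequential([
    base,
    layers.AveragePooling2D(pool_size=(5, 5)),   # (5, 5, 2048) -> (1, 1, 2048)
    layers.Flatten(),
    layers.Dense(128, activation="relu"),        # assumed width
    layers.Dropout(0.5),                         # initial dropout 50%
    layers.Dense(64, activation="sigmoid"),      # assumed width
    layers.Dropout(0.4),                         # reduced to 40%
    layers.Dense(2, activation="softmax"),       # masked / non-masked
])

model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4),  # best-performing learning rate in the text
    loss="binary_crossentropy",                     # one-hot two-class labels
    metrics=["accuracy"],
)
# The paper additionally distributes the learning rate into the optimizer's decay over 30 epochs.
```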
After removing the last output layer of the base model, we have introduced unique two-class layers (masked and non-masked) at the end, in the Mask Prediction node, as shown in Fig. 6. Algorithm 1 shows the detailed structure of the process. The algorithm takes two input parameters for providing the proper on-road surveillance: the vehicle image I with size M × N × 3 and a threshold parameter for the passenger limit. This threshold is set based upon the vehicle type and capacity, and the social-distancing norm is strongly dependent on it. For example, for a four-wheeler LMV, the threshold should be set to less than or equal to (≤) 3, including the driver. Let us assume there are N_Faces faces present in the given input I of size M × N × 3. The function R performs the resizing operation over every face image to make it a fixed size of (m × n × 3); here, in the experiment, we have kept m = n = 224. For the i-th face, the corresponding feature set feature_i = (f_i1, f_i2, ..., f_ik) with cardinality k has been computed using the extractor E. The generated Inception-V3 based TL model (G) considers this feature set to provide the confidence level of prediction, noted as Conf_i^Test. Subsequently, the mask-wearing status of the i-th face is derived from Conf_i^Test using the function Stat. Based on the status and the face count, the rule-violation decision is taken. Algorithm 1 iterates this per-face processing while at least one face remains in the image I.

During training, following a similar process, Conf_i^Train has been generated for the i-th training image. For computing the loss (binary cross-entropy) during the training phase, we have used Eq. 1, where N_total^Train denotes the total number of images used in the training phase of the experiment and y_i denotes the ground-truth label (1 for masked, 0 for non-masked) of the i-th image:

L = -(1 / N_total^Train) Σ_{i=1}^{N_total^Train} [ y_i log(Conf_i^Train) + (1 - y_i) log(1 - Conf_i^Train) ]    (1)

In the testing phase, we have used the tiny face detector model to extract the individual face images with proper bounding boxes. As per the flow shown in Fig. 7, the frame extractor module extracts images from the video feed taken from the surveillance camera. Based on the face count, we can make a decision about the rule violation. The threshold value is taken as prior information by the system. As a special case, a lone driver without a mask inside a car would still be warned for ignoring the safety measure. At the final step of the testing phase, using the proposed mask detection model, we detect the mask on the face images.

The learning has been demonstrated through the visualization of the saliency map for individual training images. Figure 8 demonstrates six sample saliency images with the respective original images. These sample images indicate that the model has learned the overall facial part in the case of normal images, and the facial part excluding the covered zone in the case of masked images. As shown in Fig. 9, the training accuracy gradually increases and the loss decreases with the epochs during training. Moreover, to statistically demonstrate the efficiency of the generated model, we have run a repeated K-Fold validation process separately. The repetition parameter has been set to 5 and K has been set to 5 as well in the K-Fold. To display the outcome, we have drawn the boxplot of the corresponding validation accuracy for the different folds in the different runs. Figure 10 shows that our model exhibits the minimal span of the box in the boxplot, while the other models show varying accuracy across runs.
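A minimal sketch of this repeated K-fold protocol is given below, assuming scikit-learn for the fold generation; `images`, `labels` and `build_model` are placeholders standing in for the augmented data set and the model construction sketched earlier, not identifiers from the paper:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def repeated_kfold_accuracy(images, labels, build_model, epochs=30):
    """Repeated K-fold validation with K = 5 and 5 repeats, as described in the text."""
    rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
    scores = []
    for train_idx, val_idx in rkf.split(images):
        model = build_model()                       # fresh model per fold
        model.fit(images[train_idx], labels[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(images[val_idx], labels[val_idx], verbose=0)
        scores.append(acc)
    return np.array(scores)                         # 25 per-fold validation accuracies

# A boxplot of the returned accuracies (e.g., plt.boxplot(scores)) reproduces a Fig. 10-style view.
```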
To demonstrate the supremacy through hypothesis testing, we have computed the pairwise (state-of-the-art technique vs proposed model) p value of the Wilcoxon rank test (Wilcoxon 1945), which is shown in Table 3. Here we can see that there is no such similarity between the proposed model and the other models used for comparison in the experiment. The obtained p values are less than 0.05 and are therefore statistically significant. The mean and standard deviation of the different models have been noted in the same Table 3. This clearly shows the primacy of the proposed model.

To demonstrate the superiority of the proposed approach, we have tested on multiple variant images as well as on-road images of the vehicle. The primary objective of the method is to count the number of passengers inside the car and, at the same time, to check whether they wear a mask during the journey. The testing car database contains images from different angles and heights taken by the surveillance camera, night-mode pictures, mirror-reflected face images, glass-covered faces, etc. These images are mainly used for testing purposes. For the evaluation process, we have utilized several numerical indices, viz. Accuracy, Precision, Recall, F1-Score, Specificity and Jaccard Score (JS), using Eqs. 2, 3, 5, 4, 6 and 7, respectively, along with the visual comparisons. With TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives, respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2)
Precision = TP / (TP + FP)    (3)
F1-Score = 2 · Precision · Recall / (Precision + Recall)    (4)
Recall = TP / (TP + FN)    (5)
Specificity = TN / (TN + FP)    (6)
JS = TP / (TP + FP + FN)    (7)

For comparing the confidence level of the prediction, we have chosen the best two models, i.e., SSDMNV2 and ResNet-SVM, as per Table 4. Here we have used several unusual and critical environmental conditions, such as low-quality images, side-face images and hand-covered images, for the comparison. Table 5 clearly demonstrates the superiority of the proposed approach in all those conditions. 400 blurred, 50 side-face and 200 hand-covered images have been experimented with. SSDMNV2 and ResNet-SVM have correctly identified 174 and 351 blurred images, respectively, while our proposed model has predicted 389 successfully. Similar results have been observed for the side-face and hand-covered categories, as shown in Table 5. Figures 11 and 12 show the confidence level of prediction for the cases where the state-of-the-art techniques predicted correctly. For example, as shown in Fig. 11, among SSDMNV2's 174 correct predictions, the confidence value of our proposed model's prediction is higher for 71% of those images. For side-face and hand-covered images, our model has beaten SSDMNV2 by a 100% and 97% margin, respectively, in terms of confidence level. The same observation is applicable to the ResNet-SVM model for these same scenarios, as depicted in Fig. 12. Figure 13 shows a few sample test images where the recommended approach predicted correctly but the other existing models failed. Figure 14 shows one sample testing flow with four passengers inside a car, all of whom used masks while traveling. The confidence level of prediction appears with the bounding box. The back-seat passengers' tiny faces have also been captured and their mask-wearing status predicted successfully. Figure 15 shows the stepwise results for these best-performing algorithms in various road conditions with the confidence level of successful predictions. Those conditions have been mentioned explicitly in the first column of the same Fig. 15.
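The end-to-end testing flow illustrated in Figs. 7 and 14 — detect faces, resize each face to 224 × 224, classify it, then check the passenger count against the limit — can be sketched as below. The helper names (`detect_faces`, `rho`), the 1/255 rescaling and the class-index ordering are assumptions made for illustration, not identifiers or details taken from the paper:

```python
import cv2
import numpy as np

def surveil_frame(frame, mask_model, detect_faces, rho=3):
    """Hedged sketch of the per-frame surveillance flow (cf. Algorithm 1 and Fig. 7).

    frame        : vehicle image I of shape (M, N, 3)
    mask_model   : trained Inception-V3 transfer-learning classifier
    detect_faces : tiny-face detector returning (x, y, w, h) boxes -- hypothetical helper
    rho          : passenger limit, e.g. 3 for a four-wheeler LMV including the driver
    """
    boxes = detect_faces(frame)                                  # one box per detected face
    statuses = []
    for (x, y, w, h) in boxes:
        face = cv2.resize(frame[y:y + h, x:x + w], (224, 224))   # resize to the fixed input size
        conf = mask_model.predict(face[np.newaxis] / 255.0, verbose=0)[0]
        statuses.append("masked" if conf.argmax() == 0 else "non-masked")  # class order assumed

    violations = {
        "passenger_limit_exceeded": len(boxes) > rho,
        "unmasked_passengers": statuses.count("non-masked"),
    }
    return statuses, violations
```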
In the zero-shot evaluation process, we have used the data set experimented with by Nagrath et al. (2020). We have tested on 10,212 images in this data set, with 5520 non-masked and 4692 masked images, respectively. Our model has successfully predicted 5511 non-masked and 4336 masked images. Collectively, 9847 (96.4%) successful predictions is a remarkable figure without deploying any training, while SSDMNV2 has achieved 92.6% accuracy after its training phase. Figure 16 shows some sample evaluation output over the mentioned data set; Fig. 16a and b show a few masked and non-masked outputs, respectively.

The face detection module has taken an almost uniform time (around 3 s) for any number of human faces. However, the mask detection process is conducted sequentially for every face image, so its runtime grows linearly with the number of faces present in a single image. The average time taken for this process is around 0.5 s per human face. The time graph is shown in Fig. 17. So, generally, the surveillance framework will take around 4 s to complete the process for a standard vehicle with 3 passengers (maintaining the social-distancing norm). The timing measurement has been performed on the same Google Colab setup as mentioned in Sect. "Experimental Setup". With a Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU) system, the timing can be improved further.

The advantages of Transfer Learning have been exploited to develop a new kind of traffic surveillance which is extremely useful during COVID-19-like pandemic situations. Our proposed network has outperformed all the other state-of-the-art techniques in terms of several metrics. The zero-shot evaluation has reached 96.4% accuracy as well, which supports the potency of the technique. This framework will also be applicable to similar contagious diseases where a mask is compulsory. Considering the post-COVID-19 urgency, different traffic rules need to be imposed on the transport system, and the surveillance strategy should also be modified accordingly. We have demonstrated an efficient way to identify the on-road rule-breakers. The developed deep model has achieved high accuracy in different conditions. As a scope of future research, one can work to enhance the training data set from different perspectives related to mask-wearing. Designing a proper network to deal with hazy and rainy images is another future scope. Researchers can apply better deep learning based tracking mechanisms for the surveillance. Hyper-parameter optimization is another area to work on. For handling the night-mode on-road images more accurately, a large number of such samples can be generated using Generative Adversarial Networks. For managing the varying heights of several types of vehicles, one can use a multi-camera setup. In addition, the multi-view of a specific vehicle will help to provide higher confidence in the outcome. Real field testing with this type of multi-camera setup will be another challenging task. Currently, the authors are working in these directions.
References
Traffic management for drones flying in the city
Intelligent vehicle counting and classification sensor for real-time traffic surveillance
A smart camera for the surveillance of vehicles in intelligent transportation systems
Cross-domain MLP and CNN transfer learning for biological signal processing: EEG and EMG
Face mask assistant: Detection of face mask service stage based on mobile phone
ImageNet: A large-scale hierarchical image database
Scalable multi-label annotation
Implementation of principal component analysis on masked and non-masked face recognition
Comparison of image classification and object detection for passenger seat belt violation detection using NIR & RGB surveillance camera images
Understanding deep learning techniques for image segmentation
Multiple moving targets surveillance based on a cooperative network for multi-UAV
Social distancing in public transport: mobilising new technologies for demand management under the COVID-19 crisis
Automatic traffic surveillance system for vehicle tracking and classification
Finding tiny faces
Vehicle tracking and speed estimation from traffic videos
Advances in face detection and facial image analysis
ImageNet classification with deep convolutional neural networks
Deep learning
A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic
Real-time vehicle make and model recognition system
Video surveillance-based intelligent traffic management in smart cities
SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2
System for medical mask detection in the operating room through facial attributes
Markov decision process-based distributed conflict resolution for drone air traffic management
A survey on transfer learning
Transfer learning in collaborative filtering for sparsity reduction
A real-time video surveillance system for traffic pre-events detection
Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques
Coronavirus disease (COVID-19) prevention and treatment methods and effective parameters: A systematic literature review
ImageNet large scale visual recognition challenge
Implementing a real-time, AI-based, people detection and social distancing measuring system for COVID-19
Deep learning in neural networks: an overview
FaceNet: A unified embedding for face recognition and clustering
Very deep convolutional networks for large-scale image recognition
Going deeper with convolutions
Rethinking the inception architecture for computer vision
Rethinking the inception architecture for computer vision
Vehicle detection and recognition for intelligent traffic surveillance system
Masked face recognition dataset and application
Individual comparisons by ranking methods
Face recognition in unconstrained videos with matched background similarity
Towards improving quality of video-based vehicle counting method for traffic flow estimation
A vision-based social distancing and critical density detection system for COVID-19
Anomaly detection in traffic scenes via spatial-aware motion reconstruction
Vehicle detection in urban traffic surveillance images based on convolutional neural networks with feature concatenation