title: CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification
authors: Protasov, Saian; Karpyshev, Pavel; Kalinov, Ivan; Kopanev, Pavel; Mikhailovskiy, Nikita; Sedunin, Alexander; Tsetserukou, Dzmitry
date: 2021-10-22

All authors are with the Space Center of Skolkovo Institute of Science and Technology, Moscow, Russia. ivan.kalinov@skolkovotech.ru, {saian.protasov, pavel.karpyshev, pavel.kopanev, nikita.mikhailovskiy, alexander.sedunin, d.tsetserukou}@skoltech.ru

Abstract—Mobile autonomous robots include numerous sensors for environment perception. Cameras are an essential tool for the robot's localization, navigation, and obstacle avoidance. To process the large flow of data from these sensors, it is necessary either to optimize the algorithms or to provide substantial computational power. In our work, we propose an algorithm for accelerating a neural network for object detection using preliminary binary frame classification. An autonomous outdoor mobile robot with six rolling-shutter cameras mounted around the perimeter, providing a 360-degree field of view, was used as the experimental setup. The experimental results show that the proposed optimization reduces the inference time of the neural network in the cases where no more than five of the six camera frames contain target objects.

I. INTRODUCTION

During recent years, the e-commerce market has been growing at an astounding speed. According to [1], it is expected to exceed EUR 2.5 trillion by the year 2023. Such rapid growth has inevitably driven the growth of the global delivery market. One of the key parts of delivery services is last-mile delivery from the nearest warehouse or retail store directly to the customer's door. This part of the delivery process directly influences customer satisfaction, which is essential for logistics companies' success. Last-mile delivery has typically been performed using vehicles. However, due to traffic congestion, vehicle delivery times have increased significantly [2], forcing logistics providers to develop new last-mile delivery methods that depend less on the traffic situation. For example, the combination of delivery vans and electric bikes has shown a significant increase in delivery efficiency [3].

The COVID-19 pandemic of 2020 created a rapidly growing trend towards minimizing personal contact and towards robotic automation in all areas of human life and industry, e.g., stocktaking automation in warehouses [4]-[7], shopping malls [8], disinfection of hospitals and offices [9], [10], and autonomous chargers [11]. For epidemiological safety, and owing to rapid advances in automation technology, various contactless last-mile delivery methods are currently being developed. According to [12], up to 80% of Business-to-Customer (B2C) deliveries can be automated.

Multiple large companies, such as Amazon, Google, DHL, and UPS, are already testing automated delivery using Unmanned Aerial Vehicles (UAVs). However, the use of UAVs is limited by their low lifting capacity, range, and cost-effectiveness. Aerial vehicles can also become a threat in case of malfunction, and they are prohibited in urban areas of certain countries. For these reasons, the use of Autonomous Ground Vehicles (AGVs) has been extensively researched as well.
Several companies and research teams, such as Starship Technologies, the ANYmal team, and FedEx, are already developing ground robots for last-mile delivery. The characteristics of the aforementioned robots make them well suited for short-distance deliveries in urban areas. Namely, the Starship Technologies robot is a six-wheeled platform that weighs around 45 kilograms and is capable of moving within a 5-kilometer radius at pedestrian speed with 2.6 kg of payload [13]. Other approaches to autonomous last-mile delivery were introduced by the ANYmal team with their four-legged robot and by the FedEx Bot.

Despite the number of robots currently under development, autonomous last-mile delivery AGVs still face numerous difficulties. Such machines need robust and precise localization technologies, as well as reliable obstacle detection and avoidance algorithms. For obstacle detection and avoidance, the robot needs precise and real-time information about its surroundings. This can be achieved using visible- and infrared-spectrum cameras, LiDARs, ultrasonic distance sensors, and other types of data acquisition devices. The data flow from all these devices is large even for desktop computers, not to mention small and low-power mobile devices. Increasing the computational power of a mobile robot results in increased energy consumption, and thus in limited operation time, working range, and cost efficiency. Because of this, mobile robots are often subject to a trade-off between accuracy and working time.

A particular example of such a task is pedestrian detection. For this purpose, robots are usually equipped with multiple visible-range cameras facing in all directions, allowing for a 360-degree field of view to detect surrounding pedestrians and perform the necessary avoidance maneuvers. For example, the robot by Starship Technologies is equipped with nine cameras: three facing forward, two facing backward, and four on the sides. These cameras create a data flow which is extremely difficult for mobile computing units to process.

Object detection algorithms are widely used in the area of mobile robotics. The objects to be detected typically depend on the task that the robot is performing. For example, P. Karpyshev et al. [14] successfully used the Mask R-CNN neural network for apple tree disease detection. I. Kalinov et al. [5] presented a heterogeneous robotic system with a different task: the detection of barcodes during warehouse stocktaking using UAVs. In this case, the authors used the U-Net convolutional neural network for the detection and semantic segmentation of barcodes. Patrick K. Chemeli et al. [15] used the Single Shot Detector to locate and identify objects using a robotic arm. None of these works, however, addresses the optimization of the neural network inference itself: the robots either had few cameras, or there were enough computing resources for prompt detection, such as dedicated desktop-grade GPUs. Bernd Poppinga et al. [16] offered an ultra-lightweight architecture for object detection during robotic soccer competitions; however, such architectures are usually tailored to very specific tasks and operating conditions. In this article, we propose a method to increase the speed of an object detection neural network for a robot with a large number of cameras.

The efficiency of neural networks in detection tasks is well established, and a large number of different neural network architectures for object detection currently exist.
All detectors can be divided into two types: two-stage and one-stage detectors. One of the most popular two-stage architectures for object detection is Faster R-CNN [17], which continues the ideas of Fast R-CNN [18] and R-CNN [19]. The Faster R-CNN architecture operates as follows: an image is fed into a convolutional neural network, which produces a feature map. The feature map is then processed by the Region Proposal Network (RPN) layer: a sliding window is traversed over the feature map, and the center of the sliding window is linked to the centers of the anchors, candidate regions with different aspect ratios and different sizes. The authors use three aspect ratios and three scales. Based on the intersection-over-union (IoU) metric, which measures the degree of overlap between anchors and ground-truth bounding boxes, a decision is made for each region as to whether it contains an object or not (a minimal sketch of the IoU computation is given below). Next, the Fast R-CNN stage is applied: the feature map with the proposed regions is passed to the RoI layer, followed by processing with fully connected layers that perform classification and regress the offsets of the candidate object regions.

One-stage detectors work in a different way. The class and coordinates are predicted in one step from anchor bounding boxes that densely cover the image at different scales and aspect ratios. Representatives of this class are SSD [20] and YOLO [21]. A one-stage detector requires only one pass through the neural network and predicts all bounding boxes in one go, which makes it much faster than two-stage detectors and more suitable for mobile devices. In the article by Jonathan Huang et al. [22], multiple one-stage and two-stage detectors are compared using different metrics of accuracy and speed of execution (Fig. 2). The authors conclude that SSD with MobileNet provides the best accuracy among the fastest detectors, and that R-FCN [23] and SSD are faster on average but cannot outperform Faster R-CNN in accuracy if speed is not the main goal.

This paper aims to improve the efficiency of object detection on systems with multiple cameras and limited computational power, i.e., mobile delivery robots. In the scope of this article, we use the HermesBot delivery robot, which includes six rolling-shutter cameras for obstacle and pedestrian detection, as the hardware platform for the experiments. We propose to use an additional lightweight classification network before the detection network in order to skip frames with no target objects present. In most situations, fewer computations will then have to be performed to obtain reliable information about the target objects around the robot. The performed experiments have shown that if at least one of the six camera frames contains no objects of interest, the proposed algorithm yields a significant increase in processing speed.

Our team has developed a platform for research on autonomous delivery robot software. A rendering of the robot is presented in Fig. 3. The platform is equipped with all hardware components necessary for outdoor movement, as well as software modules for movement control, path planning, SLAM, and obstacle detection and avoidance. An in-depth description of the robot is presented in the sections below.
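To make the anchor assignment criterion above concrete, the following is a minimal sketch of the IoU computation referred to in the Faster R-CNN discussion. The (x1, y1, x2, y2) corner box format and the function name are illustrative assumptions rather than details of the original implementation.

def iou(box_a, box_b):
    # Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # The intersection area is zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: the RPN in [17] treats an anchor as a positive sample
# when its IoU with a ground-truth box exceeds 0.7.
anchor = (50, 50, 150, 150)
ground_truth = (60, 60, 160, 160)
is_positive = iou(anchor, ground_truth) > 0.7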
The robot localization is mainly performed using two sets of RealSense cameras on the front and rear sides of the platform: a RealSense D435 depth camera capable of acquiring 1280x720 depth data at up to 90 fps, and a RealSense T265 tracking camera with dual fisheye lenses and a high-precision IMU for acquiring high-quality visual odometry. Pedestrian detection is performed using six RasPi NoIR V2 cameras located on both sides of the robot, facing sideways and at angles of 45 degrees, so that all directions are covered. Each camera has a maximum resolution of 3280 x 2464 and a 62.2-degree field of view. This allows for a 360-degree FoV and guarantees robust detection of stationary and moving obstacles. Close-range obstacle detection is performed using eight ultrasonic rangefinders operating at 10 Hz. Additional localization data is acquired from the Garmin GNSS receiver. Wheel odometry (WO) is collected using encoders with 3840 counts per revolution. A VLP-16 3D LiDAR is installed on top of the platform for ground-truth dataset collection. The robot also includes light sources. All the calculations and robot control are performed using an Intel NUC8i7BEH computing module with an Intel Core i7-8559U processor and 32 GB of RAM. Image processing and pedestrian detection are performed on an NVIDIA Jetson AGX Xavier computing unit with a 512-core Volta GPU containing Tensor Cores.

Fig. 3. Hardware equipment of HermesBot: 1. Velodyne VLP-16 3D LiDAR; 2. RealSense D435 depth camera; 3. RealSense T265 tracking cameras with IMU; 4. Vent; 5. Ultrasonic rangefinders SR04Tv3-US; 6. Wheel encoders with 3840 counts per revolution; 7. Rolling-shutter RasPi NoIR V2 camera with Sony IMX219 8-megapixel sensor; 8. Garmin GNSS receiver.

The whole system architecture was developed with the use of the Robot Operating System (ROS). A schematic representation of the software architecture is presented in Fig. 4. The data from all sensors is sent to the Intel NUC computing module, where it is preprocessed and sent to the dataset collection module and the respective computing modules depending on the type of sensor. Wheel odometry, IMU, GNSS, and tracking camera data are sent to the Perception module. Data from the depth and RGB cameras is sent to the Jetson computing module for processing. This data, along with the ultrasonic sensor readings, is also transferred to the Point Cloud Maker module, where the robot's surroundings are modelled. All the processed data is then sent to the Mapping module, which creates the complete map of the robot's surroundings. This data is processed by the Local Path Planner module and, along with the collision avoidance and Global Planner data, is sent to the PID Control module, which in turn directly controls the wheels.

For pedestrian detection, we use the SSD [20] architecture with EfficientNet-B0 [24] as the feature extraction backbone. These architectures were chosen because of their precision and computational efficiency. The SSD architecture is based on a simple feed-forward neural network that creates a set of fixed-size bounding boxes and scores each box for the presence of class instances. Fig. 5 shows the operating principles of SSD. First, the image is passed through a feature extractor (backbone), which extracts features at various convolutional layers. To acquire more spatial information about the image, additional convolutional layers extract further features for the detection block. This architecture does not use a linear classification head because classification is performed along with the detection.
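As an illustration of the multi-scale detection structure described above, the following is a minimal PyTorch-style sketch of per-scale classification and localization convolutions applied to backbone feature maps. The channel counts, prior-box count, and class name TinySSDHead are illustrative assumptions and do not reproduce the exact layer configuration used on the robot.

import torch
import torch.nn as nn

class TinySSDHead(nn.Module):
    # Illustrative SSD-style head: per-location class scores and box offsets
    # are predicted by plain convolutions on several feature maps.
    def __init__(self, in_channels=(112, 320), num_priors=4, num_classes=2):
        super().__init__()
        # One classification conv and one localization conv per feature scale.
        self.cls_convs = nn.ModuleList(
            nn.Conv2d(c, num_priors * num_classes, kernel_size=3, padding=1)
            for c in in_channels)
        self.loc_convs = nn.ModuleList(
            nn.Conv2d(c, num_priors * 4, kernel_size=3, padding=1)
            for c in in_channels)

    def forward(self, feature_maps):
        cls_out, loc_out = [], []
        for feats, cls_conv, loc_conv in zip(feature_maps, self.cls_convs, self.loc_convs):
            cls_out.append(cls_conv(feats))   # class confidences per prior box
            loc_out.append(loc_conv(feats))   # (dx, dy, dw, dh) offsets per prior box
        return cls_out, loc_out

# Two dummy feature maps of different spatial resolutions, standing in for
# outputs of the backbone and of the extra convolutional layers.
feats = [torch.randn(1, 112, 19, 19), torch.randn(1, 320, 10, 10)]
cls_scores, box_offsets = TinySSDHead()(feats)

In the full detector, the raw scores and offsets from every scale are decoded against the corresponding prior boxes and then filtered, as described next.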
All the features extracted at the various convolutional layers are transferred to the object detection block. The detection is also carried out using convolutional layers. First, the features pass through two convolutional layers in parallel: one predicts the class scores for each prior box, and the other regresses the coordinate offsets of each prior box. In this way, numerous windows of various sizes and shapes are formed over the image. Each of these windows is assumed to contain at most one object, and a classifier is used to obtain the likelihood of each class. Once the detector generates a large number of bounding boxes, they must be reduced to only one per object in the image. Non-Maximum Suppression (NMS), which is essentially a form of clustering, is the most commonly used approach for this task [25].

EfficientNet [24] is one of the fastest state-of-the-art feature extraction backbones for classification models. The method behind this family of networks is based on the joint scaling of three parameters: the depth of the layers, the number of input and output channels, and the spatial size. Previously developed methods scaled the dimensions of a neural network arbitrarily (e.g., the number of layers and parameters), whereas EfficientNet scales all parts of the network evenly with fixed scaling factors. The efficiency of such scaling is directly influenced by the architecture of the network. To improve the performance of the neural network, the authors proposed to choose the initial architecture automatically using a neural architecture search framework. As a result, the initial model uses an architecture with inverted residual blocks, similar to MobileNetV2 [26] and MnasNet [27]. The initial model was then scaled up, creating the class of models called "EfficientNets". EfficientNet-B0, chosen as the backbone in this article, has a much smaller number of parameters, which allows it to run on mobile platforms.

In our work, the SSD was trained using the PASCAL VOC dataset, which, along with the COCO dataset, is considered one of the classic datasets for the detection and classification of pedestrians. Of all the classes in the dataset, only the class "person" was used. In future work, other classes, such as various animals, can also be added to the training, since animals can appear on roads as well as humans. The architecture of this neural network was used as the baseline, and the performance of our modifications will be compared with this default model.

Despite being state-of-the-art in computational efficiency, the chosen network architectures still struggle to provide real-time information on detected pedestrians, especially when the images have to be analyzed six times, once for each camera installed on the robot. To improve the speed of the human detection algorithm, we propose an approach that eliminates the need to process a frame if no objects of interest are present in it. Since the situation in which all the robot's cameras detect pedestrians at the same time is relatively infrequent, this approach significantly improves the speed of the detection algorithm. To achieve this, we added an additional classification head prior to the extra feature extraction convolutional layers. This classifier is trained to distinguish two types of images: those containing humans and empty ones. After this classification, only the images classified as containing humans are sent to the SSD for detection (Algorithm 1).
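Algorithm 1 itself is not reproduced here, so the following is a minimal sketch of the per-frame gating logic it describes, assuming a PyTorch implementation. The module names (backbone, binary_classifier, detection_head) and the confidence threshold are illustrative assumptions rather than the exact interfaces used in our code.

import torch

@torch.no_grad()
def detect_frame(image, backbone, binary_classifier, detection_head, threshold=0.5):
    # Shared feature extraction is always executed.
    features = backbone(image)
    # Lightweight binary head: probability that the frame contains a person.
    person_prob = torch.softmax(binary_classifier(features), dim=1)[:, 1]
    if person_prob.item() < threshold:
        # Frame skipped: the extra feature layers and detection convolutions are not run.
        return []
    # Full SSD detection is executed only for frames flagged as containing a person.
    return detection_head(features)

# In each sensing cycle, the six camera frames are processed independently,
# so only the frames classified as containing humans pay the detection cost.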
It is known that the computational complexity of 2D convolutional layers depends on the number of tensor channels d, the kernel size k, and the sequence length n, i.e., O(k · n · d^2) [28]; therefore, such a modification should save computing power, since neither the additional convolutions for extra features nor the convolutions for detection need to be applied to images that contain no objects of interest. Fig. 6 shows that the classification head is located immediately after the convolutional layers of the original backbone and is structurally similar to the original EfficientNet-B0 classifier. The classification head consists of a 1x1 convolutional layer, global pooling, and a dense layer with two output channels, which corresponds to the binary classification task. The original EfficientNet-B0 has an input image size of 224x224, so the tensor entering the classification head has spatial dimensions of 7x7. SSD requires 300x300 images on input, so the corresponding tensor before classification has different dimensions, 10x10.

To train the SSD, the MultiBox loss was used, which is a linear combination of two losses, as described in [20]: the cross-entropy loss for classification and the usual smooth L1 loss for convergence on the bounding boxes,

L_MultiBox = (1/N) (L_conf + α L_loc),

where N is the number of matched prior boxes, L_conf is the cross-entropy confidence loss, L_loc is the smooth L1 localization loss, and α is a weighting coefficient. In our work, the image must be binary-classified before entering the detection stage, so we need an additional loss term for the neural network to be able to learn this task. We decided not to deviate from the pipeline and again use a linear combination of the classification loss and the MultiBox loss. The resulting loss is

L = L_MultiBox + β L_binary,

where L_binary is the binary cross-entropy loss for the classification of human and non-human cases, and β is a hyperparameter that balances the magnitudes of the two loss terms.

The training of the SSD and the modified SSD was carried out for 300 epochs. The initial learning rate was equal to 1e-3 and was reduced by a factor of 10 at epochs 200 and 270. The SGD optimizer was used with momentum 0.9 and weight decay 5e-4.

During the experiments, seven cases were examined, with 0 to 6 of the camera frames containing target objects. For each of the cases, the experiment went as follows: a sequence of 300 images (50 for each camera), containing or not containing target objects, was sent to the input of the neural network, and the computation time was then averaged and compared with the computation time of the same images using the traditional network architecture. The experiments were conducted on a desktop computer with an NVIDIA GTX 1070 GPU and on the Jetson AGX Xavier computing module in order to compare desktop and mobile computation times. The time taken to transfer images from RAM to GPU memory was not taken into account; only the GPU execution time was considered. The results of the experiments are presented in Table 1 and Fig. 7. The preliminary binary classification approach showed a significant decrease in the computation time for all cases except the one with all images containing target objects. Although the processing time increases slightly in that case, the significant decrease in the other cases indicates that in typical environments the use of this method will substantially reduce the mean time for image processing.

We presented a novel approach to improving the computational efficiency of detection neural networks used in multi-camera setups. A lightweight classification network was implemented after the feature extraction step and before the detection step of the SSD neural network.
A decision algorithm was implemented in order to skip images in which no target objects were detected by the preliminary classifier. A set of experiments covering different cases of target object presence was conducted to validate the computational efficiency of the proposed approach. The experimental results show that in most cases the proposed algorithm significantly reduces the computational cost of object detection. The gain in computational efficiency exceeded 24% when no objects were present in the frames, and performance decreased by only 6% when all cameras included target objects. This method is also suitable for other detection architectures in which a classification network is used as the feature extractor.

VI. FUTURE WORK

Our future research will be devoted to determining the density of people around the robot in typical urban environments, such as city centers, business districts, residential areas, and industrial areas. The varying pedestrian density in these areas may demonstrate the efficiency of the proposed method in particular scenarios. It is also planned to test our method with other important classes of objects to detect (e.g., bicycles, e-scooters, dogs, cars).

REFERENCES

Building a collaborative solution in dense urban city settings to enhance parcel delivery: An effective crowd model in Paris
Evaluating the impacts of using cargo cycles on urban logistics: Integrating traffic, environmental and operational boundaries
High-precision UAV localization system for landing on a mobile collaborative robot based on an IR marker pattern recognition
WareVision: CNN barcode detection-based UAV trajectory optimization for autonomous warehouse stocktaking
Impedance-based control for soft UAV landing on a ground robot in heterogeneous robotic system
WareVR: Virtual reality interface for supervision of autonomous robotic system aimed at warehouse stocktaking
Customer behavior analytics using an autonomous robotics-based system
UltraBot: Autonomous mobile robot for indoor UV-C disinfection
UltraBot: Autonomous mobile robot for indoor UV-C disinfection with non-trivial shape of disinfection zone
DeltaCharger: Charging robot with inverted delta mechanism and CNN-driven high-fidelity tactile perception for precise 3D positioning
Forget drones, here come delivery robots
Autonomous mobile robot for apple plant disease detection based on CNN and multi-spectral vision system
Real-time object detection using Single Shot MultiBox Detector network for autonomous robotic arm
JET-Net: Real-Time Object Detection for Mobile Robots
Faster R-CNN: Towards real-time object detection with region proposal networks
Rich feature hierarchies for accurate object detection and semantic segmentation
SSD: Single Shot MultiBox Detector
YOLOv3: An incremental improvement
Speed/accuracy trade-offs for modern convolutional object detectors
R-FCN: Object detection via region-based fully convolutional networks
EfficientNet: Rethinking model scaling for convolutional neural networks
Learning non-maximum suppression
Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation
MnasNet: Platform-aware neural architecture search for mobile
Attention is all you need