key: cord-0613343-z7bntqb2 authors: Hsu, Hung-Min; Yuan, Xinyu; Zhu, Baohua; Cheng, Zhongwei; Chen, Lin title: Package Theft Detection from Smart Home Security Cameras date: 2022-05-24 journal: nan DOI: nan sha: 5fdc8bc92f9b2d81cb1192311b5f426292562ffe doc_id: 613343 cord_uid: z7bntqb2

Package theft detection is a challenging task, mainly because training data are scarce and real-world package theft cases vary widely. In this paper, we propose a new Global and Local Fusion Package Theft Detection Embedding (GLF-PTDE) framework that generates a package theft score for each segment of a video, meeting the real-world requirements of package theft detection. Moreover, we construct a novel Package Theft Detection dataset to facilitate research on this task. Our method achieves 80% AUC on the newly proposed dataset, showing the effectiveness of the GLF-PTDE framework and its robustness across different real scenes.

During the COVID-19 pandemic, a significant increase in online shopping drove rapid growth in e-commerce. According to the 2020 package theft statistics report (footnote 1), the package theft rate increased from 36% in 2019 to 43% in 2020. In particular, 64% of respondents have had packages stolen more than once. Approximately 144 million customers were affected, with an average loss of 106 USD per household. The motivation of this work is to build an effective and practical solution for detecting package theft events by leveraging widely adopted security cameras. An intelligent computer vision system for automatic package theft detection (PTD) is needed to save society labor and time. Inspired by anomaly detection systems, the goal of PTD can be framed as extracting the patterns of package-stealing behavior within a specific time window.
In other words, PTD is a special case of video understanding that can be handled by differentiating package-stealing behavior from normal patterns. Real-world package theft events are complicated because environments and human behavior are diverse. It is impractical to enumerate all possible package theft patterns. Therefore, it is essential to detect abnormal package pickup behavior rather than to model prior information about theft behaviors.

Footnote 1: https://www.crresearch.com/blog/2020-package-theft-statistics-report

Training a neural network for PTD is challenging in practice. It is not reliable to train only on labeled normal patterns and then treat any pattern in real-world videos that deviates from the trained model as a package theft event. Specifically, it is impossible to predefine all possible package pickup and delivery behaviors. Moreover, the boundary between normal and anomalous pickup or delivery behavior is ambiguous, so relying purely on the training data to detect package theft is questionable. In this paper, we propose a novel package theft detection framework that uses weakly labeled training videos and human pose information. Our contributions are summarized as follows:
• We propose a PTD system built on a new Global and Local Fusion Package Theft Detection Embedding (GLF-PTDE) framework.
• We introduce human pose information into the PTD solution to improve performance.
• We build a novel package theft detection dataset to facilitate PTD research.
Experimental results show that the proposed neural network can identify the package theft segments in a video by their high anomaly scores.

The work most closely related to PTD is anomaly detection in the computer vision community, including human behavior and traffic monitoring [1]. Owing to the success of deep learning, a large number of neural networks have been proposed for various application tasks.
For example, [2] applies 3D deep neural networks to anomaly detection and localization in crowded scenes. The most popular anomaly detection scenario is the detection of human violence or abnormal events in crowd scenes. [3] proposes a neural network that generates anomaly scores for videos, and [4] uses a double fusion framework to integrate appearance and motion features for anomaly detection.

First, the input video is split into small segments, and appearance features and human pose features are generated for each frame. Next, a concatenation operation combines these two features as the input to C3D, which generates the Global and Local Fusion Package Theft Detection Embedding (GLF-PTDE). Finally, three FC layers generate the package theft confidence score for each segment.

For global feature extraction, we use the whole image sequence as input; the result is called the Global Package Theft Detection Embedding (G-PTDE). G-PTDE considers both foreground and background information to determine whether a segment shows package stealing. Thus, we can express the G-PTDE $f_{G\text{-}PTDE}$ as:

$$f_{G\text{-}PTDE} = \phi(X) \in \mathbb{R}^{D},$$

where $\phi$ denotes the C3D convolution operation [5], which aggregates temporal information into the segment-level embedding, and $D$ is the dimension of the C3D output. The video is divided into several segments, and $X$ is the set of appearance features of the images in a segment. Therefore, $f_{G\text{-}PTDE}$ is a segment-level embedding, and for each segment we generate the corresponding PTD score.

Our framework considers not only the appearance embedding but also human pose features, since human pose can be used to detect package theft semantically. We use OpenPose [6], trained on the COCO dataset, to obtain the human pose information. The inferred 2D human pose keypoints (body joints) are referred to as the Local Package Theft Detection Embedding (L-PTDE).
Unlike G-PTDE, which also considers background information, L-PTDE represents foreground information and therefore attends only to local information. The L-PTDE feature consists of the 18 joints of each identity in each frame, so its dimension is $18 \times 3$, where the 3 comes from the $(x, y)$ 2D joint coordinates and the confidence score. Following the same processing as for $f_{G\text{-}PTDE}$, the video is divided into several segments. Let $T$ be the length of the video, $L$ the segment size, $S = T/L$ the number of segments, and $D$ the dimension of the segment-level feature. The segment-level L-PTDE is denoted $f^{\text{segment}}_{K}$.

Most anomaly detection systems focus only on the entire image, which is easily influenced by background information. The obvious flaw is that human pose information is missing. Package theft detection should pay more attention to walking posture in order to enhance the foreground information. In our system, we combine the global and local features into a robust embedding $f_{GLF\text{-}PTDE}$ by concatenating the two features as the input to the C3D network, so that we involve both the original whole-image features from the entire video sequence and the features of different human joints, e.g., head, shoulders, hips, and legs. Fig. 1 shows the entire GLF-PTDE pipeline. In the definition of the proposed GLF-PTDE $f_{GLF\text{-}PTDE}$, $f^{\text{segment},i}_{K}$ denotes the human pose feature; $\phi_{GLF}$ is a C3D convolution operation for global and local information fusion; $\oplus$ represents the concatenation operation; $i$ denotes the segment index; $S$ is the number of segments; and $D$ is the dimension of GLF-PTDE.

As with anomaly detection, defining package theft is difficult and subjective. Moreover, the training data are not representative enough to cover all package theft patterns.
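The segmentation and concatenation steps described above can be illustrated with a minimal NumPy sketch. It assumes one person per frame, a hypothetical per-frame appearance feature of dimension `D_APP`, and that trailing frames which do not fill a whole segment are dropped; none of these details are stated in the text, and the C3D operation itself is not modeled here.

```python
import numpy as np

NUM_JOINTS = 18             # joints per person, as stated in the text
POSE_DIM = NUM_JOINTS * 3   # (x, y, confidence) per joint -> 54 values


def split_into_segments(frames: np.ndarray, segment_len: int) -> np.ndarray:
    """Split per-frame features (T, D) into S = T // L non-overlapping segments.

    Assumption: frames left over after the last full segment are dropped.
    """
    num_segments = frames.shape[0] // segment_len
    usable = frames[: num_segments * segment_len]
    return usable.reshape(num_segments, segment_len, frames.shape[1])


def fuse_global_local(appearance: np.ndarray, pose: np.ndarray,
                      segment_len: int) -> np.ndarray:
    """Concatenate per-frame appearance and pose features, then segment them.

    appearance: (T, D_app) hypothetical per-frame appearance features.
    pose:       (T, 18, 3) keypoints for one person per frame (assumption).
    Returns an array of shape (S, L, D_app + 54).
    """
    if appearance.shape[0] != pose.shape[0]:
        raise ValueError("appearance and pose must cover the same frames")
    pose_flat = pose.reshape(pose.shape[0], POSE_DIM)
    fused = np.concatenate([appearance, pose_flat], axis=1)  # concatenation step
    return split_into_segments(fused, segment_len)


# Illustrative sizes only: T = 100 frames, L = 16 frames per segment.
rng = np.random.default_rng(0)
T, L, D_APP = 100, 16, 128
segments = fuse_global_local(rng.random((T, D_APP)), rng.random((T, 18, 3)), L)
print(segments.shape)  # (6, 16, 182)
```

Each of the `S` rows of `segments` would then be the input from which a fusion network produces one segment-level embedding.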
Thus, we define package theft detection as low-likelihood pattern detection rather than as a binary classification problem. Inspired by [3], we treat PTD as a regression problem: we aim for the package-stealing segments to receive the highest anomaly scores. For the regression problem, we have to generate the embedding for the regression head. Since we want package-stealing video segments to score highest, it is straightforward to train the embedding with a ranking loss:

$$f(S_{pt}) > f(S_{n}),$$

where $S_{pt}$ and $S_{n}$ represent package theft and normal video segments, and $f(S_{pt})$ and $f(S_{n})$ represent the corresponding predicted scores. The ranking function must be modified to fit the segment-level annotations; therefore, the rank function is defined as follows:

$$\max_{i \in B_a} f(S^{i}_{pt}) > \max_{i \in B_a} f(S^{i}_{n}),$$

where $B_a$ denotes the set of instances in the dataset. The loss function follows the temporal structure of package theft videos: real-world package stealing occurs within a short period of time, so the package theft scores are sparse. Moreover, because the input video is split into many segments, the package theft scores of consecutive segments should vary smoothly. Therefore, similar to [3], we minimize the difference between the scores of adjacent video segments to maintain temporal smoothness. In the loss function, $\max\left(0,\, 1 - \max_{i \in B_a} f(S^{i}_{pt}) + \max_{i \in B_a} f(S^{i}_{n})\right)$ is the ranking loss for multiple instance learning.

No existing dataset is specific to package theft detection. We construct a new one for evaluation, which consists of 120 surveillance videos with an average length of about 52 seconds (28.8 frames per second) and covers four categories: package theft, normal pickup, normal delivery, and irrelevant.
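The training objective described above (a multiple-instance ranking term plus temporal smoothness and sparsity constraints) can be sketched in NumPy as follows. The hinge term is taken from the text. The exact form of the two constraints is an assumption borrowed from the formulation in [3]: a squared difference of adjacent scores and a sum of scores, both applied to the package theft video. The function name and the default weights (set to the $8 \times 10^{-5}$ value reported for both constraints) are illustrative.

```python
import numpy as np


def ptd_ranking_loss(theft_scores, normal_scores,
                     lam_smooth=8e-5, lam_sparse=8e-5):
    """Ranking loss with smoothness and sparsity terms (assumed form, after [3]).

    theft_scores:  predicted scores f(S_pt^i) for the segments of a theft video.
    normal_scores: predicted scores f(S_n^i) for the segments of a normal video.
    """
    theft = np.asarray(theft_scores, dtype=float)
    normal = np.asarray(normal_scores, dtype=float)

    # Hinge term from the text: max(0, 1 - max f(S_pt) + max f(S_n)).
    hinge = max(0.0, 1.0 - theft.max() + normal.max())

    # Assumed smoothness term: adjacent segment scores should be close.
    smooth = np.sum(np.diff(theft) ** 2)

    # Assumed sparsity term: only a few segments should score high.
    sparse = np.sum(theft)

    return hinge + lam_smooth * smooth + lam_sparse * sparse


# Illustrative scores: one theft video and one normal video, four segments each.
example_loss = ptd_ranking_loss([0.1, 0.2, 0.9, 0.3], [0.1, 0.2, 0.1, 0.05])
print(example_loss)
```

Only the highest-scoring segment of each video enters the hinge term, which is what lets the model train without knowing which segments of a theft video actually contain the theft.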
The distribution of videos across categories is shown in Table 1.

We detect package theft with the GLF-PTDE, which is extracted from the fully connected (FC) layer FC6 of the C3D network [5]. All videos are resized to 240 × 320 and resampled to 30 frames per second. For each segment, we use C3D to compute a feature for every 16-frame video clip, apply L2 normalization, and then average all 16-frame clip features to obtain the segment feature. We concatenate the CNN feature and the human pose feature as the input to C3D; the human pose feature is the normalized image coordinate of each human joint together with the corresponding confidence score. We use 18 joints, so the human pose feature size is 18 × 3. The pose features are extracted with a model pre-trained on the COCO dataset, since our package theft dataset has no human pose ground truth. This pseudo pose feature is sufficient to improve overall performance; however, fine-tuning the pose model should yield further improvement. After C3D feature extraction, we obtain a 4096-D feature vector, which is passed to a 3-layer FC neural network to estimate the package theft confidence score. The three FC layers have 512, 32, and 1 units, respectively. The network is trained from scratch. The optimizer is Adagrad with a learning rate of 0.01, and training runs for 5000 epochs. The Receiver Operating Characteristic (ROC) curve of G-PTDE is shown in Fig. 2, based on which the package theft detection threshold is set to 0.2. The weights of the sparsity and smoothness constraints are both $8 \times 10^{-5}$. For efficiency, segments do not share overlapping frames.

In our experiments, we compare our method with the state-of-the-art anomaly detection approach [3], which is denoted G-PTDE in this paper.
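The segment feature and the scoring head described above can be sketched in NumPy as follows. The layer sizes (4096, 512, 32, 1), the L2 normalization followed by averaging, and the 0.2 threshold come from the text. The ReLU and sigmoid activations, the random initialization, and all function names are assumptions made for illustration, and the weights here are untrained.

```python
import numpy as np

THRESHOLD = 0.2  # detection threshold reported in the text


def segment_feature(clip_features: np.ndarray) -> np.ndarray:
    """L2-normalize each 16-frame clip feature, then average over the clips."""
    norms = np.linalg.norm(clip_features, axis=1, keepdims=True)
    normalized = clip_features / np.maximum(norms, 1e-12)  # guard against zeros
    return normalized.mean(axis=0)


def init_head(rng, dims=(4096, 512, 32, 1)):
    """Create (weight, bias) pairs for the three FC layers (random, untrained)."""
    return [(rng.normal(0.0, 0.01, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]


def theft_score(feature: np.ndarray, params) -> float:
    """Map a 4096-D segment feature to a confidence score in (0, 1)."""
    hidden = feature
    for weight, bias in params[:-1]:
        hidden = np.maximum(hidden @ weight + bias, 0.0)   # ReLU (assumed)
    weight, bias = params[-1]
    logit = hidden @ weight + bias
    return float(1.0 / (1.0 + np.exp(-logit[0])))          # sigmoid (assumed)


rng = np.random.default_rng(0)
clips = rng.random((4, 4096))        # stand-in for four C3D clip features
feature = segment_feature(clips)
params = init_head(rng)
score = theft_score(feature, params)
is_theft = score > THRESHOLD
print(feature.shape, is_theft)
```

With trained weights, `is_theft` would flag the segments whose score exceeds the threshold chosen from the ROC curve.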
As the metric, we use the area under the ROC curve (AUC) to evaluate our method, since only a small portion of a long video contains package theft. We conduct three experimental settings, reported in Table 2.

Table 2. PTD performance on different training datasets

Moreover, we analyze performance in three different normal-condition comparisons: normal delivery, normal pickup, and irrelevant. Table 3 shows that the most significant improvement of GLF-PTDE comes from normal delivery versus package theft. GLF-PTDE is clearly more capable of distinguishing normal delivery from package theft, which is important for a package theft detection system. The improvement brought by GLF-PTDE is 0.14 on normal delivery (0.54 to 0.68), and 0.01 and 0.02 for normal pickup and irrelevant actions, respectively.

Package Theft vs.   Delivery   Pickup   Irrelevant
G-PTDE [3]          0.54       0.67     0.85
GLF-PTDE            0.68       0.68     0.87

Table 3. PTD performance on different normal conditions

We propose a deep learning approach to detect package theft in home surveillance videos. Due to the complexity of these

[1] Anomaly candidate identification and starting time estimation of vehicles from traffic videos
[2] Deep-cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes
[3] Real-world anomaly detection in surveillance videos
[4] Detecting anomalous events in videos by learning deep representations of appearance and motion
[5] Learning spatiotemporal features with 3D convolutional networks
[6] OpenPose: Realtime multi-person 2D pose estimation using part affinity fields