key: cord-0047202-n8zgwxmc
authors: Quddus Khan, Akif; Khan, Salman; Ullah, Mohib; Cheikh, Faouzi Alaya
title: A Bottom-Up Approach for Pig Skeleton Extraction Using RGB Data
date: 2020-06-05
journal: Image and Signal Processing
DOI: 10.1007/978-3-030-51935-3_6
sha: 86a8019354125825823770861646da72d562d8c9
doc_id: 47202
cord_uid: n8zgwxmc

Animal behavior analysis is a crucial task for industrial farming. In an indoor farm setting, extracting the key joints of animals is essential for tracking an animal over a long period of time. In this paper, we propose a deep network that exploits transfer learning to train for pig skeleton extraction in an end-to-end fashion. The backbone of the architecture is an hourglass stacked dense-net. To train the network, keyframes are selected from the test data using a K-means sampler. In total, nine keypoints are annotated, which enables a detailed behavior analysis in the farm setting. Extensive experiments are conducted, and the quantitative results show that the network has the potential to increase tracking performance by a substantial margin.

Automatic behavior analysis of different animal species is one of the most important tasks in computer vision. Due to the variety of applications in the human social world, such as sports player analysis [1], anomaly detection [2], action recognition [3], crowd counting [4], and crowd behavior [5, 6], humans have been the main focus of research. However, due to the growing demand for food supplies, vision-based behavior analysis tools are becoming pervasive in the farming industry, and the demand for cheaper and systematic solutions is on the rise. From an algorithmic point of view, other than the characterization of the problem, algorithm design for humans and for farm animals is similar. Essentially, behavior analysis is a high-level computer vision task and consists of feature extraction, 3D geometry analysis, and recognition, to name a few. As far as the input data is concerned, it can be obtained through smart sensors (radio-frequency identification [7], gyroscopes [8], GPS [9]). Depending on the precision of measurement, such sensors give acceptable results, but they also have many drawbacks. For example, in most cases the sensor must be removed from the animal to collect the data, a process that is exhausting for the animals and laborious for the human operator. In comparison, video-based automated behavior analysis offers a non-invasive solution. Due to cheaper hardware, it is not only convenient for the animals but also cost-effective for the industry. Automatic behavior analysis and visual surveillance [10-12] have been used for the security of public places (airports, shopping malls, subways, etc.) and have turned into a mature field of computer vision. In this regard, Hu et al. [13] proposed a recurrent neural network named MaskRNN for instance-level video segmentation. The network exploits temporal information from long video segments in the form of optical flow and performs binary segmentation for each object class. Ullah et al. [14] extracted low-level global and keypoint features from video segments to train a neural network in a supervised fashion. The trained network classifies different human actions such as walking, jogging, running, boxing, waving, and clapping.
Inspired by human social interaction, a hybrid social influence model was proposed in [15] that mainly focuses on the motion segmentation of moving entities in the scene. Lin et al. [16] proposed a feature pyramid network that extracts features at different levels of a hierarchical pyramid and could potentially benefit several segmentation [13, 17, 18], detection [19, 20], and classification [21, 22] frameworks. Additionally, such techniques are very beneficial in the field of cybersecurity [23, 24]. Addressing the problem of scale variability in object detection, Khan et al. [20] proposed a dimension-invariant convolutional neural network (DCNN) that complements the performance of R-CNN [19], and many other state-of-the-art object detectors [4, 25] could take advantage of it. Inspired by the success of deep features, [26] proposed a two-stream deep convolutional network where the first stream focuses on spatial features while the second stream exploits temporal features for video classification. The open-source framework OpenPose, proposed by Cao et al. [27], focuses on the detection of keypoints of the human body rather than the detection of the whole body. Detection of keypoints has potential applications in pose estimation and, consequently, behavior analysis. Their architecture consists of two convolutional neural networks: the first network extracts features and gives the locations of the main joints of the human body in the form of a heat map, while the second network is responsible for associating the corresponding body joints. For feature extraction, they used the classical VGG architecture. Frameworks like OpenPose are very helpful for skeleton extraction of the human body and could potentially be used in a tracking framework. For example, the Bayesian framework proposed in [28] works as a keypoint tracker, where any keypoint, such as the position of the head [29], the neck, or any other body part, can be used for tracking over a long period. Such keypoints can be obtained from a variety of human pose estimation algorithms. For example, Sun et al. [30] proposed parallel multi-resolution subnetworks for human pose estimation. The idea of a parallel network helps preserve high resolution and yields high-quality feature maps that result in better spatial keypoint locations. Essentially, in such a setting, the detection module is replaced by [27, 30]. In this regard, a global optimization approach like [31] could be helpful for accurate tracking in an offline setting. Focusing only on human pose estimation, Fang et al. [32] proposed a top-down approach where the humans are first detected in the form of bounding boxes, and the joints and keypoints are then extracted through a regional multi-person pose estimation framework. Such a framework is helpful not only for the localization and tracking of targets in the scene, but also for obtaining the pose information of all targets sequentially. For a robust appearance model that can differentiate between targets, a sparse-coded deep features framework was proposed in [33] that accelerates the extraction of deep features from different layers of a pre-trained convolutional neural network. The framework helps handle the bottleneck of appearance modeling in the tracking pipeline. Alexander et al. [34] used transfer learning [35] and fine-tuned a ResNet to detect 22 joints of horses for pose estimation.
They used data collected from 30 horses for within-domain and out-of-domain testing. The work by Mathis et al. [36] analyzed the behavior of mice during experimental tail-tracking and reach-and-pull joystick tasks. They also analyzed the behavior of Drosophila during egg laying. The classical way of inferring behavior is to first perform segmentation and tracking [37] and then, based on the temporal evolution of the trajectories, perform behavior analysis. However, approaches like [38] can be used to directly infer predefined actions and behaviors from the visual data. In addition to visual data, physiological signals [39, 40] and acoustic signals [41] can be used to identify different emotional states and behavioral traits in farm animals. Compared to existing methods, our proposed framework focuses on the extraction of the key joints of pigs in an indoor setting. The visual data is obtained from head-mounted Microsoft Kinect sensors. Our framework is inspired by [42], where a fully-convolutional stacked hourglass-shaped network is proposed that converts the image into a 16-channel space representing detection maps. For part detection, the thresholds are set from 0.10 to 0.90. These thresholds are used while evaluating the recall, precision, and F-measure metrics for both the vector-matching and Euclidean-matching results. Such an analysis provides a detailed overview of the trade-offs between precision and recall while maintaining an optimal detection threshold.

The rest of the paper is organized as follows. In Sect. 2 the proposed method is briefly explained, including the keypoints used in the experiment, the data filtration and annotation, and the augmentation. The model architecture, along with the loss function, the optimizer, and the training details, is elaborated in Sect. 3. The qualitative results are given in Sect. 4, and the remarks are given in Sect. 5, which concludes the paper.

The block diagram of the network is given in Fig. 1. It mainly consists of two encoder-decoder networks stacked back to back. The convolutional neural network used in each encoder-decoder is based on a dense net. The network takes a visual frame as input. To train the model, we annotate the data by first converting the videos into individual frames and then annotating each frame separately by specifying the important keypoints on the animal's body. After sufficient training, the model returns a 9 × 3 matrix for each frame, where each row corresponds to one keypoint, the first two columns specify the x and y coordinates of the detected point, and the third column contains the confidence score of the model. After obtaining the x and y coordinates for each frame, we visualize these keypoints on each frame and stitch all the individual frames back into a single video. A total of nine keypoints (nose, head, neck, right foreleg, left foreleg, right hind leg, left hind leg, tail base, and tail tip) are annotated for each pig.

The given RGB data consists of three sets with three pigs, six pigs, and ten pigs, respectively. Each dataset has 2880 images. Annotating a larger set of frames would yield better and more accurate results, but doing so manually is impractical; therefore, K-means clustering is applied to each dataset to select the most informative frames. As a result, only 280 images are extracted from the larger dataset, which is approximately 10% of the size of the original dataset.
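The paper does not specify which feature representation or implementation is used for this frame selection, so the sketch below is only one plausible realization: frames are decoded with OpenCV, described by downsampled grayscale pixel vectors, clustered with scikit-learn's KMeans, and the frame closest to each cluster centroid is kept. All function names and parameter values here are illustrative assumptions.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def select_keyframes(video_path, n_clusters=280, size=(64, 48)):
        """Keep one representative frame per K-means cluster.

        Assumed feature: downsampled grayscale pixel values per frame.
        """
        cap = cv2.VideoCapture(video_path)
        frames, feats = [], []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            feats.append(cv2.resize(gray, size).flatten().astype(np.float32))
        cap.release()

        feats = np.stack(feats)
        km = KMeans(n_clusters=n_clusters, random_state=0).fit(feats)

        # For each cluster, keep the frame closest to its centroid.
        keep = []
        for c in range(n_clusters):
            idx = np.where(km.labels_ == c)[0]
            dist = np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1)
            keep.append(idx[np.argmin(dist)])
        return [frames[i] for i in sorted(keep)]

Picking the frame nearest to each centroid yields roughly 10% of the original frames while still covering the main variations in pose and animal configuration.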
The dataset is annotated using the data annotator developed by Jake Graving. It provides a simple graphical user interface that reads keypoint data from a CSV file and saves the data in .h5 format once the annotation is completed. DeepPoseKit works with augmenters from the imgaug package; we used spatial augmentations with axis flipping and affine transforms.

The proposed framework is based on an hourglass dense-net, which is an efficient multi-scale deep-learning model. The architecture consists of an encoder and a decoder in which dense nets are stacked in sequential order. DenseNet is a densely connected convolutional network [43] and can be seen as the next generation of convolutional neural networks, capable of increasing the depth of the model while decreasing the number of parameters. The loss function is defined over the input sample x and the network prediction y, and measures the discrepancy between the predicted confidence maps and the ground-truth confidence maps drawn from the annotations. We used the ReduceLROnPlateau callback, which automatically reduces the learning rate of the optimizer when the validation loss stops improving; this helps the model reach a better optimum at the end of training.

While training a model on the three-pig data, a test of the data generators was run first. Creating a TrainingGenerator from the DataGenerator is an important step for training the model with annotated data. The TrainingGenerator uses the DataGenerator to load image-keypoint pairs, then applies the augmentation and draws the confidence maps for training the model. The validation_split argument defines how many training examples to use for validation during training. If the dataset is small (such as initial annotations for active learning), it can be set to validation_split = 0, which uses the whole training set for model fitting. However, when using callbacks in this case, we made sure to set monitor = "loss" instead of monitor = "val_loss". Choosing the learning-rate-reduction parameters carefully avoids wasting resources and helps prevent overfitting; for this reason, the callback is set to reduce the learning rate by a factor of 0.2 if the loss does not improve for 20 iterations. Another parameter used to prevent resource exploitation is early stopping: the patience is set to 100 iterations, so training stops automatically if the loss does not improve for 100 iterations.
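The training configuration described above maps closely onto the DeepPoseKit API. The sketch below shows one way it might be assembled, assuming DeepPoseKit with a TensorFlow/Keras backend; the annotation file name, stack count, growth rate, sigma, and batch sizes are illustrative placeholders, while the callback settings (monitor = "loss", factor 0.2 with patience 20, and early-stopping patience 100) follow the values stated in the text.

    import imgaug.augmenters as iaa
    from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
    from deepposekit.io import DataGenerator, TrainingGenerator
    from deepposekit.models import StackedDenseNet
    from deepposekit.augment import FlipAxis

    # Annotated keypoints saved by the annotator in .h5 format (placeholder path).
    data_generator = DataGenerator("annotations.h5")

    # Spatial augmentation: axis flipping plus affine transforms (ranges are placeholders).
    augmenter = iaa.Sequential([
        FlipAxis(data_generator, axis=0),                # flip up-down, keypoints swapped accordingly
        FlipAxis(data_generator, axis=1),                # flip left-right
        iaa.Affine(rotate=(-15, 15), scale=(0.9, 1.1)),  # affine transform
    ])

    # With a small annotated set, validation_split=0 uses all frames for fitting,
    # so the callbacks monitor the training loss instead of the validation loss.
    train_generator = TrainingGenerator(
        generator=data_generator,
        downsample_factor=2,
        augmenter=augmenter,
        sigma=5,
        validation_split=0,
    )

    model = StackedDenseNet(train_generator, n_stacks=2, growth_rate=32)

    callbacks = [
        ReduceLROnPlateau(monitor="loss", factor=0.2, patience=20, verbose=1),
        EarlyStopping(monitor="loss", patience=100, verbose=1),
    ]

    model.fit(batch_size=8, validation_batch_size=8, callbacks=callbacks, epochs=400)

Because validation_split is zero, the whole annotated set is used for model fitting, which is why both callbacks monitor "loss" rather than "val_loss", matching the configuration described above.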
Training started at a loss of 220; after 400 iterations, the loss stopped improving at 4.5. In the test case, when given the same video from which the dataset was generated, very accurate results are produced. The proposed framework is implemented in Python with Keras on a TensorFlow backend, and the processing is performed on an Nvidia P-100 with 32 GB RAM.

The qualitative results are shown in Fig. 2. It can be seen that the keypoints are successfully extracted from the pig joints. However, when the pigs are very close to each other, the extracted keypoints are sometimes associated with the wrong animal. This is simply because the network is trained on very limited data; training the network with more data would help improve the association of keypoints.

We proposed a deep network for animal skeletonization based only on RGB data. The network exploits various data augmentation techniques and transfer learning to fine-tune its parameters. The backbone of the network is based on an hourglass stacked dense-net. To train the network, keyframes are selected from the test data using a K-means sampler. In total, nine keypoints are annotated, which enables a detailed behavior analysis in the farm setting. Experiments are conducted on pig data, and the quantitative results show that training the network with only 280 frames yields promising results.

References

1. Disam: density independent and scale aware model for crowd counting and localization
2. Anomalous entities detection and localization in pedestrian flows
3. Vision-based action recognition of construction workers using dense trajectories
4. Person head detection based deep model for people counting in sports videos
5. Crowd behavior identification
6. Dominant motion analysis in regular and irregular crowd scenes
7. Measuring the drinking behaviour of individual pigs housed in group using radio frequency identification (RFID)
8. Stacked LSTM network for human activity recognition using smartphone data
9. GPS tracking of free-ranging pigs to evaluate ring strategies for the control of cysticercosis/taeniasis in Peru
10. Facial emotion recognition using hybrid features
11. Distributed deep learning model for intelligent video surveillance systems with edge computing
12. Crowd motion analysis: segmentation, anomaly detection, and behavior classification
13. MaskRNN: instance level video object segmentation
14. Human action recognition in videos using stable features
15. A hybrid social influence model for pedestrian motion segmentation
16. Feature pyramid networks for object detection
17. Density independent hydrodynamics model for crowd coherency detection
18. PedNet: a spatio-temporal deep convolutional neural network for pedestrian segmentation
19. Fast R-CNN
20. Dimension invariant model for human head detection
21. Hierarchical semantic image matching using CNN feature pyramid
22. Single shot appearance model (SSAM) for multi-target tracking
23. Modeling attack and defense scenarios for cyber security exercises
24. Detecting windows based exploit chains by means of event correlation and process monitoring
25. Faster R-CNN: towards real-time object detection with region proposal networks
26. Two stream model for crowd video classification
27. OpenPose: realtime multi-person 2D pose estimation using part affinity fields
28. HOG based real-time multi-target tracking in Bayesian framework
29. Head-based tracking
30. Deep high-resolution representation learning for human pose estimation
31. A directed sparse graphical model for multi-target tracking
32. RMPE: regional multi-person pose estimation
33. A hierarchical feature model for multi-target tracking
34. Pretraining boosts out-of-domain robustness for pose estimation
35. Hand-crafted vs deep features: a quantitative study of pedestrian appearance model
36. Markerless tracking of user-defined features with deep learning
37. Deep feature based end-to-end transportation network for multi-target tracking
38. Implementation of machine vision for detecting behaviour of cattle and pigs
39. An image based prediction model for sleep stage identification
40. Frequency-dependent changes in resting state electroencephalogram functional networks after traumatic brain injury in piglets
41. Use of vocalisation to identify sex, age, and distress in pig production
42. Multi-pig part detection and association with a fully-convolutional network
43. Densely connected convolutional networks