title: Guidance and Teaching Network for Video Salient Object Detection
authors: Jiao, Yingxia; Wang, Xiao; Chou, Yu-Cheng; Yang, Shouyuan; Ji, Ge-Peng; Zhu, Rong; Gao, Ge
date: 2021-05-21

Owing to the difficulty of mining spatial-temporal cues, existing approaches for video salient object detection (VSOD) are limited in understanding complex and noisy scenarios, and often fail to infer prominent objects. To alleviate these shortcomings, we propose a simple yet efficient architecture, termed the Guidance and Teaching Network (GTNet), which independently distils effective spatial and temporal cues with implicit guidance and explicit teaching at the feature and decision levels, respectively. To be specific, we (a) introduce a temporal modulator to implicitly bridge features from the motion branch into the appearance branch, which is capable of fusing cross-modal features collaboratively, and (b) utilise a motion-guided mask to propagate explicit cues during feature aggregation. This learning strategy achieves satisfactory results by decoupling the complex spatial-temporal cues and mapping informative cues across different modalities. Extensive experiments on three challenging benchmarks show that the proposed method runs at ∼28 fps on a single TITAN Xp GPU and performs competitively against 14 cutting-edge baselines.

Video salient object detection (VSOD) has been a long-standing research topic in computer vision, which aims to predict conspicuous and attractive objects in a given video clip. It has been applied to autonomous driving, action segmentation, and video captioning. In recent years, much progress [1, 2] has been made in handling unconstrained videos, but there remains considerable room for improvement that has not yet been adequately explored.

Motion (e.g., optical flow [3] and trajectory [4]) and appearance (e.g., color [5] and superpixel [6] segmentation) features are both crucial cues for understanding a dynamic salient object against an unconstrained background. Several works have developed spatial-temporal convolutional neural networks (CNNs) for learning discriminative appearance and motion features, in which recurrent memory [7, 8] and 3D convolution [9] are frequently used. However, they are hindered by the following problems. As for the former, it is unable to handle spatial and temporal cues simultaneously. Besides, it processes the inputs only sequentially due to the transmissible temporal memory, and thus its efficiency is largely limited. As for the latter, 3D CNNs are difficult to optimise when the number of temporal layers is large, owing to the exponential growth of the solution space. In addition, they are burdened by a high computation cost (∼1.5× the memory cost of 2D CNNs). Thus, it is imperative to model the spatial and temporal representations separately, which can bring considerable benefits for VSOD.

Fig. 1: (a) The typical UNet-shaped [10] framework with a full decoder for aggregating feature pyramids. (b) Our pipeline utilises implicit guidance to bridge the teacher (i.e., motion-dominated) branch and the student (i.e., appearance-dominated) branch. To acquire explicit knowledge from the teacher branch, we utilise the teacher partial decoder (T-PD) under deep supervision to obtain the motion-guided mask, and use it to teach the decoding phase of the student partial decoder (S-PD).
To efficiently fuse motion and appearance cues, research has explored different cross-modal strategies with separate branches and has achieved encouraging results. Earlier works (e.g., [11]) address this issue by encoding motion and appearance features individually and directly aggregating them. However, this may unintentionally introduce feature conflicts due to the different characteristics of the two modalities. A more natural idea is to excavate the relationship between motion and appearance cues in a guided fashion. Thus, recent cutting-edge methods (e.g., [12]) utilise the motion-dominated features to guide the encoding/decoding of the appearance-dominated features, achieving satisfactory results. However, these methods treat this as a problem of implicit feature propagation, which is primarily hindered by an inexplicable feature transmission/learning process.

To alleviate the above concerns, we propose a framework equipped with implicit guidance and explicit teaching strategies at the feature and decision levels, towards effective and efficient VSOD. As shown in Fig. 1, the proposed scheme treats the motion-dominated features as teacher knowledge. It then uses these features to implicitly guide the encoding of the appearance-dominated (i.e., student) features, while explicitly teaching the decoding of the student partial decoder via the teacher partial decoder (T-PD). Owing to these strategies, this efficient scheme can generate satisfactory results in challenging scenarios. To the best of our knowledge, GTNet is the pioneering work to explore both implicit guidance and explicit teaching mechanisms for VSOD. Our main contributions are as follows:

• We emphasise the importance of both implicit guidance and explicit teaching strategies for spatial-temporal representations. This is based on the observation that motion-guided features and masks provide discriminative semantic and temporal cues without redundant structures, contributing to an efficient decoding phase in the appearance-dominated branch.

• We introduce a temporal modulator that contains two sequential attention mechanisms, from the channel and spatial views, working in a deeply collaborative manner.

• Comprehensive experiments on 3 benchmarks demonstrate that the proposed method runs at ∼28 fps on a single TITAN Xp GPU and achieves competitive performance against 14 cutting-edge VSOD models on 3 metrics, making it a potential solution for practical applications.

Fig. 2: The illustration of our GTNet. The dual-branch framework is bridged by the implicit guidance (red lines) in the temporal modulator. The student partial decoder excavates the appearance-dominated features under the explicit teaching (yellow lines) from the motion-guided mask, which stems from the teacher partial decoder. Please refer to § 2 for more details.

Given a series of input frames $\{X^A_t\}_{t=2}^{T}$ and the corresponding optical flow maps $\{X^M_t\}_{t=2}^{T}$, generated by an optical flow estimator (i.e., RAFT [3]), we first feed $X^A_t$ and $X^M_t$ into the proposed dual-branch architecture (see Fig. 2) and generate the appearance-dominated features $\{f^A_k\}_{k=1}^{5}$ and motion-dominated features $\{f^M_k\}_{k=1}^{5}$ at frame $t$, implemented by two separate ResNet-50 backbones. Note that we discard the first frame $X^A_1$ and its optical flow map $X^M_1$ in our experiments due to the impact of the frame-difference algorithm. Second, we use the temporal modulator (TM, see § 2.2) to enhance the motion-dominated (i.e., teacher) features from the spatial- and channel-aware views and transfer them to the appearance-dominated (i.e., student) branch with the implicit guidance strategy. Then, we aggregate the top-three motion-dominated features $\{f^M_k\}_{k=3}^{5}$ via a teacher partial decoder (T-PD) and generate a motion-guided mask $Z^M$ at frame $t$. This mask is used to explicitly teach the aggregation of the top-three appearance-dominated features $\{f^A_k\}_{k=3}^{5}$ via a symmetric student partial decoder (S-PD, see § 2.3), which generates the final prediction map $Z^A$ at frame $t$.
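To make the dual-branch feature extraction described above concrete, the following is a minimal PyTorch sketch. It is not the released implementation: the class and variable names are illustrative, the five-level split of ResNet-50 and the rendering of optical flow as a 3-channel image are assumptions, and pretrained weights would presumably be loaded in practice.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualBranchEncoder(nn.Module):
    """Two separate ResNet-50 backbones extract five-level feature pyramids
    from the RGB frame (appearance branch) and its optical flow map (motion branch)."""
    def __init__(self):
        super().__init__()
        self.appearance = self._make_stages()   # student branch
        self.motion = self._make_stages()       # teacher branch

    @staticmethod
    def _make_stages():
        r = models.resnet50(weights=None)       # assumption: the paper likely starts from ImageNet weights
        return nn.ModuleList([
            nn.Sequential(r.conv1, r.bn1, r.relu),   # level 1
            nn.Sequential(r.maxpool, r.layer1),      # level 2
            r.layer2,                                # level 3
            r.layer3,                                # level 4
            r.layer4,                                # level 5
        ])

    def forward(self, x_a, x_m):
        f_a, f_m = [], []
        for stage_a, stage_m in zip(self.appearance, self.motion):
            x_a, x_m = stage_a(x_a), stage_m(x_m)
            f_a.append(x_a)   # appearance-dominated features {f^A_k}, k = 1..5
            f_m.append(x_m)   # motion-dominated features {f^M_k}, k = 1..5
        return f_a, f_m

frame = torch.randn(1, 3, 352, 352)   # RGB frame X^A_t
flow = torch.randn(1, 3, 352, 352)    # optical flow map X^M_t, rendered as a 3-channel image
f_a, f_m = DualBranchEncoder()(frame, flow)
```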
To maintain the semantic consistency between these two kinds of features, we utilise the temporal modulator (TM) to implicitly transfer the motion-dominated features from the teacher branch to the student (i.e., appearance-dominated) branch. Specifically, the implicit guidance strategy works collaboratively at each feature pyramid level $k$ and is formulated as

$f^{tm}_k = f^A_k \oplus \mathcal{F}_{TM}(f^M_k), \quad k \in \{1, 2, 3, 4, 5\},$ (1)

where $\oplus$ denotes element-wise addition. The temporal modulator function $\mathcal{F}_{TM}$ is defined as two sequential attention processes, namely the channel-wise attention $A^k_C$ and the spatial-wise attention $A^k_S$ at level $k$:

$\mathcal{F}_{TM}(f^M_k) = A^k_S\big(A^k_C(f^M_k)\big).$

Specifically, the former is defined as

$A^k_C(f) = \sigma\big[g(P_{max}(f); W^k_C)\big] \odot f,$

where $P_{max}(\cdot)$ denotes the adaptive max-pooling operation over the spatial dimensions of the candidate feature, $g(\cdot; W^k_C)$ denotes dual fully-connected layers parameterized by learnable weights $W^k_C$, and $\sigma[x]$ and $\odot$ represent the activation function and the channel-wise multiplication operation, respectively. In our default implementation, we adopt the widely used sigmoid function, $\sigma[x] = 1/(1 + \exp(-x))$, to activate the input feature. The latter is defined as

$A^k_S(f) = \sigma\big[g(R_{max}(f); W^k_S)\big] \otimes f,$

where $R_{max}(\cdot)$ denotes global max-pooling along the channel dimension of the input feature, $g(\cdot; W^k_S)$ denotes a convolution layer with a 7×7 kernel, and $\otimes$ is the spatial-wise multiplication operation. Extensive experiments (see § 3) show the effectiveness of these operations in transferring useful motion patterns from the teacher branch into the student branch.
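As an illustration of the two sequential attention steps above, here is a minimal PyTorch sketch of one temporal-modulator level. The class name, the reduction ratio of the dual fully-connected layers, and the additive fusion of the modulated motion feature into the appearance branch are assumptions for illustration, not details taken from the released code.

```python
import torch
import torch.nn as nn

class TemporalModulator(nn.Module):
    """Sketch of F_TM at one pyramid level: channel-wise attention A_C
    followed by spatial-wise attention A_S, both based on max-pooling."""
    def __init__(self, channels, reduction=16):            # reduction ratio is assumed
        super().__init__()
        self.fc = nn.Sequential(                            # dual fully-connected layers g(.; W_C)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv7 = nn.Conv2d(1, 1, kernel_size=7, padding=3)  # g(.; W_S), 7x7 convolution

    def forward(self, f_m, f_a):
        b, c, _, _ = f_m.shape
        # Channel-wise attention: max-pool over space, dual FC, sigmoid, channel-wise multiply.
        w_c = torch.sigmoid(self.fc(torch.amax(f_m, dim=(2, 3)))).view(b, c, 1, 1)
        f = w_c * f_m
        # Spatial-wise attention: max-pool along channels, 7x7 conv, sigmoid, spatial-wise multiply.
        w_s = torch.sigmoid(self.conv7(torch.amax(f, dim=1, keepdim=True)))
        f = w_s * f
        # Implicit guidance: bridge the modulated motion feature into the appearance
        # branch (additive fusion assumed here).
        return f_a + f

f_m = torch.randn(2, 512, 44, 44)   # motion-dominated feature f^M_k
f_a = torch.randn(2, 512, 44, 44)   # appearance-dominated feature f^A_k
f_tm = TemporalModulator(512)(f_m, f_a)
```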
In addition to the implicit guidance between the teacher and student branches, we propose to propagate the motion-guided mask $Z^M$, which contains rich motion cues, to teach the decoding of the top-three appearance-dominated features $f^A_k$, $k \in \{3, 4, 5\}$.

Teacher Partial Decoder. Given the two groups of cross-modal, multi-level features from the appearance- and motion-dominated branches, we aim to utilise them efficiently to generate an accurate prediction map for each modality. Thus, as a time-cost trade-off, we first introduce the teacher partial decoder to aggregate the motion-dominated features at the top three levels, as suggested by [13]. To reduce computational redundancy while maintaining the representational capability of the candidates, we use receptive field blocks [14] to obtain the motion-refined features, i.e., $r^M_k = \mathcal{F}^{M,k}_{RF}(f^M_k)$, before the teacher partial decoder $\mathcal{F}_{TPD}$, which is formulated as

$Z^M = \mathcal{F}_{TPD}\big(r^M_3, r^M_4, r^M_5\big).$

Specifically, it consists of two steps:

Step-I: $\hat{r}^M_k = r^M_k \otimes \prod_{i=k+1}^{5} g\big(\delta(r^M_i); W^k_i\big), \quad k \in \{3, 4\}, \qquad \hat{r}^M_5 = r^M_5,$

Step-II: $Z^M = \mathcal{F}_{U}\big(\hat{r}^M_3, \hat{r}^M_4, \hat{r}^M_5\big).$

We term Step-I the feature broadcasting phase, which broadcasts the strong semantic features to the weakly semantic features. Here, $\prod$ denotes the element-wise multiplication of multiple inputs iterated by $i$, each parameterized by learnable weights $W^k_i$, and $\delta(\cdot)$ is an upsampling function that ensures the feature shapes match. In Step-II, we obtain the intermediate motion-guided mask $Z^M$ via a typical UNet-shaped [15] decoder $\mathcal{F}_U$ (with the bottom two layers removed).

Motion-guided Mask Propagation. To effectively leverage the motion-guided mask, we explicitly propagate $Z^M$ to the top-three appearance-dominated features $f^{tm}_k$ derived from the student branch (see Eq. 1). This explicit teaching operation can be formulated as

$\hat{f}^{tm}_k = \big(f^{tm}_k \otimes Z^M\big) \oplus f^{tm}_k, \quad k \in \{3, 4, 5\},$

where $\oplus$ and $\otimes$ denote the element-wise addition and multiplication operations, respectively, and $Z^M$ is resized to the resolution of $f^{tm}_k$.

Student Partial Decoder. Adopting the same formulation as the teacher partial decoder defined above, we propagate the motion-guided mask $Z^M$ into the student partial decoder and produce the result $Z^A$:

$Z^A = \mathcal{F}_{SPD}\big(\hat{f}^{tm}_3, \hat{f}^{tm}_4, \hat{f}^{tm}_5\big).$

In the inference phase, we take the output $Z^A$ followed by a sigmoid function as the final prediction.
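To make the explicit teaching step concrete, below is a minimal PyTorch sketch of the mask-propagation operation together with a simplified partial-decoder aggregation. The residual masking form, the bilinear resizing, the 64-channel toy width, and all module names are assumptions; the paper's actual decoders additionally use receptive field blocks and a UNet-shaped decoder head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def teach_with_mask(f_tm, z_m):
    """Explicit teaching: modulate an appearance feature f^tm_k with the
    motion-guided mask Z^M (element-wise multiply, then residual add)."""
    z = torch.sigmoid(F.interpolate(z_m, size=f_tm.shape[2:], mode='bilinear',
                                    align_corners=False))   # resize mask to the feature resolution
    return f_tm * z + f_tm

class PartialDecoder(nn.Module):
    """Simplified partial decoder over the top-three levels (k = 3, 4, 5):
    upsample deeper features, multiply them into shallower ones
    (feature broadcasting), then fuse to a single-channel map."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv54 = nn.Conv2d(channels, channels, 3, padding=1)   # g(.; W) for level 5 -> 4
        self.conv53 = nn.Conv2d(channels, channels, 3, padding=1)   # level 5 -> 3
        self.conv43 = nn.Conv2d(channels, channels, 3, padding=1)   # level 4 -> 3
        self.head = nn.Conv2d(channels * 3, 1, 1)                   # fuse to a 1-channel map

    def forward(self, r3, r4, r5):
        up = lambda x, ref: F.interpolate(x, size=ref.shape[2:], mode='bilinear',
                                          align_corners=False)      # plays the role of delta(.)
        r4 = r4 * self.conv54(up(r5, r4))                            # Step-I broadcasting
        r3 = r3 * self.conv53(up(r5, r3)) * self.conv43(up(r4, r3))
        x = torch.cat([r3, up(r4, r3), up(r5, r3)], dim=1)           # Step-II style fusion
        return self.head(x)

# Toy usage: 64-channel refined features at strides 8, 16, 32 of a 352x352 input.
r3, r4, r5 = (torch.randn(1, 64, s, s) for s in (44, 22, 11))
z_m = PartialDecoder()(r3, r4, r5)                                   # motion-guided mask Z^M
f3 = teach_with_mask(torch.randn(1, 64, 44, 44), z_m)                # taught student feature
```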
Implementation Details. We implement our network in PyTorch and train it on a TITAN Xp GPU. Following the same training pipeline as [12], we first train the teacher branch on the generated optical flow maps. Then, we train the student branch on the DUTS [28] dataset. Finally, we train the dual-branch framework on the training set of DAVIS16 [17]. During training, we adopt the Adam optimizer with an initial learning rate of 1e-4, decayed by a factor of 0.1 every 25 epochs. We resize the input RGB frames and optical flow maps to 352×352 for both training and testing. Without any post-processing, the inference speed is 28 fps, excluding optical flow estimation. The code and results will be made publicly available.

Quantitative Comparison. To demonstrate the effectiveness of the proposed method, we compare our approach against 14 cutting-edge baselines, including 11 VSOD methods [21, 22, 23, 24, 25, 26, 7, 8, 1, 12, 27] and 3 ISOD methods [18, 19, 20]. The results in Tab. 1 show the superiority of our GTNet over the other baselines without any post-processing.

Qualitative Comparison. Following the same training settings as in Tab. 1, we compare three variants of GTNet, including '+M', '+A', and '+M+A (Ours)'. As shown in Fig. 3, our '+M+A' performs satisfactorily in visual quality compared with the other two variants; for example, '+M' alone only captures the coarse location of dynamic objects with fuzzy boundaries.

Effectiveness of implicit guidance. As shown in Tab. 2, we conduct ablation studies by decoupling the temporal modulator (TM) to validate the effectiveness of our implicit guidance strategy. Removing both 'CA' and 'SA' (No.#1) is worse than our full method (last row), decreasing by 1.6% in terms of $S$ on ViSal [16] and 0.9% in terms of $F_\beta$ on DAVIS16 [17]. Besides, we observe that the variants that remove only 'CA' (No.#2) or only 'SA' (No.#3) improve upon No.#1, and combining both attentions (No. OUR) further boosts the performance.

Table 3: Ablation studies for the explicit teaching strategy on ViSal [16] and DAVIS16 [17].

Effectiveness of explicit teaching. As shown in Tab. 3, we further analyse the effectiveness of the explicit teaching strategy by removing the teacher partial decoder (T-PD), the student partial decoder (S-PD), and the motion-guided mask propagation ('teaching'). We first remove the T-PD from the motion branch (No.#4) to validate its effectiveness; the performance on DAVIS16 degrades by a large margin (1.3% in $S$). Then, we verify the necessity of the S-PD by decoupling it (No.#5) and observe that this variant decreases by 0.8% in $F_\beta$ on ViSal. Further, we remove the motion-guided mask propagation (i.e., the yellow lines in Fig. 2) before the S-PD (No.#6) to verify the effectiveness of the proposed explicit teaching strategy. Compared with No.#6, No. OUR with the teaching mechanism improves by 0.7% in $S$ and 0.6% in $F_\beta$ on the ViSal [16] dataset, which demonstrates that the teaching mechanism is critical to the performance.

In this paper, we emphasise the importance of both implicit guidance and explicit teaching strategies for spatial-temporal representations. Our dual-branch architecture is attributed to two key designs: (i) a temporal modulator that implicitly transmits representative motion-dominated cues into the appearance-dominated branch; and (ii) a motion-guided mask that explicitly teaches the feature aggregation of the appearance-dominated branch. Extensive experiments on three datasets demonstrate that the proposed GTNet performs competitively compared with 14 cutting-edge approaches.

References
Shifting more attention to video salient object detection
Slowfast networks for video recognition
Raft: Recurrent all-pairs field transforms for optical flow
Motion trajectory segmentation via minimum cost multicuts
Geodesic saliency using background priors
Saliency-aware geodesic video object segmentation
Flow guided recurrent neural encoder for video salient object detection
Pyramid dilated deeper convlstm for video salient object detection
Progressively normalized self-attention network for video polyp segmentation
U-net: Convolutional networks for biomedical image segmentation
Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos
Motion guided attention for video salient object detection
Pranet: Parallel reverse attention network for polyp segmentation
U-net: Convolutional networks for biomedical image segmentation
Consistent video saliency using local gradient flow optimization and global refinement
A benchmark dataset and evaluation methodology for video object segmentation
Deeply supervised salient object detection with short connections
A bidirectional message passing model for salient object detection
Basnet: Boundary-aware salient object detection
Segmenting salient objects from images and videos
Real-time salient object detection with a minimum spanning tree
Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion
Scom: Spatiotemporal constrained optimization for salient object detection
Weakly supervised salient object detection with spatiotemporal cascade neural networks
Video salient object detection via fully convolutional networks
Tenet: Triple excitation network for video salient object detection
Learning to detect salient objects with image-level supervision