key: cord-0221590-3glx4mji authors: Liu, Yang; Wei, Yushen; Yan, Hong; Li, Guanbin; Lin, Liang title: Causal Reasoning Meets Visual Representation Learning: A Prospective Study date: 2022-04-26 journal: nan DOI: nan sha: c8568dc01e0df4f8885109ddb8d88449a148a273 doc_id: 221590 cord_uid: 3glx4mji

Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization has become a major challenge for existing visual models. The majority of existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, and there is no unified guidance or analysis of why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.

With the emergence of huge amounts of heterogeneous multi-modal data, including images [1-3], videos [4-7], texts/languages [8-10], audios [11-14], and multi-sensor data [15-18], deep learning based methods have shown promising performance on various computer vision and machine learning tasks, for example, visual comprehension [19-21], video understanding [22-24], visual-linguistic analysis [25-27], and multi-modal fusion [28-30]. However, the existing methods rely heavily upon fitting the data distributions and tend to capture spurious correlations from the different modalities, and thus fail to learn the essential causal relations behind the multi-modal knowledge that would support good generalization and cognitive abilities. Relying on the assumption that most of the data in the computer vision community are independent and identically distributed (i.i.d.), a substantial body of literature [31-34] adopted data augmentation, pretraining, self-supervision, and novel architectures to improve the robustness of state-of-the-art deep neural network architectures.

Fig. 1: The overview of the structure of this paper, including the discussion of related methods, datasets, challenges, and the relations among causal reasoning, visual representation learning, and their integration.

It has been
argued that such strategies only learn correlation-based patterns (statistical dependencies) from data and may not generalize well without the guarantee of the i.i.d. setting [35]. Owing to its powerful ability to uncover the underlying structural knowledge about data-generating processes, which allows interventions and generalizes well across different tasks and environments, causal reasoning [36-38] offers a promising alternative to correlation learning. Recently, causal reasoning has attracted increasing attention in a myriad of high-impact domains of computer vision and machine learning, such as interpretable deep learning [39-44], causal feature selection [45-57], visual comprehension [58-67], visual robustness [68-75], visual question answering [76-81], and video understanding [82-89]. A common challenge of these causal methods is how to build a strong cognitive model that can fully discover causality and spatial-temporal relations. In this paper, we aim to provide a comprehensive overview of causal reasoning for visual representation learning, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causality-guided visual representation learning methods. Although there exist some surveys [37, 38, 46, 90-92] about causal reasoning, these works are intended for general representation learning tasks such as deconfounding, out-of-distribution (OOD) generalization, and debiasing. In contrast, our paper focuses on a systematic and comprehensive survey of related works, datasets, insights, future challenges, and opportunities for causal reasoning, visual representation learning, and their integration. To present the review more concisely and clearly, this paper selects and cites related work by considering its source, publication year, impact, and coverage of the different aspects of the topic surveyed. The overview of the structure of this paper is shown in Fig. 1. Overall, the main contributions of this paper are as follows. Firstly, this paper presents the basic concepts of causality, the structural causal model (SCM), the independent causal mechanism (ICM) principle, causal inference, and causal intervention. Then, based on this analysis, the paper gives some directions for conducting causal reasoning on visual representation learning tasks. Note that, to the best of our knowledge, this paper is the first to propose potential research directions for causal visual representation learning. Secondly, a prospective review is introduced to systematically and structurally review the existing works according to their efforts along the above directions for conducting causal visual representation learning more efficiently. We focus on the relation between visual representation learning and causal reasoning, provide a better understanding of why and how existing causal reasoning methods can be helpful in visual representation learning, and offer inspiration for future research. Thirdly, this paper explores and discusses some future research areas and open problems related to the use of causal reasoning methods to tackle visual representation learning. This can encourage and support the broadening and deepening of research in the related fields.
The remainder of this paper is organized as follows. Section 2 provides the preliminaries, including the basic concepts of causality, the structural causal model (SCM), the independent causal mechanism (ICM) principle, causal inference, and causal intervention. Section 3 discusses ways of using causal reasoning to learn robust features, a key technique for visual representation learning. Section 4 reviews some recent visual learning tasks, including visual understanding, action detection and recognition, and visual question answering, together with discussions of the existing challenges of these visual learning methods. Section 5 systematically reviews the related causality-based visual representation learning works. Section 6 provides a review of existing causal datasets for visual learning. Section 7 proposes and discusses some future research directions, and Section 8 concludes the paper.

As the saying "correlation is not causation" goes, the fact that two variables are correlated does not mean that one of them causes the other. In fact, statistical learning models the correlations in data. By observing a sufficient amount of i.i.d. data, statistical learning methods can perform considerably well under i.i.d. settings. But when facing problems that do not satisfy the i.i.d. assumption, their performance is often poor (e.g., image recognition models tend to predict "bird" when seeing "sky" in an image, since bird and sky usually appear simultaneously in the dataset). Causal learning [36] differs from statistical learning in that it aims to discover causal relations beyond statistical relations. Learning causality requires machine learning methods not only to predict the outcomes of i.i.d. experiments but also to reason from a causal perspective. Causal reasoning can be divided into three levels. The first level is association; the statistical machine learning methods mentioned above belong to this level. A typical question of association is "How would the weather change when the sky is turning grey?", which asks about the association between "weather" and "the appearance of the sky". The second level is intervention. An intervention-based question asks about the effect of an intervention (e.g., "Would I become stronger if I go to the gym every day?"). Intervention-based questions require us to answer what the outcome would be when taking a specific treatment, which cannot be answered by only learning data associations (e.g., from associations alone, observing that a man who goes to the gym every day is not stronger than a professional athlete might lead us to conclude that going to the gym does not always make one stronger). The third level is counterfactual; a typical form of a counterfactual question is "What if I had...", which focuses on the outcome when the condition was not realized. Counterfactual inference aims to compare different outcomes under the same condition, although the antecedent of the counterfactual question is not real.

The structural causal model (SCM) formalizes a causal story. Assume that we have a set of variables X_1, X_2, ..., X_n, where each variable is a vertex of a causal graph (i.e., a DAG that describes the causal relations among the variables). Then each variable can be written as the outcome of a function:

X_i = f_i(pa_i, U_i), i = 1, ..., n,  (1)

where pa_i indicates the parents of X_i in the causal graph and U_i refers to unmeasured factors such as noise. The deterministic function f_i gives a mathematical form to the effect of the direct causes pa_i on the variable X_i.
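Eq. 1 is straightforward to make concrete in code. The following is a minimal sketch (in Python with NumPy; the three-variable graph Z → X → Y with a confounding edge Z → Y, the linear functions, and all coefficients are illustrative assumptions, not taken from any surveyed work) that simulates an SCM by computing each variable from its parents and an exogenous noise term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Exogenous noise terms U_i: unmeasured factors, one per variable.
u_z = rng.normal(size=n)
u_x = rng.normal(size=n)
u_y = rng.normal(size=n)

# Each endogenous variable is X_i = f_i(pa_i, U_i), as in Eq. 1.
z = u_z                       # Z has no parents: Z := U_Z
x = 2.0 * z + u_x             # X := f_X(Z, U_X), pa_X = {Z}
y = 0.5 * x + 1.5 * z + u_y   # Y := f_Y(X, Z, U_Y), pa_Y = {X, Z}

# The observational regression slope of Y on X mixes the direct effect
# of X (0.5) with the confounding path X <- Z -> Y.
print(np.polyfit(x, y, 1)[0])  # ~1.1 rather than 0.5
```

The final line previews the confounding problem discussed below: the observational regression slope of Y on X (about 1.1) mixes the direct effect of X (0.5) with the backdoor path through Z.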
Using the graphical causal model and the SCM language, we can express the joint distribution as follows:

P(X_1, X_2, ..., X_n) = ∏_i P(X_i | pa_i).  (2)

Eq. 2 is called a product decomposition of the joint distribution. After the decomposition and graphical modeling, the causal relations and effects in a dataset can be represented by the causal graph and the joint distribution.

The independent causal mechanism (ICM) principle [37] can be expressed as follows. ICM principle: The causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions. The ICM principle describes the independence of causal mechanisms. If we conceive of the real world as being composed of such variable-level modules, then the modules represent the physically independent mechanisms of the world. When applying the ICM principle to the disentangled factorization of Eq. 2, it can be stated as [37]: (1) changing (or performing an intervention upon) one mechanism P(X_i|pa_i) does not change any of the other mechanisms P(X_j|pa_j) (j ≠ i); (2) knowing some other mechanisms P(X_i|pa_i) does not give us information about a mechanism P(X_j|pa_j) (j ≠ i). The ICM principle guarantees that our intervention on one mechanism does not affect the others, which further reveals the possibility of transferring knowledge across domains that share the same modules.

The purpose of causal inference is to estimate the outcome shift (or effect) of different treatments. A treatment here refers to an action applied to a unit. For example, if we have a medicine A, let A = 1 denote applying medicine A and A = 0 denote not applying it; then A = 1 is a treatment and the recovery of the patient is the outcome of the treatment A = 1. Under this setting, the aim of causal inference is to uncover the effect of applying treatment A. A counterfactual outcome is the potential outcome of an action that has not been taken. For example, if we take treatment A = 1, then the outcome of A = 0 is counterfactual. The average treatment effect (ATE) of treatment A = 1 can be written as:

ATE = E[Y(A = 1)] − E[Y(A = 0)],  (3)

where Y(A = a) denotes the potential outcome of treatment A = a. If we have taken treatment A = 1, then Y(A = 0) is the counterfactual outcome. The goal of causal inference is to estimate treatment effects from observational data, which is usually incomplete in real-world scenarios due to cost and ethical constraints. From a counterfactual perspective, we cannot observe the no-treatment outcome once we apply the treatment. Thus, we need to adopt causal inference to analyze the effect of a certain treatment.
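As a toy numerical illustration of Eq. 3, the sketch below uses simulated, hypothetical data under the assumption that the treatment is randomized, so that the difference of group means identifies the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Randomized treatment assignment: A = 1 (take the medicine) or A = 0.
a = rng.integers(0, 2, size=n)

# Simulated potential outcomes: the treatment raises recovery probability.
y0 = rng.binomial(1, 0.3, size=n)   # Y(A=0)
y1 = rng.binomial(1, 0.6, size=n)   # Y(A=1)
y = np.where(a == 1, y1, y0)        # only one outcome is ever observed

# Under randomization, the difference of group means estimates
# ATE = E[Y(A=1)] - E[Y(A=0)] from Eq. 3 (true value here: 0.3).
ate_hat = y[a == 1].mean() - y[a == 0].mean()
print(f"estimated ATE: {ate_hat:.3f}")
```

With observational rather than randomized data, the same difference of means would be biased by confounding, which motivates the adjustment strategies introduced next.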
Causal intervention for machine learning aims to capture the causal effects of interventions on variables and to take advantage of the causal relations in datasets to improve model performance and generalization ability. The basic idea of causal intervention is to use an adjustment strategy that modifies the graphical model and manipulates conditional probabilities to discover the causal relations among variables. In this section, we review two adjustment strategies: front-door adjustment and backdoor adjustment. Assume that we want to gauge the causal effect of X on Y. By Bayes' rule, we have:

P(Y|X) = Σ_z P(Y|X, z) P(z|X).  (4)

This conditional distribution cannot represent the true causal effect of X on Y, due to the existence of the backdoor path X ← Z → Y. The variable Z here is a confounder that affects not only the pre-intervention X but also the outcome Y, which makes the conditional distribution a collective effect of X and Z and thus leads to spurious correlation. To eliminate the spurious correlation introduced by the backdoor path, backdoor adjustment uses the do-operator to calculate the intervened probability P(Y|do(X)) instead of the conditional probability P(Y|X):

P(Y|do(X)) = Σ_z P(Y|X, z) P(z).  (5)

Compared with P(Y|X), P(Y|do(X)) replaces the conditional distribution P(z|X) with the marginal distribution P(z). Fig. 2 gives a graphical view of the do-operator: the edge from Z to X is deleted in the intervened causal graph to block the backdoor path X ← Z → Y, so X and Z become independent after the intervention.

Fig. 2: An example of backdoor adjustment; the backdoor path from X to Y is blocked by cutting off the edge from Z to X.

Fig. 3: The backdoor criterion is not satisfied since Z is an unobserved variable.

After the backdoor adjustment, the intervened distribution P(Y|do(X)) gets rid of the spurious correlation between X and Z and captures the true causal effect of the variable X. Backdoor adjustment measures the causal effect of a variable by finding and blocking the backdoor paths pointing to it. However, the backdoor criterion may not be satisfiable in some causal graph patterns (e.g., when the variables that block the backdoor paths are unobserved). In such cases, the front-door adjustment can be applied to estimate causal effects. As Fig. 3 shows, assume that the variable Z is unobserved; the backdoor adjustment becomes invalid because the marginal distribution P(z) is not observed. But if we have an observed mediator variable W on the front-door path X → W → Y, then we can identify the effect of X on W directly, since the backdoor path from X to W is blocked by the collider at Y:

P(W|do(X)) = P(W|X).  (6)

Note that there is a backdoor path from W to Y, W ← X ← Z → Y, which can be blocked by applying backdoor adjustment on X:

P(Y|do(W)) = Σ_x P(Y|W, x) P(x),  (7)

and the total effect of X on Y can be written by summing over W:

P(Y|do(X)) = Σ_w P(W = w|do(X)) P(Y|do(W = w)).  (8)

The front-door adjustment formulation is then obtained by substituting Eq. 6 and Eq. 7 into Eq. 8:

P(Y|do(X)) = Σ_w P(w|X) Σ_x P(Y|w, x) P(x).  (9)

Front-door adjustment identifies the effect of X on Y by applying the do-operator twice, once at the mediator variable W and once at the variable X that blocks the backdoor path. In this way, the unobserved variable Z can be bypassed in the intervention.
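For discrete variables, both adjustments reduce to weighted sums over observed conditional distributions. The sketch below (a toy binary example with made-up probability tables, not taken from any surveyed work) contrasts the conditional P(Y|X) of Eq. 4 with the backdoor-adjusted P(Y|do(X)) of Eq. 5:

```python
import numpy as np

# Toy joint over binary Z (confounder), X (treatment), Y (outcome),
# factored as P(z) P(x|z) P(y|x,z) for the causal graph X <- Z -> Y, X -> Y.
p_z = np.array([0.7, 0.3])                   # P(Z = z)
p_x_given_z = np.array([[0.8, 0.2],          # row z=0: P(X=0|z), P(X=1|z)
                        [0.3, 0.7]])         # row z=1
p_y1_given_xz = np.array([[0.2, 0.4],        # row z=0: P(Y=1|X=0,z), P(Y=1|X=1,z)
                          [0.5, 0.9]])       # row z=1

def p_y1_given_x(x):
    # Conditional distribution (Eq. 4): sum_z P(Y=1|x,z) P(z|x).
    joint_xz = p_x_given_z[:, x] * p_z       # P(X=x, Z=z) for each z
    p_z_given_x = joint_xz / joint_xz.sum()  # Bayes' rule
    return float(np.sum(p_y1_given_xz[:, x] * p_z_given_x))

def p_y1_do_x(x):
    # Backdoor adjustment (Eq. 5): sum_z P(Y=1|x,z) P(z).
    return float(np.sum(p_y1_given_xz[:, x] * p_z))

for x in (0, 1):
    print(f"P(Y=1|X={x}) = {p_y1_given_x(x):.3f}, "
          f"P(Y=1|do(X={x})) = {p_y1_do_x(x):.3f}")
```

The two quantities differ whenever Z confounds X and Y. The front-door formula of Eq. 9 can be composed in exactly the same way from the observed quantities P(w|x), P(Y|w, x), and P(x).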
Traditional feature learning methods usually learn the spurious correlations introduced by confounders. This reduces the robustness of models and makes them hard to generalize across domains. Causal reasoning, a learning paradigm that reveals the real causality behind the outcome, overcomes this essential defect of correlation learning and learns robust, reusable, and reliable features. In this section, we review recent representative causal reasoning methods for general feature learning, which fall into three main paradigms: (1) embedding a structural causal model (SCM), (2) applying causal intervention/counterfactuals, and (3) Markov boundary (MB) based feature selection.

For embedding a structural causal model (SCM), Mitrovic et al. [93] proposed representation learning via invariant causal mechanisms (RELIC) to address self-supervised learning problems and achieved competitive performance in terms of robustness and out-of-distribution generalization on ImageNet. Shen et al. [94] proposed a disentangled generative causal representation (DEAR) learning method for causal controllable generation on both synthesized and real datasets.

To apply causal intervention or counterfactual inference to feature learning, Huang et al. [60] proposed a causal-intervention-based deconfounded visual grounding method to eliminate the confounding bias. Zhang et al. [62] presented a causal-inference-based weakly-supervised semantic segmentation framework. Tang et al. [64] presented a causal inference framework that disentangles the paradoxical effects of momentum to remove the confounder in long-tailed classification. Chen et al. [81] proposed a counterfactual critic multi-agent training (CMAT) approach so that the visual context is properly learned.

Causal feature selection aims to find a subset of features from a large number of predictive features, in order to reduce computational cost and build predictive models for variables of interest. Recent causality-based feature selection methods use Bayesian networks (BNs) and Markov boundaries (MBs) to identify potential causal features. The BN is a DAG representing the causal relations between variables, and the MB captures the local causal relationships between the class variable and the features in its MB. Since the BN over all variables may be very large and hard to compute, current causality-based methods focus on identifying the MB of a variable or a subset of the MB. For example, Wu et al. [56] introduced the PCMasking concept to explain a type of incorrect conditional independence (CI) test in MB discovery and proposed the CCMB algorithm to solve the incorrect-test problem. Yu et al. [57] presented theoretical analyses of the conditions for MB discovery in multiple interventional datasets and designed an algorithm for learning MBs from multiple interventional datasets. Yu et al. [55] formulated the causal feature selection problem with multiple datasets as a search problem, gave the upper and lower bounds of the invariant set, and then proposed a multi-source feature selection algorithm. Yang et al. [52] proposed the concept of N-structures and designed an MB-discovery subroutine that integrates MB learning with N-structures to discover the MB while distinguishing direct causes from direct effects. Yu et al. [50] proposed a multi-label feature selection algorithm, M2LC, which learns the causal mechanism behind data and is able to select causally informative features and visualize common features. Guo et al. [48] proposed an error-aware Markov blanket learning algorithm to address conditional independence test errors in causal feature selection. Ling et al. [54] proposed an efficient local causal structure learning algorithm, LCS-FS, which speeds up parent-and-children discovery by employing feature selection without searching for conditioning sets. Yu et al. [47] proposed a multiple-imputation MB framework, MimMB, for causal feature selection with missing data; MimMB integrates data imputation with MB learning in a unified framework so that the two key components can engage with each other.

Finding causal features improves the explanatory capability and robustness of models. Causal feature selection methods can provide a more convincing explanation for prediction than correlation-based methods. As the ICM principle implies, the underlying mechanism of the class variable can be learned from causal relations and thus can be transferred across different settings or environments.
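To make the MB idea concrete, the following is a minimal grow-phase sketch in the general spirit of constraint-based MB discovery (the greedy loop and the Fisher-z partial-correlation CI test are illustrative simplifications, not an implementation of any specific algorithm surveyed above, such as CCMB or LCS-FS):

```python
import numpy as np
from scipy import stats

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)              # precision matrix
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def is_independent(data, i, j, cond, alpha=0.01):
    """Fisher-z CI test: True if X_i and X_j are independent given cond."""
    r = np.clip(partial_corr(data, i, j, cond), -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z))) > alpha

def grow_mb(data, target, candidates, alpha=0.01):
    """Grow phase: greedily add variables that remain dependent on the
    target given the current boundary. A complete algorithm would add a
    shrink phase to remove false positives admitted early on."""
    mb = []
    changed = True
    while changed:
        changed = False
        for f in candidates:
            if f not in mb and not is_independent(data, target, f, mb, alpha):
                mb.append(f)
                changed = True
    return mb
```

Under the ICM principle, the set returned by such a procedure localizes the mechanism of the class variable, which is what makes MB-based features candidates for transfer across environments.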
Although the existing causal feature learning methods achieve promising performance, most of them focus on general feature learning without considering a more specific and challenging problem: visual representation learning.

Visual Representation Learning: State of the Art

Visual representation learning has made great progress in recent years. It utilizes spatial and/or temporal information to complete specific tasks, including visual understanding (object detection, scene graph generation, visual grounding, visual commonsense reasoning), action detection and recognition, and visual question answering. In this section, we introduce these representative visual learning tasks and discuss the existing challenges and the necessity of applying causal reasoning to visual representation learning.

Object detection aims to determine where objects are located in a given image (object localization) and which category each object belongs to (object classification), and to label them with rectangular bounding boxes (BBs) together with confidence scores. Deep learning frameworks for image object detection fall into two types. The first type follows the traditional object detection pipeline, first generating region proposals and then classifying each proposal into an object class. The other type treats object detection as a regression or classification problem and adopts a unified framework to directly obtain the final predictions (category and location). Region-proposal-based methods mainly include R-CNN [95], Spatial Pyramid Pooling (SPP-net) [96], Fast R-CNN [97], Faster R-CNN [98], Feature Pyramid Network (FPN) [99], Region-based Fully Convolutional Network (R-FCN) [100], and Mask R-CNN [101], some of which are interrelated (e.g., SPP-net modifies R-CNN with an SPP layer). Regression/classification-based methods mainly include MultiBox [102], AttentionNet [103], G-CNN [104], YOLO [105], Single Shot MultiBox Detector (SSD) [106], YOLOv2 [107], Deeply Supervised Object Detector (DSOD) [108], and Deconvolutional Single Shot Detector (DSSD) [109]. The two pipelines are connected by the anchors introduced in Faster R-CNN.

In video salient object detection, extending state-of-the-art saliency detectors from images to videos is challenging. Li et al. [110] presented a flow-guided recurrent neural encoder (FGRNE), which enhances the temporal coherence of per-frame features by exploiting both motion information, in terms of optical flow, and sequential feature evolution encoding, in terms of LSTM networks. Li et al. [111] developed a multi-task motion-guided video salient object detection network, which learns to accomplish two sub-tasks using two sub-networks: one sub-network for salient object detection in still images and the other for motion saliency detection in optical flow images. Yan et al. [112] presented an effective video saliency detector that consists of a spatial refinement network and a spatiotemporal module. By utilizing generated pseudo-labels together with a portion of manual annotations, the detector can learn spatial and temporal cues for both contrast inference and coherence enhancement. For video salient object detection, effectively taking object motion into consideration and obtaining robust spatial-temporal information are crucial.
However, non-object regions, occlusion, motion blur, and lens movement make it hard for the model to concentrate on the truly interesting object area.

Scene graph generation (SGG) aims to describe object instances and the relations between objects in a scene. With its powerful representation ability, SGG can encode images [113, 114] and videos [115, 116] into abstract semantic elements, without any restrictions on the attributes, types, and relations of objects. Therefore, the task of SGG is to build a graph structure that associates its nodes and edges well with the objects in the scene and their relations, where the key challenge is to detect/recognize the relations between objects. Currently, SGG methods can be divided into two classes: (1) methods using facts alone, and (2) methods introducing prior information. Most attention has been paid to the methods using facts alone, including CRF-based (conditional random field) SGG [114, 117], VTransE-based (visual translation embedding) SGG [118, 119], RNN/LSTM-based SGG [120, 121], Faster R-CNN-based SGG [122, 123], and GNN-based SGG [124, 125]. In addition, SGG methods add different types of prior information, such as language priors [126], knowledge priors [127, 128], visual contextual information [129], and visual cues [130]. Fig. 4 shows the related work on SGG; it can be clearly seen that most of the methods use a GNN model or introduce relevant prior information when conducting SGG. Existing SGG methods are still far from building a practical knowledge base. There is a serious conditional distribution bias of relationships in SGG methods: for example, knowing that the subject and object are "person" and "head", it is easy to guess that the relationship is "person has head".

Visual grounding usually involves two modalities, visual and linguistic data. The task aims to locate the target object in a given image according to the corresponding object description (a title or description). When locating the target object, it is necessary to understand the input description and integrate the information of the visual modality for localization prediction. Currently, visual grounding methods can be classified into three types: fully supervised [131-138], weakly supervised [139], and unsupervised [140]. First, fully supervised methods use box annotations with object-phrase information. They can be further divided into two-stage methods [132, 133, 135, 138] and one-stage methods [136]. The two-stage approach extracts candidate proposals and their features in the first stage through a Region Proposal Network (RPN) [97] or traditional algorithms (EdgeBoxes [141], Selective Search [142]). Second, weakly supervised methods [143, 144] only have images and corresponding sentences, with no box annotations for the object-phrases in the sentences. Due to the lack of mappings between phrases and boxes, weakly supervised methods design many additional loss functions, such as reconstruction losses, losses introducing external knowledge, and losses based on image-caption matching. Third, unsupervised methods use no image-sentence annotation. Wang and Specia [145] used off-the-shelf approaches to detect objects, scenes, and colours in images, and explored different approaches to measure the semantic similarity between the categories of detected visual elements and the words in phrases.
However, due to the existence of linguistic and visual biases, most visual grounding models depend heavily on specific datasets and lack good transferability and generalization performance.

Due to the success of BERT-related models in the field of NLP, researchers have begun to focus on a more challenging multi-modal reasoning task, visual commonsense reasoning (VCR). The VCR task needs to combine image information with the understanding of questions, and to obtain the correct answer as well as the reasoning process based on commonsense. The given image contains a series of bounding boxes with labels. In general, VCR can be divided into two sub-tasks: the Q → A task chooses an answer based on the question, and the QA → R task reasons based on the question and the answer, explaining why the answer was chosen. Due to the challenging nature of VCR, there are relatively few existing studies. Some of them resort to designing specific model architectures [146-150]. R2C [146] implemented this task with a three-step approach: associating text with the objects involved, linking answers with the corresponding questions and objects, and finally reasoning about the shared representations. Inspired by brain neuron connectivity, CCN [147] dynamically modeled visual neuron connectivity, which is contextualized by the queries and responses. HGL [148] leveraged vision-to-answer and question-to-answer heterogeneous graphs to seamlessly connect vision and language. Zhang et al. [150] proposed a multi-level counterfactual contrastive learning network for VCR by jointly modeling the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. Recently, BERT-based pre-training methods have been extensively explored in the vision and language domains. In general, most of them adopt a pretraining-then-transfer scheme and achieve significant performance improvements on VCR [151-153]. These models are usually pretrained on large-scale multimodal datasets (e.g., Conceptual Captions [154]) and then fine-tuned on VCR. At present, the promising performance on VCR is generally attributed to pre-trained big models and external prior knowledge. Compared with simple vision-linguistic tasks, the introduction of external knowledge brings new challenges: (1) how to retrieve limited supporting knowledge from external knowledge bases that contain massive data; (2) how to effectively integrate external knowledge with visual and linguistic features; (3) the reasoning process that provides interpretability needs supporting facts, which depend heavily on the design of the language structure.

The task of action detection and recognition includes two aspects: one is to identify all action instances in a video, and the other is to localize actions spatially and temporally. Nowadays, spatial-temporal action detection and recognition models can be divided into two categories: the first [6, 155-163] models spatial-temporal relationships based on convolutional neural networks (CNNs), and the other [164-168] is based on video transformer structures. Besides, skeleton-based models [169-172] have recently attracted great attention. Sun et al.
[156] proposed an actor-centric relational network (ACRN), which used a two-stream structure to extract the central character feature and global background information from the input clip, and then performed feature fusion for action classification. Feichtenhofer et al. [160] proposed a two-stream model named SlowFast networks, which contains a Slow pathway and a Fast pathway. Bertasius et al. [168] extended the ViT [173] design to video by proposing several scalable schemes for space-time self-attention. Arnab et al. [174] proposed pure-transformer architectures for video classification, including several variants that factorise the spatial and temporal dimensions of the input video. Although great progress has been made in spatial-temporal action detection and recognition based on CNN or transformer models, some critical problems remain in terms of the robustness and transferability of the models. The existing action detection and recognition models rely heavily on scenes and objects. When a model is well trained on one dataset, it is hard to generalize it to another dataset with different scenes. Additionally, these methods easily focus on static appearance or background information rather than the true motion area, due to the correlation learning at the core of most models. This can harm the reliability of the model as well as the robustness of the learned spatial-temporal representations. Causal reasoning has the powerful ability to uncover the underlying structural knowledge about human actions and thus to build a strong cognitive model that can fully discover causality and spatial-temporal relations.

Visual question answering (VQA) is a vision-language task that has received much attention recently. The objective of VQA is the following: given an image/video and a related question, a machine needs to reason over visual elements and general knowledge to infer the correct answer. The attention mechanism, which focuses on the critical parts of the image and the question and finds cross-modality correlations, is widely used in VQA models. The UpDn [175] framework is a typical conventional attention-based VQA method, which used a top-down attention LSTM [176] for the fusion of visual and linguistic features. Besides LSTMs, the transformer [177] can also be adopted for the VQA task, thanks to its powerful scaled dot-product attention block. VLP (visual-language pretraining) models based on BERT [178] show remarkable performance on the VQA task. ViLBERT [151] is a BERT-based visual and language pretraining framework, which uses self-attention transformer blocks [177] to model in-modality relations and develops a co-attention transformer block to compute cross-modality attention scores; it achieved state-of-the-art results on four visual-language tasks, including VQA, at that time. Compared with image QA [26, 27, 175, 179, 180], the video question answering (VideoQA) task is much more challenging due to the existence of extra temporal information. To solve the VideoQA problem, the model needs to capture spatial, temporal, visual, and linguistic relations to reason about the answer. To explore relational reasoning in VideoQA, Xu et al. [181] proposed an attention mechanism that exploits appearance and motion knowledge with the question as guidance. Later on, hierarchical attention and co-attention based methods were proposed to learn appearance-motion and question-related multi-modal interactions. Le et al.
[182] proposed the hierarchical conditional relation network (HCRN) to construct sophisticated structures for representation and reasoning over videos. Jiang et al. [183] introduced the heterogeneous graph alignment (HGA) network. Huang et al. [184] proposed a location-aware graph convolutional network to reason over detected objects. Lei et al. [185] employed sparse sampling to build a transformer-based model named ClipBERT to achieve end-to-end video-and-language understanding. Liu et al. [186] proposed a hierarchical visual-semantic relational reasoning (HAIR) framework to perform hierarchical relational reasoning. Although hierarchical attention mechanisms successfully improve performance on visual-language tasks, these models retain a strong reliance on modality bias [25, 187] and tend to capture spurious linguistic or visual correlations within the images/videos; they thus fail to learn multi-modal knowledge with good generalization ability and interpretability. Causal reasoning offers a promising alternative to address these challenges.

The discovery of causality helps to uncover the causal mechanisms behind the data, allowing machines to better understand why, and to make decisions through intervention or counterfactual reasoning. In this section, we summarize some recent approaches to causal visual representation learning, as shown in Table 1. Causal visual representation learning is an emerging research topic that has arisen since the 2020s. The related tasks can be roughly categorized into several main aspects: (1) causal visual understanding, (2) causal visual robustness, and (3) causal visual question answering. In this section, we discuss these three representative causal visual representation learning tasks.

Visual understanding covers several tasks, such as object detection, scene graph generation, visual grounding, and visual commonsense reasoning. However, these tasks face several challenges. (1) For image/video salient object detection, non-object regions, occlusion, motion blur, and lens movement make it hard for the model to concentrate on the truly interesting object area. To this end, causal reasoning can make the model focus on the essential object area by learning robust and reliable visual representations. (2) For the scene graph generation (SGG) problem, which suffers from superficial bias and insufficient generalization ability, causal reasoning can be introduced to mitigate these problems. For example, an item such as a towel is used for bathing in the bathroom, but it is used to wash the face in the office. Introducing causal reasoning into SGG can generalize the functionality of an item to different scenarios. (3) Due to the existence of linguistic and visual biases, most visual grounding models depend heavily on specific datasets and lack good transferability and generalization performance. This problem can be mitigated by causal reasoning methods, which learn robust and transferable features to overcome the visual and linguistic biases. (4) For visual commonsense reasoning (VCR), linguistic biases may directly affect the reasoning performance. Generally, the superficial correlations captured by existing VCR models can be mitigated by introducing causality that integrates external knowledge and visual and linguistic features into a robust and discriminative representation space. Non-causal visual understanding methods are easily affected by confounders in the visual content.
Illumination, position, backgrounds, the co-occurrence of objects, and other visual factors are confounders that are inevitable in common settings. With traditional correlation learning, the spurious correlations introduced by these confounders degrade the robustness of representation learning. For example, since the co-occurrence of "bird" and "sky" is high, a model will learn a strong correlation between them. Thus, when seeing a picture of a floating balloon that also contains "sky", it will confidently predict that it is a picture of a bird. Causal reasoning provides a good solution to this problem. By replacing the conditional distribution with the intervened distribution, the spurious correlation can be eliminated and the machine can learn the real causality. Applying interventions in the training procedure is a widely used implementation of causal intervention. In the visual recognition task, Wang et al. [58] combined adversarial training with causal intervention, modeled the different causal effects of mediators and confounders, and designed an adversarial training pipeline to improve the effect of mediators while suppressing the effect of confounders. Yue et al. [59] applied counterfactual inference to zero-shot and open-set visual recognition by proposing a generative causal model to generate counterfactual samples. Confounding effects also exist in the visual grounding task: Huang et al. [60] proposed a deconfounded visual grounding framework that conducts interventions on linguistic features. For the weakly-supervised semantic segmentation task, Zhang et al. [62] used the structural causal model to formulate the causalities between components, then constructed a confounder set and removed the confounders by backdoor adjustment. Prior bias is also a non-trivial problem in the SGG (scene graph generation) task. To reduce the negative impact of training bias in scene graph generation, Tang et al. [63] built a causal graph and extracted counterfactual causality from the trained graph to infer the causal effect of the training bias and then remove the negative bias. Besides, causal reasoning can replace the traditional re-weighting and re-sampling methods for resolving long-tailed distribution problems. Tang et al. [64] analysed how the momentum term in SGD introduces an unbalanced sample distribution, and then proposed using counterfactual inference at test time to detect and remove the causal effect of the momentum term. Wang et al. [65] proposed an unsupervised commonsense learning framework to learn intervened visual features by backdoor adjustment, which can be used in downstream tasks such as image captioning, visual question answering, and visual commonsense reasoning.
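A recurring implementation device in these works is to approximate the backdoor sum of Eq. 5 with a fixed dictionary of confounder prototypes (e.g., per-class average features) that the classifier sums over, weighted by a prior. The sketch below (PyTorch; the module name, dimensions, fusion network, and the expectation-over-prototypes approximation are illustrative assumptions in the general spirit of methods such as [65], not the exact module from any single paper):

```python
import torch
import torch.nn as nn

class BackdoorAdjustedHead(nn.Module):
    """Classifier head approximating P(Y|do(X)) ~ sum_z P(Y|X,z) P(z)
    with a precomputed confounder dictionary (illustrative sketch)."""

    def __init__(self, dim, n_classes, confounders, prior=None):
        super().__init__()
        # confounders: (K, dim) tensor, e.g., per-class average features.
        self.register_buffer("z_dict", confounders)
        K = confounders.size(0)
        # P(z): empirical prior over confounders (uniform if not given).
        prior = torch.full((K,), 1.0 / K) if prior is None else prior
        self.register_buffer("p_z", prior)
        self.fuse = nn.Linear(2 * dim, dim)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, x):                                   # x: (B, dim)
        B, K = x.size(0), self.z_dict.size(0)
        z = self.z_dict.unsqueeze(0).expand(B, K, -1)       # (B, K, dim)
        xz = torch.cat([x.unsqueeze(1).expand(B, K, -1), z], dim=-1)
        logits_xz = self.cls(torch.tanh(self.fuse(xz)))     # f(Y | X, z)
        # Expectation over P(z): the backdoor sum of Eq. 5.
        return (logits_xz * self.p_z.view(1, K, 1)).sum(dim=1)

head = BackdoorAdjustedHead(dim=256, n_classes=80,
                            confounders=torch.randn(16, 256))
print(head(torch.randn(4, 256)).shape)                      # torch.Size([4, 80])
```

In practice the dictionary is usually precomputed from dataset statistics rather than random, and the sum over z is often replaced by an attention-based approximation for efficiency.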
The ubiquitous spurious correlations learned by deep learning models reduce model robustness, which is a potential vulnerability of the conventional deep learning paradigm. From this perspective, the causal learning paradigm can be introduced to avoid confounding effects and make models more robust. Confounders are widespread in visual robustness problems, including few-shot learning, class-incremental learning, domain adaptation, and generative models. Yue et al. [71] uncovered that the pre-trained knowledge is a confounder in few-shot learning, and developed a few-shot learning paradigm that introduces backdoor adjustment to control the pre-trained knowledge. The confounding effect can also be leveraged by attackers: Tang et al. [68] proposed an instrumental variable [191] estimation based causal regularization method for adversarial defense. Hu et al. [69] explained the catastrophic forgetting effect in class-incremental learning in terms of causality, namely that the causal effect of the old data becomes zero, and then proposed distilling the causal effect of the old data by controlling the collider effect in the causal graph. As the ICM principle implies, causal mechanisms can be invariant across domains; hence learning invariant causal knowledge is likely to be superior in terms of robustness. To learn cross-domain knowledge, Yue et al. [70] disentangled the semantic attributes of images into causal factors and used CycleGAN [192] to generate counterfactual samples in the counterpart domain, then exploited the counterfactual samples and a latent variable encoded by a VAE [193] as proxy variables of the unobserved attributes for intervention. Apart from generating counterfactual samples, interventions can also be implemented by generative methods. Mao et al. [74] argued that conventional randomized controlled trials and intervention approaches can hardly be applied to naturally collected images, and introduced a framework that performs interventions on realistic images by steering generative models to generate the intervened distribution.

For visual question answering, the real causality behind the visual-linguistic modalities and the interaction between appearance-motion and language knowledge are neglected by most of the existing methods. In recent works, the main purpose of introducing causality into the visual question answering task is to reduce language bias. Strong correlations between the question and the answer make VQA models rely on spurious correlations without attending to the visual knowledge. For example, since the answer to the question "What is the color of the apple?" is "red" in most cases, a VQA model will easily learn the correlation between the word "apple" and the word "red". Thus, when given an image of a green apple, the model still predicts the answer "red" with strong confidence. Although simply balancing the dataset [25, 194] can partly mitigate the linguistic bias, the spurious correlation still exists in the model. From this perspective, causality-based solutions are better than simply balancing the data, since causal reasoning cuts off the superficial correlations and makes VQA models focus on the real causality. Constructing a confounder set is a common practice in causal intervention. VC R-CNN [65] constructed an object-level visual confounder set for performing backdoor adjustment in visual tasks. Following VC R-CNN, DeVLBert [190] treated nouns in the linguistic modality as confounders and constructed a language confounder set using their average BERT representation vectors. Moreover, DeVLBert incorporated intervention into BERT's [178] pretraining process and combined the masked modeling objective with causal intervention. As another implementation of intervention, Yang et al. [79] designed an in-sample attention module and a cross-sample attention module to conduct front-door adjustment, where the in-sample attention module approximates the probability P(W = w|x) and the cross-sample attention module approximates the probability P(x). Utilizing these attention modules, a cross-modality causal attention network was proposed for the VQA task by combining causal attention with the LXMERT [195] framework. Counterfactual-based solutions are also worth noting. Agarwal et al. [188] proposed a counterfactual sample synthesising method based on GANs [189].
To overcome the complexity of GAN-based synthesising, Chen et al. [81] replaced critical objects and critical words with mask tokens and reassigned answers to synthesise counterfactual QA pairs. Apart from sample synthesising methods, Niu et al. [78] developed a counterfactual VQA framework that reduces multi-modality bias by using the total indirect effect (TIE) [36] for the final inference. By blocking the direct effect of one modality, the TIE measures the total causal effect of the question and the visual information, thus reducing language bias in VQA.
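In schematic form, counterfactual inference of this kind subtracts a vision-blocked prediction from the factual one. The sketch below (PyTorch; the toy two-branch scorer and the zero-vector realization of the counterfactual visual input v* are illustrative assumptions following the general TIE recipe of [78], not that paper's actual architecture):

```python
import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    """Illustrative two-branch VQA scorer (not any surveyed architecture)."""
    def __init__(self, dim, n_answers):
        super().__init__()
        self.q_proj = nn.Linear(dim, n_answers)
        self.v_proj = nn.Linear(dim, n_answers)

    def forward(self, q, v):
        return self.q_proj(q) + self.v_proj(v)        # Z(q, v)

def tie_inference(model, q, v):
    # Factual prediction uses both modalities: Z(q, v).
    z_qv = model(q, v)
    # Counterfactual prediction blocks vision (v* = 0 here): Z(q, v*).
    z_q = model(q, torch.zeros_like(v))
    # The total indirect effect removes the question-only component.
    return z_qv - z_q

model = ToyVQA(dim=128, n_answers=10)
print(tie_inference(model, torch.randn(2, 128), torch.randn(2, 128)).shape)
```

Subtracting Z(q, v*) removes the part of the score the model can produce from the question alone, which is exactly the language-bias component that counterfactual VQA aims to discard.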
Although the above-mentioned causal visual representation learning methods successfully apply causal reasoning to uncover causal mechanisms and achieve promising results, causal reasoning for visual representation learning is still in its infancy and faces many challenges. Firstly, the existing causal visual representation tasks are limited to a few computer vision tasks and have not been applied to more diverse and challenging tasks such as video understanding, human-computer interaction, and urban computing. Secondly, although causal reasoning has burgeoned for many visual learning tasks, the existing evaluation datasets are still traditional datasets designed for correlation learning, without proper large-scale benchmark datasets and pipelines specified for causal reasoning. Thirdly, most of the existing methods focus on causality discovery in either the visual or the linguistic modality without considering both of them. Therefore, a more in-depth analysis of the relations between causal reasoning and visual representation learning is required.

Correlation-based models may perform well on existing datasets, not because these models have strong reasoning capability, but because those datasets cannot fully support the evaluation of the models' reasoning capability. Spurious correlations in these datasets can be exploited by a model to cheat, meaning that the model only concentrates on superficial correlation learning rather than real causal reasoning and merely approximates the distribution of the dataset. For example, on the VQA v1.0 dataset [196], a model that simply answers "yes" whenever it sees a question of the form "Do you see a..." achieves nearly 90% accuracy. Due to this shortcoming of current datasets, researchers need to build benchmarks that can evaluate the true causal reasoning capability of models. In this section, we take image question answering benchmarks and video question answering benchmarks as examples to analyse the current research situation of related causal reasoning datasets, and give some future directions.

Table 2: Image question answering datasets, summarized by image source, split numbers (train/val/test), collected or not, and rebalanced or not.

Dataset | Image source | Train/Val/Test | Collected | Rebalanced
VQA v1 [196] | COCO [197] | 614K/-/- | Yes | No
VQA v2 [25] | COCO [197] | 443K/214K/453K | No | Yes
VQA-CP v1 [187] | COCO [197] | 245K/-/125K | Yes | No
VQA-CP v2 [187] | COCO [197] | 438K/-/220K | No | Yes
IV-VQA [188] | COCO [197] | 257K/11.6K/108K | No | Yes
CV-VQA [188] | COCO [197] | 8.5K/0.4K/3.7K | No | Yes
AVQA [198] | Various | 142.1K/8.7K/26.4K | Yes | Yes

Image question answering benchmarks evaluate a model's capability of answering natural language questions about a corresponding image. Recent image question answering benchmarks try to collect or generate balanced QA pairs to make the question distribution of the dataset more balanced. VQA v2.0 [25] collects complementary QA pairs by replacing the image and the answer in a QA pair. VQA-CP [187] re-split the VQA v1 and VQA v2 datasets to construct two new datasets, VQA-CP v1 and VQA-CP v2. As Fig. 5 shows, Agarwal et al. [188] constructed the IV-VQA and CV-VQA datasets using semantic editing to generate images, which are then re-examined by humans. Li et al. [198] proposed a human-machine adversarial procedure to collect robust QA pairs.

Fig. 6: Example QA pairs in AVQA [198].

Fig. 6 illustrates the adversarial data collecting procedure. In Table 2, we summarize these datasets in terms of image source, split numbers, whether they are collected, and whether they are rebalanced. Current image question answering benchmarks use various approaches to overcome the bias introduced by unbalanced data. But there is still a lack of large-scale benchmark datasets that support fair and transparent evaluations of the causality behind the data and the reasoning ability of a method. Introducing causal concepts and causal methods, such as confounders and causal interventions, when building benchmark datasets may help resolve the lack of specific causal reasoning benchmark datasets.

The video question answering task is more complex than the image question answering task due to the ubiquitous correlation between spatial and temporal information, i.e., the introduction of complex temporal relations. Thus, improving the spatial-temporal causal reasoning ability of models can improve performance on this task, while simply approximating data distributions usually does not work. Some recently released benchmark datasets are therefore proposed to evaluate whether a model has the reasoning ability to understand the causal relation knowledge within the visual and linguistic content, as shown in Table 3 (excerpt: NExT-QA [204], causal and temporal interactions, 52K QA pairs, MC&OG question types, human annotation).

Fig. 7: A sample in CLEVRER, including four question types: descriptive, explanatory, predictive, and counterfactual [199].

CLEVRER [199] contains synthesised videos and automatically generated questions describing the collisions of geometric objects. A typical video and its question types from CLEVRER are shown in Fig. 7. It is a balanced, synthetic dataset that contains diagnostic annotations and counterfactuals. VQuAD [200] is also a diagnostic synthesised dataset. It is constructed as a balanced dataset by decomposing objects into attributes like texture and color and balancing the data distribution based on these attributes. A brief overview of VQuAD objects is shown in Fig. 8. VQuAD is a diagnostic dataset that can be used to evaluate the extent of the reasoning abilities of various video QA methods. ComPhy [201] is a video QA dataset that focuses on understanding object-centric and relational physics properties hidden from visual appearances. As shown in Fig. 9, the ComPhy dataset studies objects' intrinsic physical properties from their interactions and how these properties affect their motions in future and counterfactual scenes, in order to answer the corresponding questions. AGQA [202] includes a tremendous number of automatically generated QA pairs. An overview of AGQA is shown in Fig. 10. The QA pairs in AGQA are generated by parsing videos into scene graphs and composing questions over the scene graphs. SUTD-TrafficQA [203] is a traffic video question answering dataset with six challenging reasoning tasks, including basic understanding, event forecasting, reverse reasoning, counterfactual inference, introspection, and attribution analysis, to analyse models' reasoning ability. Fig. 11 shows an example of counterfactual traffic video question answering from SUTD-TrafficQA.
Notably, the counterfactual traffic video question answering task in Fig. 11 requires the outcome of a certain hypothesis that does not occur in the video. To accurately reason about the imagined events under the designated condition, the model is required not only to conduct relational reasoning in a hierarchical way but also to fully explore the causal, logical, and spatial-temporal structures of the visual and linguistic content.

Fig. 8: Illustration of an instance of the VQuAD dataset [200], which shows various questions generated for the created video and the difference in complexity in terms of hops for the questions.

Fig. 9: Sample reference videos, target video, and question-answer pairs from the ComPhy dataset [201].

Fig. 11: An example of a counterfactual question-answer pair in the SUTD-TrafficQA dataset [203].

NExT-QA [204] is a video question answering benchmark targeting the explanation of video contents, which requires a deeper understanding of videos and reasoning about causal and temporal actions from rich object interactions in daily activities. As shown in Fig. 12, the NExT-QA dataset contains rich object interactions and requires causal and temporal action reasoning in realistic videos. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities.

Causal reasoning with visual representation learning has a variety of applications, and modeling causal reasoning for a variety of tasks can achieve better perception of the real world. In this section, we introduce the applications from five aspects: image/video analysis, explainable artificial intelligence, recommendation systems, human-computer dialog and interaction, and crowd intelligence analysis. We also discuss how causal reasoning benefits various real-world applications, as shown in Fig. 13.

In image/video analysis, most of the existing work relies on learning data correlations rather than causal structures, and the superficial correlations within image and video data make models vulnerable to visual changes in the dataset. Therefore, a causality-aware feature learning strategy is required to make the model learn the essential causal structures behind the data and be robust to different data distributions. One of the main methods of dealing with superficial data correlations is causal intervention. Assume that commonsense knowledge exists in visual features but might be confused by observation bias. For example, the words "cup", "table", and "stool" have high co-occurrence frequencies because they commonly appear together in daily life, but a model relying on such commonsense knowledge may wrongly predict the class as "table" due to the observation bias. To reduce the observation bias, a causality-aware visual commonsense model is required, which regards the object category as a confounding factor and directly maximizes the likelihood after the intervention to learn the visual feature representation. By eliminating observation bias, the learned visual features become robust for image and video analysis tasks.

With the development of deep learning across industries and disciplines, the application of deep learning models to real-world scenes requires a high degree of robustness, interpretability, and transparency. Unfortunately, the black-box properties of deep neural networks are still not fully explainable, and many machine decisions are still poorly understood [205]. In recent years, causal interpretability has received more and more attention.
Several works [206-212] have made progress in explainable artificial intelligence based on causal interpretability. For example, during the COVID-19 pandemic, causal mediation analysis helped disentangle the different effects contributing to case fatality rates when an instance of Simpson's paradox was observed [213]. Learning the best treatment rules for each patient is one of the promising goals of applying explainable treatment effect estimation methods in the medical field. Since the effects of the different available drugs can be estimated and explained, doctors can prescribe better drugs accordingly.

At present, some causal reasoning works [214-218] have been applied to recommendation systems. Recommendation is actually a causal reasoning problem [214]. A user embedding represents what type of person the user is and infers the user's preferences based on the user's attributes. The causal effect of a recommendation system is whether the user is satisfied with the recommendation. Superficial bias exists because the recommendation system is trained on biased samples (of both users and items). One example is personalized recommendation, where we wish to build a model of a customer's shopping interests from various data sources, such as web-browser records and shopping histories. However, if we train a recommendation system on customers' records in controlled settings, the system may provide little additional insight into the customers' mental states and emotions, and thus may fail when deployed. While it may be useful to automate certain decisions, an understanding of causality may be necessary to recommend commodities in a personalized and reliable way. A general approach to removing survival bias is to construct counterfactual mirror users, build similarity measures using unbiased information, and construct matches from low-activity to high-activity users. In this way, we can alleviate users' dissatisfaction with the previously recommended content and the problem of low user activity.

For human-computer dialog and interaction, some emerging tasks involve the interaction between vision and language. Additionally, there exists multi-modal spatial-temporal information with complex relations captured by various devices. Most of the existing work relies on data correlations rather than causally relevant evidence, and the false correlations in the data make models vulnerable to the language biases in the problems. Take a VQA task as an example: when we remove visual objects that are unrelated to answering the question, the prediction of the model is not expected to change. This can prevent the model from relying on superficial data correlations. When changing objects that are related to the question, the model is expected to change the answer accordingly. Adjusting question-related objects encourages the model to predict based on causality-aware objects. For a better user experience, the human-computer dialog and interaction system is required to understand people's purposes and make reliable decisions. Causal reasoning is beneficial to the pursuit of reliable human-computer interaction by uncovering and modeling heterogeneous spatial-temporal information in a reliable and explainable way. Especially for robot interaction [88, 219-221], where the relevant environmental features are not known in advance, prior knowledge can be utilized to form good candidate causal structures.
For a better user experience, the human-computer dialog and interaction system is required to understand people's intentions and make reliable decisions. Causal reasoning benefits the pursuit of reliable human-computer interaction by uncovering and modeling the heterogeneous spatial-temporal information in a reliable and explainable way, especially for robot interaction [88, 219-221], where the relevant environmental features are not known in advance and prior knowledge can be utilized to propose good candidate causal structures. The strong relation between causal reasoning and the ability to intervene in the world suggests that causal reasoning can greatly address this challenge and thereby significantly benefit robotics applications.

The above-mentioned applications usually focus on a single subject, while crowd intelligence analysis [222] aims to address sensing and cognitive tasks for multiple subjects and their interactions. In recent years, we have witnessed the explosive growth of multi-modal heterogeneous spatial/temporal/spatial-temporal data from different kinds of sensors. Urban computing [223] is an example of crowd intelligence analysis, which aims to tackle traffic congestion, energy consumption, and pollution using the data generated by the large number of vehicles in cities (e.g., traffic flow, human mobility, and geographical data). Huge amounts of heterogeneous traffic data come from various sources, including both static and dynamic data, such as the traffic road network, geographic information system (GIS) data, traffic flow, traffic mobility, and traffic energy consumption. These heterogeneous spatial-temporal traffic data contain a large number of useful traffic rules with strong causal relations. Therefore, utilizing different heterogeneous spatial-temporal data and discovering their complex and entangled causal relations is beneficial to urban computing and crowd intelligence analysis.

Some research has successfully applied causal reasoning to visual representation learning to discover causality and visual relations. However, causal reasoning for visual representation learning is still in its infancy, and many issues remain unsolved. Therefore, this section highlights several possible research directions and open problems to inspire further extensive and in-depth research on this topic. Potential research directions for causal visual representation learning can be summarized as: 1) more reasonable causal relation modeling; 2) more precise approximation of intervention distributions; 3) more proper counterfactual synthesis processes; 4) large-scale benchmarks and evaluation pipelines.

Reasonable causality modeling is the basis of causal inference. Real-world data such as visual information are usually unstructured, and the effect of a causal relation may be unobserved; for example, momentum is likely to be detrimental under long-tailed data distributions [64], and there is no consensus on how to properly model causality in many tasks because the real causality may be more complicated than expected. For the VQA task, Yang et al. [79] treated the visual and language features as a single vertex in the causal graph, while Niu et al. [78] considered the visual and linguistic features separately. However, these methods focus on causality discovery in either the visual or the linguistic modality without jointly considering both. Therefore, future work should consider: 1) in-depth analysis of the relations between causal reasoning and visual representation learning; 2) comprehensive and reasonable causal relation modeling.

A precise estimation of the intervention distribution helps the implementation of a causal model. Most current approximation methods focus on identifying all confounders for a given task, and these confounders are usually defined as the average of object features in visual tasks [58, 65, 190] (a sketch of this common practice follows below). In fact, the average features may not properly describe a given confounder, especially for complex heterogeneous visual data. Thus, how to approximate the confounders more accurately is a key question that future causal intervention methods need to consider.
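For reference, here is a minimal sketch of the class-wise averaging practice criticized above, assuming pre-extracted object features with class labels; the function name and array shapes are our own illustrative choices. The resulting dictionary and prior are exactly what a backdoor-adjusted model such as the earlier sketch would consume:

```python
import numpy as np

def build_confounder_dictionary(features: np.ndarray, labels: np.ndarray,
                                num_classes: int):
    """Approximate each confounder as the mean feature of one object class.

    features: (N, d) pre-extracted object (e.g., RoI) features.
    labels:   (N,)   object class ids in [0, num_classes).
    Returns the (num_classes, d) confounder dictionary and the
    (num_classes,) prior P(z) estimated from class frequencies.
    """
    d = features.shape[1]
    z_dict = np.zeros((num_classes, d), dtype=features.dtype)
    z_prior = np.zeros(num_classes, dtype=np.float64)
    for c in range(num_classes):
        idx = labels == c
        if idx.any():
            z_dict[c] = features[idx].mean(axis=0)  # one average = one confounder
        z_prior[c] = idx.sum()
    z_prior /= max(z_prior.sum(), 1.0)              # normalize to a distribution
    return z_dict, z_prior
```

A natural refinement in the spirit of the criticism above is to keep several prototypes per class (e.g., via k-means within each class) so that a heterogeneous confounder is not collapsed into a single mean vector.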
Counterfactual-inference-based methods usually focus on refining the training procedure, i.e., embedding the counterfactual inference process into training. Counterfactual synthesis methods [59, 74, 81, 188] have proved effective in many tasks, and embedding counterfactual inference into models can effectively eliminate the bias within data; a novel counterfactual framework [78] gives us insight into this potential. However, visual data are often entangled and heterogeneous, which makes the data bias hard to understand and model. Therefore, how to model a proper counterfactual synthesis process is a promising direction for data debiasing in visual representation learning.

Although causal reasoning has burgeoned in many visual learning tasks, most existing evaluation datasets are still traditional datasets designed for correlation learning, without proper large-scale benchmark datasets and pipelines to support fair and transparent evaluation of emerging research contributions. The few existing causal datasets discussed in the preceding sections have limited scale and lack comprehensive evaluation standards for causal reasoning. Therefore, more large-scale benchmark datasets and pipelines for specific visual representation learning tasks should be developed in future research.

In general, causal visual representation learning is still an emerging and challenging research topic. Causal modeling, intervention distribution approximation, counterfactual inference, and large-scale benchmarks and evaluation pipelines all have great potential for further exploration.

This paper has provided a comprehensive survey of causal reasoning for visual representation learning, focusing on a prospective review of related works, datasets, insights, future challenges, and opportunities for causal reasoning, visual representation learning, and their integration. We mathematically presented the basic concepts of causality, the structural causal model (SCM), the independent causal mechanism (ICM) principle, causal inference, and causal intervention, and on this basis gave some directions for conducting causal reasoning on visual representation learning tasks. We also reviewed recent popular visual learning tasks, including visual understanding, action detection and recognition, and visual question answering, together with discussions of the existing challenges of these methods, and systematically discussed the related causality-based visual representation learning works and datasets. Finally, extensive applications and some potential future research directions were provided for further exploration. We hope that this survey can help attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.
References

Deep residual learning for image recognition
Knowledge-guided multi-label few-shot learning for general image recognition
Cx-tom: Counterfactual explanations with theory-of-mind for enhancing human trust in image recognition models
Temporal segment networks: Towards good practices for deep action recognition
Temporal relational reasoning in videos
Tsm: Temporal shift module for efficient video understanding
Tcgl: Temporal contrastive graph for self-supervised video representation learning
Deep textspotter: An end-to-end trainable scene text localization and recognition framework
Text recognition in the wild: A survey
Sign language recognition: A deep survey
Listen to look: Action recognition by previewing audio
Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning
Distilling audio-visual knowledge by compositional contrastive learning
Audio-visual contrastive learning for self-supervised action recognition
Hierarchically learned view-invariant representations for cross-view action recognition
Transferable feature representation for visible-to-infrared cross-dataset human action recognition
Deep image-to-video adaptation and fusion networks for action recognition
Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition
Instance-level salient object segmentation
Look into person: Joint body parsing & pose estimation network and a new benchmark
Relationship-embedded representation learning for grounding referring expressions
Actionnet: Multipath excitation for action recognition
Self-trained deep ordinal regression for end-to-end video anomaly detection
Global temporal representation based cnns for infrared action recognition
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Knowledge-routed visual question reasoning: Challenges for deep representation embedding
Linguistically routing capsule network for out-of-distribution visual question answering
Combining multiple features for cross-domain face sketch recognition
Generative multi-view human action recognition
Cross-modal knowledge distillation for vision-to-sensor action recognition
Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness
Not using the car to see the sidewalk: quantifying and controlling the effects of context in classification and segmentation
Benchmarking neural network robustness to common corruptions and perturbations
Why do deep convolutional networks generalize so poorly to small image transformations?
Elements of causal inference: foundations and learning algorithms
Toward causal representation learning
Evaluation methods and measures for causal learning algorithms
Visual interpretability for deep learning: a survey
Interpretable convolutional neural networks
Interpreting cnns via decision trees
Extraction of an explanatory graph to interpret a cnn
Mining interpretable aog representations from convolutional networks via active question answering
Interpretable cnns for object classification
Causality-based feature selection: Methods and evaluations
A unified view of causal and non-causal feature selection
Causal feature selection with missing data
Error-aware markov blanket learning for causal feature selection
Confounder identification-free causal visual feature learning
Multi-label feature selection: a local causal structure learning approach
Causalvae: Disentangled representation learning via neural structural causal models
Towards efficient local causal structure learning
Causality-based online streaming feature selection
Using feature selection for local causal structure learning
Multi-source causal feature selection
Accurate markov boundary discovery for causal feature selection
Learning markov blankets from multiple interventional data sets
Causal attention for unbiased visual recognition
Counterfactual zero-shot and open-set visual recognition
Deconfounded visual grounding
Acre: Abstract causal reasoning beyond covariation
Causal intervention for weakly-supervised semantic segmentation
Unbiased scene graph generation from biased training
Long-tailed classification by keeping the good and removing the bad momentum causal effect
Visual commonsense r-cnn
Counterfactual critic multi-agent training for scene graph generation
Explainable and explicit visual reasoning over scene graphs
Adversarial visual robustness by causal intervention
Distilling causal effect of data in class-incremental learning
Transporting causal mechanisms for unsupervised domain adaptation
Interventional few-shot learning
Learning causal representations for robust domain adaptation
A causal framework for distribution generalization
Generative interventions for causal learning
Exploiting causal structure for robust model selection in unsupervised domain adaptation
Temporal interaction and causal influence in community-based question answering
Introspective distillation for robust question answering
Counterfactual vqa: A cause-effect look at language bias
Causal attention for vision-language tasks
Two causal principles for improving visual dialog
Counterfactual samples synthesizing for robust visual question answering
Learning causal temporal relation and feature discrimination for anomaly detection
Temporal-spatial causal interpretations for vision-based reinforcement learning
Learning causal representation for training cross-domain pose estimator via generative interventions
Unsupervised motion representation learning with capsule autoencoders
Inferring hidden statuses and actions in video by causal reasoning
Superpixel-based causal multisensor video fusion
Robot learning with a spatial, temporal, and causal and-or graph
Temporal contrastive graph learning for video action recognition and retrieval
Deconfounded image captioning: A causal retrospect
Towards out-of-distribution generalization: A survey
Bias and debias in recommender system: A survey and future directions
Representation learning via invariant causal mechanisms
Disentangled generative causal representation learning
Rich feature hierarchies for accurate object detection and semantic segmentation
Spatial pyramid pooling in deep convolutional networks for visual recognition
Fast r-cnn
Faster r-cnn: Towards real-time object detection with region proposal networks
Feature pyramid networks for object detection
R-fcn: Object detection via region-based fully convolutional networks
Mask r-cnn
Scalable object detection using deep neural networks
Attentionnet: Aggregating weak directions for accurate object detection
G-cnn: an iterative grid based object detector
You only look once: Unified, real-time object detection
Ssd: Single shot multibox detector
Yolo9000: better, faster, stronger
Dsod: Learning deeply supervised object detectors from scratch
Dssd: Deconvolutional single shot detector
Flow guided recurrent neural encoder for video salient object detection
Motion guided attention for video salient object detection
Semi-supervised video salient object detection using pseudo-labels
3d scene graph: A structure for unified semantics, 3d space, and camera
Image retrieval using scene graphs
Storytelling from an image stream using scene graphs
Scene-centric joint parsing of cross-view videos
Detecting visual relationships with deep relational networks
Visual translation embedding network for visual relation detection
Union visual translation embedding for visual relationship detection and scene graph generation
Panet: A context based predicate association network for scene graph generation
Learning to compose dynamic tree structures for visual contexts
Vip-cnn: Visual phrase guided convolutional neural network
Vrr-vg: Refocusing visually-relevant relationships
Factorizable net: an efficient subgraph-based framework for scene graph generation
Attentive relational networks for mapping images to scene graphs
Visual relationship detection with language priors
Knowledge-embedded routing network for scene graph generation
Scene graph generation with external knowledge and image reconstruction
Neural motifs: Scene graph parsing with global context
Phrase localization and visual relationship detection with comprehensive image-language cues
Cross-modal relationship inference for grounding referring expressions
Scene-intuitive agent for remote embodied visual grounding
Refer-it-in-rgbd: A bottom-up approach for 3d visual grounding in rgbd images
Iterative shrinking for referring expression grounding using deep reinforcement learning
Mdetr: Modulated detection for end-to-end multi-modal understanding
Transvg: End-to-end visual grounding with transformers
Tree-structured policy based progressive reinforcement learning for temporally language grounding in video
Ref-nms: breaking proposal bottlenecks in two-stage referring expression grounding
Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos
Unsupervised textual grounding: Linking words to image concepts
Edge boxes: Locating object proposals from edges
Selective search for object recognition
Relation-aware instance refinement for weakly supervised visual grounding
Improving weakly supervised visual grounding by contrastive knowledge distillation
Phrase localization without paired training examples
From recognition to cognition: Visual commonsense reasoning
Connective cognition network for directional visual commonsense reasoning
Heterogeneous graph learning for visual commonsense reasoning
Tab-vcr: tags and attributes based vcr baselines
Multi-level counterfactual contrast for visual commonsense reasoning
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
Uniter: Universal image-text representation learning
Vl-bert: Pre-training of generic visual-linguistic representations
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Quo vadis, action recognition? a new model and the kinetics dataset
Actor-centric relation network
Long-term feature banks for detailed video understanding
Temporal pyramid network for action recognition
X3d: Expanding architectures for efficient video recognition
Slowfast networks for video recognition
Evidential deep learning for open set action recognition
Temporal contrastive graph for self-supervised video representation learning
Spatiotemporal representation factorization for video-based person re-identification
Relaxed transformer decoders for direct action proposal generation
Is space-time attention all you need for video understanding?
Oadtr: Online action detection with transformers
Temporal query networks for fine-grained video understanding
Is space-time attention all you need for video understanding?
Spatial temporal graph convolutional networks for skeleton-based action recognition
An attention enhanced graph convolutional lstm network for skeleton-based action recognition
Two-stream adaptive graph convolutional networks for skeleton-based action recognition
End-to-end human pose and mesh reconstruction with transformers
An image is worth 16x16 words: Transformers for image recognition at scale
Vivit: A video vision transformer
Bottom-up and top-down attention for image captioning and visual question answering
Long short-term memory
Attention is all you need
Bert: Pre-training of deep bidirectional transformers for language understanding
Vqa: Visual question answering
Stacked attention networks for image question answering
Video question answering via gradually refined attention over appearance and motion
Hierarchical conditional relation networks for video question answering
Reasoning with heterogeneous graph alignment for video question answering
Location-aware graph convolutional networks for video question answering
Less is more: Clipbert for video-and-language learning via sparse sampling
Hair: Hierarchical visual-semantic relational reasoning for video question answering
Don't just assume; look and answer: Overcoming priors for visual question answering
Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing
Generative adversarial nets
Devlbert: Out-of-distribution visiolinguistic pretraining with causality
Instrumental variables
Unpaired image-to-image translation using cycle-consistent adversarial networks
Auto-encoding variational bayes
Yin and yang: Balancing and answering binary visual questions
Lxmert: Learning cross-modality encoder representations from transformers
Vqa: Visual question answering
Microsoft coco captions: Data collection and evaluation server
Adversarial vqa: A new benchmark for evaluating the robustness of vqa models
Clevrer: Collision events for video representation and reasoning
Vquad: Video question answering diagnostic dataset
Comphy: Compositional physical reasoning of objects and events from videos
Agqa: A benchmark for compositional spatio-temporal reasoning
Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events
Next-qa: Next phase of question-answering to explaining temporal actions
A survey on explainable artificial intelligence (xai): Toward medical xai
Explaining visual models by causal attribution
Explaining deep learning models using causal inference
Causal learning and explanation of deep neural networks via autoencoded activations
Neural network attributions: A causal perspective
Causal interpretability for machine learning: problems, methods and evaluation
Generative causal explanations of black-box classifiers
Generative causal explanations for graph neural networks
Simpson's paradox in covid-19 case fatality rates: a mediation analysis of age-related causal effects
Disentangling user interest and conformity for recommendation with causal embedding
Mitigating confounding bias in recommendation via information bottleneck
Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system
Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue
Causal intervention for leveraging popularity bias in recommendation
From robot learning to robot understanding: Leveraging causal graphical models for robotics
Causal reasoning in simulation for structure and transfer learning of robot manipulation policies
Counterfactual explanation and causal inference in service of robustness in robot control
Mobile crowd sensing: incentive mechanism design
Urban computing: concepts, methodologies, and applications

This work was supported by the National Natural Science Foundation of China (No. 62002395) and the National Natural Science Foundation of Guangdong Province (No. 2021A15150123).