title: Mitigating Hidden Confounding Effects for Causal Recommendation
authors: Zhu, Xinyuan; Zhang, Yang; Feng, Fuli; Yang, Xun; Wang, Dingxian; He, Xiangnan
date: 2022-05-16

Recommender systems suffer from confounding biases when there exist confounders affecting both item features and user feedback (e.g., like or not). Existing causal recommendation methods typically assume confounders are fully observed and measured, overlooking the possible existence of hidden confounders in real applications. For instance, product quality is a confounder since it affects both item prices and user ratings, but it is hidden from third-party e-commerce platforms due to the difficulty of large-scale quality inspection; ignoring it can result in the bias of over-recommending high-price items. This work analyzes and addresses the problem from a causal perspective. The key lies in modeling the causal effect of item features on a user's feedback. To mitigate hidden confounding effects, it is necessary but challenging to estimate the causal effect without measuring the confounder. Towards this goal, we propose a Hidden Confounder Removal (HCR) framework that leverages front-door adjustment to decompose the causal effect into two partial effects, according to the mediators between item features and user feedback. The partial effects are independent of the hidden confounder and are identifiable. During training, HCR performs multi-task learning to infer the partial effects from historical interactions. We instantiate HCR for two scenarios and conduct experiments on three real-world datasets. Empirical results show that the HCR framework provides more accurate recommendations, especially for less-active users. We will release the code once accepted.

Data-driven models have become the default choice for building personalized recommendation services [1, 2]. These models typically focus on the correlation between item attributes and user feedback, and thus suffer from confounding bias [3, 4]. The source of such bias is a confounder that affects item attributes and user feedback simultaneously, leading to spurious correlations [5, 6]. For instance, high quality is the reason for a high item price and also leads to more positive ratings from users, resulting in a spurious correlation between high price and high rating. Blindly fitting the data will cause the over-recommendation of high-price items. Worse still, the confounding effect hurts fairness across item producers and makes the model vulnerable to attacks, e.g., some producers may intentionally increase prices to gain more exposure opportunities. It is thus essential to mitigate the confounding effect in recommendation.

Causal recommendation has been studied to eliminate the confounding effect, taking the causal effect of item attributes on user feedback as the recommendation criterion. Existing solutions fall mainly into two categories:
• Inverse Propensity Weighting (IPW), which adjusts the data distribution to be unbiased by re-weighting training interactions [7-12]. Although theoretically sound, IPW relies on the strict assumption of no hidden confounders, so these methods may fail when hidden confounders exist.
• Backdoor Adjustment, which performs causal intervention by adjusting the prediction probability over different values of the confounder according to its distribution [5, 13, 14].
Unfortunately, these methods cannot handle hidden confounders, since they require the confounder's distribution, which is unknown. It is indispensable to mitigate hidden confounding effects in recommendation, since many confounders are hard to measure due to technical difficulties, privacy restrictions, etc. [15]. For example, product quality is such a confounder in product recommendation: most e-commerce platforms cannot monitor the production process of items and cannot afford the overhead of large-scale inspection. News events are hidden confounders in video recommendation (current video recommendation systems typically neglect news events [16], which are costly to take into account). For example, COVID-19 brought videos with face masks and attracted more user attention to epidemic-related videos; the spurious correlation would result in the bias of over-recommending videos with face masks. In food recommendation, hidden confounders can cause severe effects: some restaurants may use banned food additives (e.g., poppy capsules) to please users for high ratings but will not disclose this due to its illegality; such a spurious correlation may mislead the model to recommend unhealthy food. To mitigate such biases, it is critical to consider hidden confounders in recommendation modeling.

Noticing the distinct properties of hidden confounders in different scenarios, we pursue a general solution for handling the hidden confounder in recommender systems. To understand its impact, we abstract the generation process of the like feedback (which conceptually denotes post-click feedback such as favorite, purchase, etc.) as a general causal graph in Figure 1(a). The hidden confounder (V) affects the item features (I) and the happening of like (L) through V → I and V → L, respectively. The item features I affect L through some mediators M, such as the matching of user-item features and intermediate feedback (e.g., click). The key to mitigating the hidden confounding effect lies in blocking the backdoor path (I ← V → L) to estimate the causal effect of I on L, i.e., P(L|U, do(I)) [5, 13]. This is non-trivial since V is unobserved, which prevents us from adopting any operation that requires the value (or distribution) of V, including randomized controlled trials.

In this work, we propose a general Hidden Confounder Removal (HCR) framework to estimate the causal effect P(L|U, do(I)) by performing front-door adjustment [6]. The core idea is to decompose the causal effect into two partial effects through the mediator M: 1) the effect of M on L, i.e., P(L|U, do(M)); and 2) the effect of I on M, i.e., P(M|U, do(I)). According to causal theory, both partial effects are identifiable and can be derived from plain conditional probabilities P(M|U, I) and P(L|U, I, M). In this light, we design HCR as a multi-task learning framework that simultaneously learns the two distributions from historical interactions. After training, we infer the partial effects and chain them up to obtain P(L|U, do(I)), which is used for making recommendations. We select two recommendation scenarios, e-commerce products and micro-videos, and instantiate HCR over MMGCN [17], a representative multimodal recommendation model. We conduct extensive experiments on three real-world datasets, validating the effectiveness of HCR, especially for less-active users. The main contributions of this work are summarized as follows:
• We study the new problem of hidden confounders in recommender systems and analyze it from a causal perspective.
• We propose a new causal recommendation framework, Hidden Confounder Removal, which mitigates the hidden confounding effect with front-door adjustment.
• We evaluate HCR in two practical scenarios and conduct extensive experiments on three real-world datasets, verifying the effectiveness of our proposal.

We first give a brief introduction of the notations used in this paper. We use upper-case characters (e.g., I), lower-case characters (e.g., i), and calligraphic fonts (e.g., I) to denote random variables, values of a random variable, and the sample space of a variable, respectively. Taking I as an example, we denote the probability distribution of a variable as P(I), where the probability of observing I = i from the distribution is denoted as P(i) or P(I = i). From a probabilistic perspective, the target of recommendation is to estimate P(L = 1|u, i), which denotes the like probability of a user-item pair (u, i) [18]. Conventional data-driven methods parameterize the target distribution as a recommender model f_Θ(u, i), where Θ denotes the model parameters. These methods learn the model parameters from a set of historical interactions D = {(u, i, l_{u,i}) | u ∈ U, i ∈ I}, where l_{u,i} ∈ {0, 1} indicates whether like happens between user u and item i, and U and I denote the user set and item set, respectively. After training, the model infers the interaction probability for each user-item pair and constructs a personalized ranking accordingly.

Causal Recommendation: To mitigate confounding biases, causal recommendation casts the recommendation problem as estimating P(L = 1|u, do(i)), which indicates the causal effect of the item features I on L [5]. Note that P(L = 1|u, do(i)) is a probability from the distribution P(L|U, do(I)); in the rest of this paper, we use P(L = 1|u, do(i)) and P(l|u, do(i)) interchangeably. Existing work on causal recommendation estimates the causal effect P(l|u, do(i)) under the setting of observing all confounders between I and L, i.e., ignoring hidden confounders. Noticing that hidden confounders are common in practice, we formulate the task as estimating the causal effect when hidden confounders exist.

In this section, we first introduce the causal graph describing the recommendation process with hidden confounders and analyze their impact. We then present the HCR framework that aims to mitigate the hidden confounding effects, followed by an instantiation of the framework. By definition, a causal graph [6] is a directed acyclic graph in which a node denotes a random variable and an edge denotes a causal relation between two nodes. A causal graph describes the abstract process of data generation and can guide the modeling of causal effects [5, 14]. Figure 1(a) shows the generation process of the like feedback with hidden confounders. We explain the semantics of the nodes and edges in the graph as follows:
• Nodes U and I denote the user and the item, specifically, the corresponding user and item features.
• Node L denotes the label of the like feedback. The like feedback conceptually denotes post-click user behaviors such as favorite, purchase, etc.
• Node V denotes the hidden confounders, which affect both the item features and the happening of like.
• Node M denotes a set of variables that act as mediators between {U, I} and L.
For example, click feedback is such a mediator: it is affected by the user and item features and is a prior behavior of the post-click feedback L, i.e., the happening of like depends on the happening of click.
• Edges I ← V → L denote that V affects both the item features and the happening of like. (As an initial attempt at studying the hidden confounder issue in recommender systems, we omit confounders between U and L and the edge V → U, which are left for future exploration; besides, a spurious correlation between U and L is unlikely to change the ranking of items for a user.)
• Edges {U, I} → M → L denote that U and I usually affect like through a set of mediators, e.g., the matching of user and item features. In other words, U and I do not lead to like on their own. Moreover, users often demonstrate multiple types of cascading feedback [19], e.g., click → add-to-cart → purchase (like) in e-commerce scenarios and click → finish → thumbs-up (like) on micro-video platforms. Therefore, prior feedback is also a mediator between {U, I} and the like feedback L.

Confounding effect. Note that the hidden confounder V opens the backdoor path I ← V → L, bringing spurious correlations between the item features I and the like feedback L.
• Conventional recommender models that are trained on historical interactions, i.e., observational data, inherit these spurious correlations, resulting in biased estimations of user preferences.
• While there exist causal models that estimate the causal effect P(l|u, do(i)), they can only consider observed confounders, i.e., they neglect the hidden confounder V. Consequently, the backdoor path through V still brings the confounding effect into their estimation of P(l|u, do(i)), so these methods also face bias issues.

We now consider how to mitigate the hidden confounding effect through the backdoor path I ← V → L without measuring the confounder V. The progress of causal inference provides us with a tool to handle our case, where a mediator exists between I and L. The key is front-door adjustment, which constructs the causal effect P(l|u, do(i)) from the underlying effects w.r.t. the mediator. This is because any change in the item features I can only affect the like feedback L when it has changed the value of the mediator, e.g., the matching between user and item features. Note that P(l|u, do(i)) means controlling the input item features I = i with the do-calculus [6], as shown in Figure 1(b). Accordingly, we can write the joint distribution of L, V, and M under the intervention as

P(l, m, v|u, do(i)) = P(v) P(m|u, i) P(l|u, m, v).   (1)

Eq. (1) holds due to the conditional independence of variables given their parent nodes. By summing the probabilities in Eq. (1) over v and m, we obtain the target causal effect:

P(l|u, do(i)) = Σ_m Σ_v P(v) P(m|u, i) P(l|u, m, v)   (a)
             = Σ_m P(m|u, do(i)) P(l|u, do(m)).       (b)   (2)

Eq. (2)(b) holds due to the backdoor adjustment [6], together with the identity P(m|u, i) = P(m|u, do(i)), which is justified below. In Eq. (2), P(l|u, do(m)) denotes the causal effect of M on L and P(m|u, do(i)) denotes the causal effect of I on M. In particular, P(l|u, do(m)) is the probability of like happening when forcibly setting the value of the mediator to m, and P(m|u, do(i)) represents how likely the mediator will be set to m when we choose the item features i. According to the causal graph in Figure 1, both P(l|u, do(m)) and P(m|u, do(i)) are identifiable; a toy numerical check of the front-door formula is sketched below.
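Before detailing why the two terms are identifiable, the following toy check may help build intuition for Eq. (2). It is a minimal Python sketch of a small discrete structural model mirroring Figure 1(a); the user U is dropped for brevity and all probability values are hypothetical. It compares the true interventional effect, which uses the hidden V, with the front-door estimate computed purely from observational quantities.

# Toy structural model: V -> I, V -> L, I -> M -> L, no direct I -> L and no V -> M.
p_v = {0: 0.6, 1: 0.4}                                        # P(V = v)
p_i = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}    # P(I = i | V = v), key (i, v)
p_m = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}    # P(M = m | I = i), key (m, i)
p_l1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}   # P(L = 1 | M = m, V = v), key (m, v)

def joint(v, i, m):
    """Observational joint P(v, i, m) implied by the structural model."""
    return p_v[v] * p_i[(i, v)] * p_m[(m, i)]

def p_l1_do_i(i):
    """Ground-truth interventional P(L = 1 | do(I = i)); this uses the hidden V."""
    return sum(p_v[v] * p_m[(m, i)] * p_l1[(m, v)] for v in (0, 1) for m in (0, 1))

def p_obs_i(i):
    """Observational marginal P(I = i)."""
    return sum(joint(v, i, m) for v in (0, 1) for m in (0, 1))

def p_l1_given_im(i, m):
    """Observational conditional P(L = 1 | I = i, M = m)."""
    num = sum(joint(v, i, m) * p_l1[(m, v)] for v in (0, 1))
    den = sum(joint(v, i, m) for v in (0, 1))
    return num / den

def front_door(i):
    """Front-door estimate of P(L = 1 | do(I = i)) from observational quantities only.
    Note that the observational P(m | i) coincides with p_m since M depends only on I."""
    return sum(p_m[(m, i)] *
               sum(p_l1_given_im(i2, m) * p_obs_i(i2) for i2 in (0, 1))
               for m in (0, 1))

for i in (0, 1):
    print(i, round(p_l1_do_i(i), 6), round(front_door(i), 6))  # the two columns coincide

Because the mediator fully transmits the effect of I on L and is not directly affected by V, the two computations agree, which is exactly what the front-door adjustment guarantees.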
• As to P(l|u, do(m)), we can block the backdoor path M ← I ← V → L without measuring V, because controlling V is equivalent to controlling I [6]. As such, we can obtain P(l|u, do(m)) by conducting a backdoor adjustment over the observable item features I, which is similar to existing causal recommendation methods for observable confounders [5, 20].
• As to P(m|u, do(i)), the backdoor path I ← V → L ← M is d-separated by the collider L [6]. Therefore, P(m|u, do(i)) = P(m|u, i), where m, u, and i are all observable values.

We then further derive the second term P(l|u, do(m)) by performing the backdoor adjustment over the observable item features I:

P(l|u, do(m)) = Σ_{i'} P(i') P(l|u, i', m),   (3)

where the adjustment is valid because conditioning on I blocks the backdoor path M ← I ← V → L, and the marginalization is only over the observable item features. By replacing P(l|u, do(m)) with Σ_{i'} P(l|u, i', m) P(i') and P(m|u, do(i)) with P(m|u, i) in Eq. (2), we obtain the causal effect free from the hidden confounder V:

P(l|u, do(i)) = Σ_m P(m|u, i) Σ_{i'} P(l|u, i', m) P(i').   (4)

Up to this point, we have freed the causal effect P(l|u, do(i)) from the hidden confounder V. We then consider estimating the causal effect from the historical data D. According to Eq. (4), to obtain P(l|u, do(i)), we need to: 1) in the training stage, estimate the conditional mediator probability P(m|u, i) and the conditional like probability P(l|u, i, m) from the historical data D; and 2) in the inference stage, avoid iterating over all values of I and M, since this is computationally costly, i.e., we need to get rid of the sums over i' and m in Eq. (4).

We estimate the two conditional probabilities P(m|u, i) and P(l|u, i, m) in the following steps.
Step 1. Modeling the conditional mediator probability P(m|u, i). We parameterize the distribution of the conditional mediator probability as f_m(u, i), where f(·) can be an arbitrary backbone model (e.g., MMGCN) that takes u and i as inputs, and f_m(u, i) denotes the predicted probability of M = m.
Step 2. Modeling the conditional like probability P(l|u, i, m). Given u, i, and the value of the mediator m, P(l|u, i, m) gives the like probability. We parameterize the distribution as a decomposed model in the following form:

P(l = 1|u, i, m) = h(u, i, m) = h_1(u, m) · h_2(u, i),   (5)

where h_1(·) and h_2(·) can be any backbone models for recommendation. Similar to [5, 20], our main consideration for the decomposition is that the correlation P(l|u, i, m) comes from two different sources: (1) M is correlated with L due to the causal path M → L given I; and (2) I is correlated with L due to the backdoor path I ← V → L given M.
Step 3. Estimating P(m|u, i) and P(l|u, i, m). As the backbone models have different target values, we adopt multi-task learning to learn them simultaneously. Formally,

min_Θ R_M(f_m(u, i), m) + β · R_L(h(u, i, m), l),   (6)

where the losses are accumulated over the historical interactions, R_M(·) and R_L(·) denote the recommendation losses of the two tasks (such as the cross-entropy loss), and β is a hyper-parameter to balance the two tasks. Note that we let the backbone models share the embedding layer to facilitate knowledge transfer across tasks [21]. Figure 2 shows our model architecture under the multi-task learning framework; we merge h_1(u, m) and h_2(u, i) in the figure for brevity.

To construct the recommendation list for each user, we need to calculate the causal effect P(l|u, do(i)) for each user-item pair. It is computationally costly to directly calculate the causal effect according to Eq. (4). Thanks to our design of decomposing h(u, i, m), we can get rid of the sum over i' as:

P(l|u, do(i)) = Σ_m f_m(u, i) h_1(u, m) Σ_{i'} h_2(u, i') P(i') = S_u Σ_m f_m(u, i) h_1(u, m),   (7)

where S_u = Σ_{i'} h_2(u, i') P(i') is a constant given u. As S_u will not influence the item ranking, we can safely omit it during inference. A minimal code sketch of this multi-task training and deconfounded inference is given below.
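To make Eqs. (5)-(7) concrete, the following is a minimal PyTorch sketch of the HCR multi-task design, written for a single binary mediator (e.g., click) with simple MLP heads standing in for MMGCN. All class, method, and variable names are illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn

class HCR(nn.Module):
    """Minimal HCR sketch: shared embeddings, a mediator head f_m, and a decomposed
    like head h_1 * h_2, trained with the multi-task objective of Eq. (6)."""

    def __init__(self, n_users, n_items, dim=64, beta=1.0):
        super().__init__()
        # Shared embedding layer to facilitate knowledge transfer across the two tasks.
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.f_m = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))  # P(M = 1 | u, i)
        self.h1 = nn.Sequential(nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, 1))   # h_1(u, m)
        self.h2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))   # h_2(u, i)
        self.beta = beta
        self.bce = nn.BCELoss()

    def p_mediator(self, u, i):
        """Estimated P(M = 1 | u, i), i.e., f_m(u, i)."""
        x = torch.cat([self.user_emb(u), self.item_emb(i)], dim=-1)
        return torch.sigmoid(self.f_m(x)).squeeze(-1)

    def p_like(self, u, i, m):
        """Decomposed P(L = 1 | u, i, m) = h_1(u, m) * h_2(u, i), as in Eq. (5)."""
        h1 = torch.sigmoid(self.h1(torch.cat([self.user_emb(u), m.unsqueeze(-1)], dim=-1))).squeeze(-1)
        h2 = torch.sigmoid(self.h2(torch.cat([self.user_emb(u), self.item_emb(i)], dim=-1))).squeeze(-1)
        return h1 * h2, h1

    def loss(self, u, i, m, l):
        """Multi-task objective of Eq. (6): R_M + beta * R_L with cross-entropy losses."""
        p_m = self.p_mediator(u, i)
        p_l, _ = self.p_like(u, i, m.float())
        return self.bce(p_m, m.float()) + self.beta * self.bce(p_l, l.float())

    @torch.no_grad()
    def score(self, u, i):
        """Deconfounded ranking score of Eq. (7): sum_m f_m(u, i) * h_1(u, m), with the
        user-specific constant S_u dropped since it does not affect the item ranking."""
        p_m = self.p_mediator(u, i)
        _, h1_pos = self.p_like(u, i, torch.ones_like(p_m))
        _, h1_neg = self.p_like(u, i, torch.zeros_like(p_m))
        return p_m * h1_pos + (1.0 - p_m) * h1_neg

During inference, score chains the learned mediator probability with h_1 over both mediator values and drops h_2 (the constant S_u), mirroring Eq. (7).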
In this part, we present an instantiation of the HCR framework in which two mediators are considered: the integrated user-item features and the click feedback. We denote the integrated features of a user-item pair and the click feedback as Z and C, respectively. As shown in Figure 3, Z and C also have a direct causal relation (Z → C), since the integrated user-item features affect the happening of the click feedback. By replacing M with C and Z in Eq. (4), we identify the causal effect as

P(l|u, do(i)) = Σ_{c, z} P(c, z|u, i) Σ_{i'} P(l|u, i', c, z) P(i').   (8)

Since the like feedback is conditioned on clicks, we have P(l = 1|u, i', c = 0, z) = 0. Besides, the integrated feature of a user-item pair, i.e., z, is determined given the input features of u and i. Hence, we have P(c, z|u, i) = 0 for any z ≠ z(u, i) (equivalently, P(z|u, i) = 0 for z ≠ z(u, i)) [13], where z(·) represents the feature integration function in recommender systems. Similar to [13], we implement f_m(u, i) and h(u, i, m) according to Eq. (9) and Eq. (10), whose component functions are backbone recommender models and should match the property of the input features in the specific scenario; for instance, we can select MMGCN when the features are in multiple modalities. As both C and L represent two different kinds of user feedback, the objective in Eq. (6) for training these backbone models is similar to multi-behavior recommendation [19], where the hyper-parameter β adjusts the weights of the different feedback. For inference, substituting the designed models into Eq. (7) gives

P(l|u, do(i)) ∝ f̂(u, i, z(u, i)) · ĥ_1(u, z(u, i)),   (11)

where the user-specific constant S_u is omitted since it does not affect the item ranking.

Algorithm 1: Training and Inference Procedure of HCR
Input: historical interactions D = {(u_j, i_j, c_j, l_j)}, where c_j and l_j denote the click and like labels, respectively.
Output: the identified causal effect of i on l.
1: Randomly initialize f_m(u, i) and h(u, i, m) as in Eq. (9) and (10);
2: while the stopping criterion is not met do
3:    Sample a mini-batch of interactions from D;
4:    Update f_m(u, i) and h(u, i, m) by minimizing Eq. (6);
5: end while
6: Return the causal effect P(l|u, do(i)) according to Eq. (11).

To summarize, Algorithm 1 shows the whole training and inference procedure of HCR for this case; a brief code sketch of this procedure is given below.
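As a companion to Algorithm 1, the following sketch shows how the HCR module from the previous snippet could be trained on (user, item, click, like) tuples and then used to rank items with the deconfounded score of Eq. (11). The dataset layout, function names, and hyper-parameters are illustrative assumptions, not the authors' code.

import torch
from torch.utils.data import DataLoader, TensorDataset

def train_hcr(model, interactions, epochs=10, lr=1e-3, batch_size=1024):
    """interactions: a (N, 4) tensor with columns (user, item, click, like)."""
    users, items, clicks, likes = interactions.T
    loader = DataLoader(
        TensorDataset(users.long(), items.long(), clicks.float(), likes.float()),
        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # stand-in for the stopping criterion in Algorithm 1
        for u, i, c, l in loader:
            # The click label c plays the mediator role: the integrated feature z(u, i)
            # is deterministic given (u, i), and a like can only happen after a click.
            loss = model.loss(u, i, c, l)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def recommend(model, user_id, candidate_items, k=50):
    """Rank candidates for one user by the deconfounded score of Eq. (11)."""
    i = torch.as_tensor(candidate_items, dtype=torch.long)
    u = torch.full_like(i, user_id)
    # Eq. (11): f_hat(u, i, z(u, i)) * h1_hat(u, z(u, i)); since a click (m = 1) is
    # required for a like, this is the mediator probability times h_1 at the realized mediator.
    p_m = model.p_mediator(u, i)
    _, h1_pos = model.p_like(u, i, torch.ones_like(p_m))
    scores = p_m * h1_pos
    return i[scores.topk(min(k, len(candidate_items))).indices]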
In this part, we discuss an underlying assumption of HCR on hidden mediators and the generality of HCR when measured confounders are also present.

Hidden mediators. Hidden mediators between I and L conflict with the front-door criterion, i.e., HCR works under a no-hidden-mediator assumption. Generally, there could be two kinds of mediators between I and L: 1) mediators M that are affected by user features; and 2) mediators M' that are independent of user features, as shown in Fig. 4(a). As to M, HCR accounts for both the integrated features and the prior behavior of like, which should be sufficient to avoid hidden mediators of this kind. As to M', we believe HCR cannot handle such a factor; however, such a factor is arguably ignorable, since practical recommender systems mainly pursue personalized services while M' affects like regardless of user features.

Fig. 4. (a) A special causal graph with a hidden mediator M'. M denotes the mediator related to the matching process of recommendation, and M' denotes a hidden mediator that is independent of the user features. (b) A special causal graph with a measured confounder V'. V denotes the hidden confounder and V' denotes the measured confounder.

Measured confounders. Confounders between item attributes and the like feedback can be divided into two categories: hidden confounders and measured confounders, as shown in Fig. 4(b). The existence of measured confounders (i.e., V') does not conflict with the proposed HCR framework, because the causal effect of I on L is still transmitted through the mediator M. Therefore, HCR is able to handle both categories of confounders. Note that HCR cannot be utilized when confounders are measured but no mediator exists; in such cases, we could leverage conventional methods [5, 20] designed for observed confounders. However, as argued above, we believe that mediators exist in most cases of recommender systems.

We conduct experiments to answer three main research questions:
• RQ1: Does removing hidden confounding effects with HCR benefit the recommendation performance? How does HCR perform compared with existing state-of-the-art methods?
• RQ2: How do the causal effect identification and the components of HCR influence the effectiveness of HCR?
• RQ3: Where do the improvements of HCR come from, and can HCR obtain unbiased user preference estimations and achieve stable improvements?

We conduct experiments on three publicly available real-world datasets: Tiktok, Kwai, and Taobao. All datasets contain multiple behaviors, one of which serves as the mediator, i.e., the click feedback. The statistics of the datasets are in Table 1.
• Tiktok. This is a multimodal micro-video dataset released in the ICME Challenge 2019. It records several types of user feedback on videos, including click, finish, and thumbs-up. We view both finish and thumbs-up as the like feedback, i.e., L in Figure 3. The Tiktok dataset contains textual, visual, and audio features for items; according to [20], the textual features extracted from the video captions can be treated as exposure features.
• Kwai. This is a micro-video dataset whose user feedback also includes click, finish, and thumbs-up. Similarly, we treat finish and thumbs-up as the like feedback. Besides the interaction data, visual features of video covers were extracted by the organizer and can be viewed as exposure features.
• Taobao. This is an e-commerce dataset including users' click and purchase records. Besides the interactions, the seller and category features are also provided. For this dataset, purchase is treated as the like feedback.

We sort the click feedback in chronological order, and the earliest 70% of clicks are used for training. Then, for each user, the liked items not included in the training set are equally divided into a validation set and a test set, keeping the chronological order. The main consideration here is that hidden confounding effects may drift over time (e.g., news events); thus, achieving better performance on the validation and test sets indicates a better capacity of a recommender model to get rid of the confounding effects (in the training set) and provide more accurate recommendations. We take the like feedback to evaluate model performance.

To evaluate the validity of our proposal, we compare HCR with various recommendation methods, which can be categorized into two groups: normal methods (CT and ESMM) and de-biasing methods (CR, DCF, Multi-IPW and Multi-DR), as described below.
• CT [17]. This method is conducted in the clean training (CT) setting, where only the like feedback is used to train the backbone model through a recommendation loss. Note that it is a single-task method that does not consider bias issues.
• ESMM [22]. This is a multi-task method that estimates the post-click conversion rate over the entire exposure space by jointly learning the click task and the post-click task with a shared embedding layer.
• CR [20]. This is a counterfactual recommendation method that mitigates the clickbait issue: it is trained on the click feedback and removes the direct effect of exposure features via counterfactual inference at inference time.
• Multi-IPW [8]. It is a de-biasing method which applies Inverse Propensity Weighting under a multi-task learning framework. It tries to solve the selection bias issue in post-click conversion rate (CVR) estimation; for this goal, it introduces an auxiliary CTR task to remedy the data sparsity issue, and the predictions of the CTR model are treated as propensities for the CVR task. We implement its CTR and CVR backbone models with MMGCN to exploit multimodal features.
• Multi-DR [8]. This is a Doubly Robust-based method under a multi-task (CVR and CTR) learning framework. Multi-DR makes use of IPW like Multi-IPW and adopts an imputation model to predict estimation errors for better de-biasing. Similarly, we implement its CTR, CVR, and imputation models with MMGCN.
• DCF [23].
This is a method that takes hidden confounders into account. Assuming item exposures are highly related to hidden confounders, it learns an exposure model by fitting exposures. The exposure model provides substitutes for the unobserved confounders, and DCF then leverages the substitutes to remove the impact of hidden confounders in rating data. Since no exposure data is provided in the three datasets, we use the click and like feedback as substitutes for exposures and ratings, respectively. We implement DCF based on MMGCN.

The baselines can also be categorized into two groups: methods with multi-task learning (ESMM, CR, Multi-IPW and Multi-DR) and methods without multi-task learning (CT, DCF). For a fair comparison, we implement all baseline methods with the backbone model MMGCN [17], since it utilizes multimodal item features and thus can achieve better performance. We then take the Tiktok dataset as an example to illustrate some implementation details of HCR. We use {f_v, f_a, f_t} to represent the visual, audio, and textual item features, respectively. The textual features f_t are extracted from the cover captions of micro-videos and are thus treated as exposure features; f_t affects the click C directly, i.e., the clickbait issue [20]. Since HCR adopts a multi-task framework, we use two MMGCN models (indexed by 1 and 2 for the two tasks) to generate user/item representations. The integrated features z, exposure features e, and content features c are then obtained via a feature fusion function G, e.g., mean(·); z and e contain both user and item features. We follow [20] and use multiplication as the fusion strategy for click prediction, where Y(·) denotes the inner product and the predictive probabilities based on z_1 and e_1 are multiplied; similarly, ĥ_1 = Y(z_2) and ĥ_2 = σ(Y(e_2)) · σ(Y(c_2)). Other details will be provided in our released code.

During evaluation, recommender models serve each user and generate a recommendation list by ranking the items that do not appear in the training set, i.e., the all-ranking protocol. Since the like feedback better indicates actual user preferences, we only treat the like feedback in the validation and test sets as positive samples. To measure the top-K recommendation performance, we adopt two widely used evaluation metrics: Recall@K (abbreviated as R@K), which considers whether the relevant items are retrieved within the top-K positions, and NDCG@K (abbreviated as N@K), which measures the relative order of positive and negative items in the top-K recommendation list. Due to the large number of items and the sparsity of the like feedback in real-world datasets, we report the results for K=50 and K=100. We optimize all models with the Adam [24] optimizer and use a default mini-batch size of 1024. For Tiktok and Kwai, we search the learning rate in the range of {1e-4, 5e-4, 1e-3}; for Taobao, we search the learning rate in the range of {1e-5, 1e-4, 5e-4, 1e-3}. For all methods, the L2 regularization coefficient is searched in the range of {1e-4, 1e-3, 1e-2}, and the settings of the backbone MMGCN follow the previous work CR [20], including the latent dimension, concatenation strategy, the number of GCN layers, etc. For HCR, the weight β in the multi-task loss function (i.e., Eq. (6)) is tuned in the range of {1, 2, 3, 5}. Moreover, early stopping is adopted for model selection: training stops if NDCG@50 on the validation set does not increase for 10 successive epochs. In this section, we study the recommendation performance of the proposed HCR framework.
We compare HCR with a variety of approaches, including the biased conventional methods and the de-biasing methods. The comparison results are summarized in Table 2, where we have the following observations:
• In most cases, the proposed HCR achieves distinct improvements over all baselines, showing its capacity to obtain more accurate user preference estimations. The improvement can be attributed to the deconfounded training and inference, which remove hidden confounding effects. In addition, HCR consistently outperforms all methods that do not model hidden confounders, i.e., all baselines except DCF. These findings reflect the rationality of our causal analysis of hidden confounders and validate the necessity of dealing with hidden confounder issues.
• HCR consistently outperforms ESMM, while both adopt multi-task learning frameworks. This result implies that the improvements of HCR over the baselines should be attributed to removing the impact of hidden confounders rather than to multi-task learning.
• CR is superior to CT on Tiktok and Kwai. It may seem confusing that the model trained with direct access to the like feedback (CT) cannot beat a model using only clicks (CR) when evaluated on the like data; however, this phenomenon is also reported in the CR paper [20]. There are two main reasons: 1) the like data is highly sparse; and 2) CT captures correlations between like and item features without eliminating the spurious correlations brought by hidden confounders, leading to biased estimations. CR does not directly fit the like data, but it uses counterfactual inference to remove biases in clicks; thus, CR achieves more accurate user preference estimations. Meanwhile, HCR outperforms CR w.r.t. R@50 by 16.75% and 4.88% on Tiktok and Kwai, respectively. These results again show that it is essential to address the problem of hidden confounders.
• Multi-IPW and Multi-DR are de-biasing recommendation methods, and they outperform CT on Tiktok. However, they cannot maintain improvements across all datasets. This phenomenon may be due to the high variance of IPW-based methods [4]. Another reason is that, in order to achieve the desired unbiased estimations, they assume the non-existence of hidden confounders, which usually does not hold in practice [23].
• DCF achieves good performance in several cases compared to the other baselines. This result can be attributed to the fact that it controls hidden confounding effects with substitutes of the hidden confounders. However, DCF also fails to beat the other baselines in many cases, and it is not as good as the proposed HCR (except Recall@50 on Tiktok). This is because substitute confounder estimations are not guaranteed to deal with arbitrary hidden confounders, especially when the confounder is weakly related to exposures. HCR makes fewer assumptions about hidden confounders and identifies the causal effect of item features on user preferences without measuring or estimating the hidden confounders, leading to improved performance.

To shed light on the performance improvements, we further study four variants of HCR, named HCR-T, HCR-S1, HCR-S2, and HCR-NS. The former three variants differ from the original HCR during inference, with model components disabled or replaced; the HCR-NS model disables the shared embedding layer between the two estimation modules in the training stage.
- HCR-T. Recall that we decompose h(u, i, m) into the product of h_1(u, z(u, i)) and h_2(u, i) as described in Eq. (5).
This decoupling design enables us to get rid of the sum over i' by removing h_2(u, i') in the inference stage. For the variant HCR-T, we adapt the inference formula by combining the trained conditional models directly (i.e., keeping ĥ_2(u, i) in the score instead of marginalizing it out); in effect, it only uses correlations to represent the effect of M on L, which disables the intervention performed during inference.
- HCR-S1 denotes the model in which we disable ĥ_1(u, z(u, i)) in the inference stage, i.e., items are ranked by f̂(u, i, z(u, i)) alone.
- HCR-S2 denotes the model in which we disable f̂(u, i, z(u, i)) during inference, i.e., items are ranked by ĥ_1(u, z(u, i)) alone.
- HCR-NS denotes the model in which we disable the shared embedding layer between the two probability estimation models in the training stage. During inference, HCR-NS adopts the same scoring function as the original model.

The recommendation performance of HCR and its four variants on the three datasets is summarized in Table 3. We can obtain the following findings from the results:
• In all cases, the performance of HCR is superior to that of HCR-S1 and HCR-S2, indicating that using a single estimation model under the HCR framework as the ranking score is detrimental to recommendation results. In particular, ĥ_1(u, z(u, i)) (adopted by HCR-S2) can be viewed as the partial effect of M on L while forgoing the effect of I on M. The absence of the complete causal effect explains the performance drops of HCR-S1 and HCR-S2. These findings validate the effectiveness of the multi-task framework and the causal effect identification.
• During inference, HCR-T directly combines the trained estimation functions fitted on the observed data to perform recommendations. Thus, the ranking scores of HCR-T reflect only correlations rather than causal effects, resulting in worse performance compared to HCR. The superior performance of HCR over HCR-T therefore reflects the rationality of our causal effect identification and the capacity of HCR to mitigate unobserved confounding effects.
• HCR achieves consistent gains over HCR-NS in all cases. Meanwhile, ESMM, which also adopts embedding sharing but without causal interventions, is outperformed by HCR (cf. Table 2). These results imply the necessity of combining embedding sharing and causal effect identification. We attribute the performance drop of HCR-NS to its failure to facilitate knowledge transfer across tasks; thus, the accurate causal effect P(l|u, do(i)) cannot be achieved due to insufficient estimations of the required correlations.

In this subsection, we compare HCR with CT to further investigate: 1) where the main performance improvements come from; 2) whether the improvements are stable over time; and 3) whether HCR recommends high-quality items.

We first investigate whether HCR achieves more improvements on the active or the less-active user group. To this end, for the two micro-video datasets (Tiktok and Kwai), we treat the top 40% of users (by the number of clicks) as active users and the others as less-active users; for Taobao, due to its higher sparsity, we select the top 20% of users to form the active user group. We compare 1) the absolute performance w.r.t. Recall@100 of HCR and CT, and 2) the relative improvements of HCR over CT. The results are summarized in Figure 5; a rough sketch of the user grouping and per-group Recall computation is given below.
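The following sketch (under an assumed data layout) shows how the activity split and the per-group Recall@100 behind Figure 5 can be computed; the function names and dictionary formats are illustrative.

import numpy as np

def split_by_activity(click_counts, active_ratio=0.4):
    """click_counts: {user_id: number of training clicks}. The top `active_ratio`
    fraction of users (0.4 for Tiktok/Kwai, 0.2 for Taobao) form the active group."""
    ranked = sorted(click_counts, key=click_counts.get, reverse=True)
    cut = int(len(ranked) * active_ratio)
    return set(ranked[:cut]), set(ranked[cut:])          # (active users, less-active users)

def recall_at_k(recommended, relevant, users, k=100):
    """recommended: {user: ranked item list}; relevant: {user: set of liked test items}.
    Returns the average Recall@K over the given user group."""
    scores = [len(set(recommended.get(u, [])[:k]) & relevant[u]) / len(relevant[u])
              for u in users if relevant.get(u)]
    return float(np.mean(scores)) if scores else 0.0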
According to Figure 5: 1) both HCR and CT achieve better performance in the less-active user groups. We conjecture that active users usually have diverse preferences and that their interests are more likely to drift over time; recall that the datasets are chronologically split, so the inherent user interests in the training and test sets differ, leading to the performance drop. In contrast, less-active users have stable interests. 2) HCR outperforms CT on both the active and less-active user groups in most cases (5 of 6), which verifies the superiority of HCR. 3) HCR achieves greater relative improvements in the less-active user group. As aforementioned, active users have more unstable interests: even though HCR obtains relatively more precise estimations of their interests during training, it cannot achieve greater improvements on the test set since their interests have drifted. As less-active users have more stable interests, the unbiased estimation of HCR (regarding hidden confounders) shows greater superiority over the biased estimation of CT. Meanwhile, improving the experience of less-active users is meaningful, since a large percentage of users belong to the less-active group due to the long-tail phenomenon.

Fig. 5. Performance of HCR and CT in the active and less-active user groups: (a) the absolute performance; and (b) the relative improvements of HCR over CT. "AU" and "LAU" are short for the active user group and the less-active user group, respectively. In (a), bars with and without slashes correspond to CT and HCR, respectively. Better viewed in color.

As aforementioned, the value of hidden confounders may change over time, e.g., different social events occur on different dates, which means their impact drifts over time. HCR should show stable and consistent improvements if hidden confounders are removed. Therefore, we evaluate the performance of HCR over time compared to CT. For each user, we evenly divide the corresponding liked items in the validation and test sets into four subsets chronologically, denoted as subsets 1, 2, 3, and 4. We conduct two experiments: 1) comparing the relative improvements of HCR over CT on the four subsets; and 2) evaluating the average performance drop of HCR and CT on subsets 2, 3, and 4 compared to subset 1. The results are summarized in Figure 6. Figure 6(a) shows that HCR achieves consistent gains over CT in all subsets, and Figure 6(b) shows that HCR maintains a relatively small performance drop compared to CT. These findings can be attributed to the removal of dynamic hidden confounders, which leads to more accurate estimations of user preferences. However, even though HCR consistently outperforms CT, its performance still declines over time. We attribute this to the drift of user interest: recommender models learned from training data tend to perform worse on subsets later in time. In the future, we may need to design models that remove the impact of hidden confounders dynamically to capture real-time unbiased user preferences.

We are also concerned with whether the proposed HCR can achieve consistent and unbiased user preference estimations. Thus, in this part, we sort items according to the like/click ratio and divide them into two subsets with a 1:2 size ratio. Items with higher like/click ratios are more likely to satisfy users' interests, while over-recommending items with low like/click ratios is likely to hurt the user experience. We use normalized recall to evaluate the recommender models, defined as the recall metric normalized by the proportion of the target item group in the recommendation list; a sketch of this metric is given below. The performance of HCR and CT in the two groups is shown in Figure 7.
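A rough sketch of the normalized recall used for Figure 7, under an assumed interpretation of the definition above: the group-wise recall is divided by the share of the target group in the top-K list, so that simply recommending more items from a group does not automatically inflate its score. Names and data formats are illustrative.

import numpy as np

def normalized_recall(recommended, relevant, group, k=100):
    """recommended: {user: ranked item list}; relevant: {user: set of liked test items};
    group: set of item ids in the target (high or low like/click ratio) item group."""
    scores = []
    for u, rel in relevant.items():
        rel_in_group = rel & group
        if not rel_in_group:
            continue
        top_k = recommended.get(u, [])[:k]
        if not top_k:
            continue
        hit = len(set(top_k) & rel_in_group) / len(rel_in_group)               # group-wise recall
        share = max(sum(item in group for item in top_k) / len(top_k), 1e-12)  # group share in the list
        scores.append(hit / share)
    return float(np.mean(scores)) if scores else 0.0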
HCR achieves consistent gains over CT in item groups with higher like/click ratios, showing that HCR can provide more recommendations with high quality items due to the removal of hidden confounders. In this section, we first overview existing work on debiasing in recommendation, then we specially discuss the work to deal with the bias from the confounder perspective. Recommender systems face a variety of bias issues, such as position bias [25, 26] , exposure bias [11, 27] and popularity bias [5, 28] . Much effort has been paid to these issues in recent years. Existing methods dealing with biases can be categorized into the following technical routes. The most widely considered approaches are based on the IPW [7, 8, [10] [11] [12] , which try to adjust the data distribution to be unbiased by re-weighting the training samples. The performance of IPW-based methods depends on the accuracy of the propensity scores, and they are unbiased only when the true propensity scores are estimated. Moreover, they suffer from high variance [4] since obtaining accurate propensities is challenging in practice. Doubly Robust (DR) [29] [30] [31] combines the IPW-based methods with an imputation model. Similarly, its performance is related to the accuracy of the propensities. However, these methods do not consider hidden confounding effects and thus there is no guaranteed accuracy of propensities in our settings. Leveraging unbiased data is another option to address bias issues [10, [32] [33] [34] . However, obtaining unbiased data requires applying causal intervention on interactions during the data generation process. Thus unbiased data collections are likely to harm the user experience [29] . Moreover, controlling confounders for intervention experiments can not be conducted with confounders hidden. There are also some heuristic methods for recommendation de-biasing, such as re-ranking [28, 35] and loss constraints [36, 37] . However, these methods lack sound theoretical basis, and it is challenging to find proper heuristics to handle with hidden confounders. Recently, causally-aware methods have become thriving in recommendation. Some efforts have been paid to address confounding problems in recommendation. Most of them focus on observed confounders, and a few works focus on hidden confounders. We introduce these two aspects of work in the following. Several recent works have adopted causal tools to analyze the root causes of bias problems, and identify the confounding effects as the root reasons [5, 13, 14, [38] [39] [40] . PDA [5] identifies the popularity as a confounder that affects both the exposures and clicks, and treats the confounding effects as the bad effects of popularity bias. [13] identifies the distribution of historical interactions as a confounder for the bias amplification issue. [14] identifies the momentum of SGD optimizer as a confounder, which is thought as another reason for popularity bias in session-based recommendations. [40] identifies item aspects (e.g., actor) as confounders in scenarios with heterogeneous information. These works perform the backdoor adjustment in the training or inference stage to handle the confounding problems, which requires controlling confounders. In contrast, we do not identify any particular controllable factor as a confounder but consider hidden confounders. Therefore, the backdoor adjustment-based methods fail to handle this case. 
[38] identifies user/item features as confounders when estimating the causal effect of recommendation, and [39] identifies the response rate as a confounder for user satisfaction; both rely on IPW to achieve deconfounding. Besides, counterfactual inference-based methods [20, 41] also have the potential to solve confounding bias issues, but these methods likewise assume that the confounders can be measured. For general recommendation, there are a few works [23, 42] that aim to mitigate the hidden confounding bias without direct measurement. DCF [23] considers hidden confounders and assumes that they are related to exposures; thus, DCF learns an exposure model to compute substitutes for the confounders. By leveraging the substitutes, DCF removes the impact of hidden confounders on ratings. However, such substitute confounder estimation is not guaranteed to address arbitrary hidden confounders, especially when the confounder is weakly related to the exposure. [42] considers the case where the model inputs (user and item feature vectors) contain unknown confounders, which affect the treatment (defined as the recommendation strategy) and the outcomes. This work learns certain biased representations and discards the biased component during inference; the process is based on the information bottleneck and does not require explicit identification of the confounder. However, confounders are not necessarily included in the model inputs, e.g., news events. HCR does not assume that confounding effects are fully reflected in embedding representations; instead, HCR focuses on general confounders that simultaneously affect item attributes and user feedback. For sequential recommendation, [43] proposes an unbiased approach by modeling the hidden confounder. Assuming that the hidden confounder can be estimated from historical interactions, this work estimates the confounder with a GRU, and then IPW is used to achieve deconfounding, where the propensities are defined by the estimated confounders. DEMER [44] also considers hidden confounders in the sequential scenario, but focuses on reinforcement learning-based recommendation. This work leverages a confounding agent to simulate the effect of hidden confounders when formulating the environment reconstruction; a confounder-embedded policy and a compatible discriminator are then proposed to achieve a deconfounded environment reconstruction. These two methods model hidden confounders by exploiting historical user feedback or by simulating a confounding agent in an interactive environment. However, it is difficult to adapt them to general recommendation settings. Moreover, disentangling hidden confounding effects from observed data remains an open problem without enough inductive bias or supervised information [45]. Instead, we regard item attributes as the treatment and identify the causal effect without explicitly modeling the hidden confounder.

In this paper, we highlighted the importance of considering hidden confounders in recommender systems. We resorted to causal language to abstract the recommendation process as a causal graph. Inspired by the front-door adjustment technique rooted in causality theory, we proposed a novel deconfounded training and inference framework named Hidden Confounder Removal (HCR), which blocks the hidden confounding effect when estimating the causal effect P(l|u, do(i)). We instantiated HCR in the micro-video and e-commerce product recommendation scenarios over a representative multimodal model, MMGCN.
Empirical results on three real-world datasets validate the advantages of eliminating hidden confounding effects. This work shows the significance of modeling user preferences with hidden confounding effects removed. The general design of HCR makes it model-agnostic, so it can be adapted to other recommendation settings. The causal effect identification of HCR relies on the identification of mediators. Therefore, we will extend the proposed HCR framework in the following directions: 1) handling the cases where mediators do not exist or do not satisfy the front-door criterion; 2) considering hidden confounders between like and user features; 3) investigating the relation between hidden confounders and bias issues in recommender systems, e.g., selection bias and popularity bias; and 4) testing the HCR framework with more backbone model architectures in more recommendation scenarios, such as food recommendation.

References
[1] Deep learning for matching in search and recommendation
[2] Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations
[3] Bias and debias in recommender system: A survey and future directions
[4] Asymmetric tri-training for debiasing missing-not-at-random explicit feedback
[5] Causal intervention for leveraging popularity bias in recommendation
[6] Causal inference in statistics: A primer
[7] Recommendations as treatments: Debiasing learning and evaluation
[8] Large-scale causal approaches to debiasing post-click conversion rate estimation with multi-task learning
[9] Offline evaluation to make decisions about playlist-recommendation algorithms
[10] AutoDebias: Learning to debias for recommendation
[11] Unbiased recommender learning from missing-not-at-random implicit feedback
[12] Enhanced doubly robust learning for debiasing post-click conversion rate estimation
[13] Deconfounded recommendation for alleviating bias amplification
[14] CauSeR: Causal session-based recommendations for handling popularity bias
[15] Causally attentive collaborative filtering
[16] Recommender systems leveraging multimedia content
[17] MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video
[18] A survey on neural recommendation: From collaborative filtering to content and context enriched recommendation
[19] Learning to recommend with multiple cascading behaviors
[20] Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue
[21] A survey on multi-task learning
[22] Entire space multi-task model: An effective approach for estimating post-click conversion rate
[23] Causal inference for recommender systems
[24] Adam: A method for stochastic optimization
[25] PAL: A position-bias aware learning framework for CTR prediction in live recommender systems
[26] Cross-positional attention for debiasing clicks
[27] Modeling user exposure in recommendation
[28] Popularity-opportunity bias in collaborative filtering
[29] Doubly robust joint learning for recommendation on data missing not at random
[30] Enhanced doubly robust learning for debiasing post-click conversion rate estimation
[31] Offline A/B testing for recommender systems
[32] Causal embeddings for recommendation
[33] A general knowledge distillation framework for counterfactual recommendation via uniform data
[34] Combating selection biases in recommender systems with a few unbiased ratings
[35] Managing popularity bias in recommender systems with personalized re-ranking
[36] ESAM: Discriminative domain adaptation with non-displayed items to improve long-tail performance
[37] Controlling popularity bias in learning-to-rank recommendation
[38] Unbiased learning for the causal effect of recommendation
[39] Deconfounding user satisfaction estimation from response rate bias
[40] Causal disentanglement for semantics-aware intent learning in recommendation
[41] Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system
[42] Mitigating confounding bias in recommendation via information bottleneck
[43] Unbiased sequential recommendation with latent confounders
[44] Environment reconstruction with hidden confounders for reinforcement learning based recommendation
[45] Disentangling user interest and conformity for recommendation with causal embedding