FEBR: Expert-Based Recommendation Framework for beneficial and personalized content
Mohamed Lechiakh, Alexandre Maurer
2021-07-17

So far, most research on recommender systems has focused on maintaining long-term user engagement and satisfaction, by promoting relevant and personalized content. However, it is still very challenging to evaluate the quality and the reliability of this content. In this paper, we propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of the recommended content on online platforms. The framework exploits the demonstrated trajectories of an expert (assumed to be reliable) in a recommendation evaluation environment, to recover an unknown utility function. This function is used to learn an optimal policy describing the expert's behavior, which is then used in the framework to provide high-quality and personalized recommendations. We evaluate the performance of our solution through a user interest simulation environment (using RecSim). We simulate interactions under the aforementioned expert policy for video recommendation, and compare its efficiency with standard recommendation methods. The results show that our approach provides a significant gain in terms of content quality, evaluated by experts and watched by users, while maintaining almost the same watch time as the baseline approaches.

Recommender systems (RS) try to provide their users with content matching their interests and preferences. To do so, they use many different approaches: collaborative filtering, content-based approaches, hybrid approaches, etc. [28]. Recent works on collaborative interactive and conversational RS showed promising methods to improve the relevance and the personalization of recommendations [2, 9, 12, 19]. They mainly focus on modeling complex user behaviors and dynamic user/item interactions. For this purpose, they use techniques derived from the optimal control paradigm, preference elicitation learning, deep learning and natural language processing (NLP) [11, 21, 27, 35, 36]. Nowadays, recommendation platforms are often judged to be lacking transparency and accountability in their algorithmic recommendations, which have a tendency to capture user attention and make the platform more addictive. Thus, user values and principles are sometimes underestimated, and even intentionally suppressed, to fulfill the company's objectives. In fact, aligning recommendations with human values is a complicated problem that requires (1) a good understanding of user behaviors and (2) optimizing the right metrics. Here, these metrics must be carefully designed to adapt to user preferences and goals in a fair, accurate and ethical way (e.g., by reducing addictive, harmful, misleading and polarizing content). In practice, finding beneficial information within mainstream RS is far from obvious. This problem becomes very challenging when trying to personalize the recommendation service, and it is especially prevalent with the news and informative content provided by these platforms.
In the case of video RS like YouTube, the content quality must be discussed from many perspectives, which includes: the amount of expertise, authoritativeness and trustworthiness of the content creator; the validity and accuracy of the content itself; its ability to create and support good habits and behaviors; the user engagement and satisfaction towards this content; etc. These quality features characterize what we call a beneficial personalized content. According to YouTube Official Blog [5] , YouTube recently adopted the concept of social responsibility as a core value for the company. In this context, it defined new metrics for quality evaluation (namely "user watch time" and "quality watch time"). In addition, it started recommending from sources that the company considers to be authoritative, and reducing suggestions of "borderline" videos. However, despite these efforts, YouTube's recommendation algorithm still suggests content containing misleading or false information, as well as conspiracy theories. Therefore, in such huge RS, high-quality content can be "drowned" among low-quality content, making it less visible to users. 2 Unfortunately, designing robust ML methods to recommend beneficial and engaging content seems hopeless in systems of large dimension, with billions of users and items. Our contribution. To address this problem in current recommendation environments, we propose FEBR (Framework for Expert-Based Recommendations), the first expert-based Apprenticeship Learning (AL) [14] framework for news (video, articles. . . ) RS. In our solution, we try to use relevant quality signals to identify beneficial and personalized content to be recommended to users. The novelty in our approach is in the way we measure this quality, and the guarantees that it presents in terms of importance, accuracy and reliability. We derive our quality metrics from experts involved in the evaluation part of the framework, using a customized evaluation model. Personalization is essentially achieved by the classifier which matches the user state model to its closest expert state model (from the expert state dataset). Our simulation results show a consumption of high-quality content that remarkably exceeds baseline approaches, while maintaining a very close total watch time (as shown in Figures 3 and 4 of Section 6). In addition, we implemented our solution as a configurable framework (using RecSim [16] ), that can be used for simulation experiments related to our topic. Overview of our solution. We took inspiration from the We-BuildAI framework [18] : a collective participatory framework that enables stakeholders to provide useful inputs for learning personalized models in order to create algorithmic policies. Our proposed framework FEBR is a three-part participatory system where the RS, experts and users collaborate between each other in Markov Decision Process (MDP) environments, to leverage expert knowledge for a better user experience. We assume that experts are reliable (we are aware that they can only evaluate a small fraction of the system's items). An intuitive idea would be to directly inject the set of evaluated videos as a "ground truth" of reliable quality content, that would be exploited by collaborative filtering techniques to improve the quality of recommended contents. 
However, this approach is very limited, since it relies on the probability that a given evaluated video will likely be explored and recommended based on the similarities between user profiles (which would not be efficient in RS with large user and item spaces). Therefore, we started from the assumption that a reliable expert's behavior across the recommended contents (for a given topic) is different from a layman's one. 3 Thus, if we could guide a user to mimic an expert's trajectory (i.e., her sequence of visited states within the recommendation environment) while respecting her own preferences, we would achieve high quality while maintaining high engagement. However, it seems to be difficult to exactly know the expert's intentions throughout her watching and evaluation session, since she is dealing with an interactive environment with a large action space. Therefore, it is almost impossible to capture an expert's behavior through a RL method that tries to optimize a given reward function: in this case, unlike classical problems, we cannot specify the reward. Indeed, the reward would depend on many observable and latent environment parameters, including the expert and system's states, and the complex dependencies and relationships this may involve. Besides the difficulty of inferring the objective of the expert's behavior from a demonstration (i.e., an expert trajectory within a continuous session), we believe that, in general, in a partially observable MDP environment, her demonstration is actually a series of state-action operations that aim to maximize an unknown reward function, which reflects different aspects of the expert's intuitions, estimations, choices and evaluations. Thus, our problem can be seen as an AL problem that uses Inverse Reinforcement Learning (IRL) [14, 26] to recover a complex reward function in an expert MDP environment (within an AL/IRL component). This reward function will be used to generate the optimal expert policy, which will be consumed by the final user MDP environment (in the recommendation component of the same RS) to generate high-quality content. We assume that this reward function can be formulated as a linear combination of unknown features describing the learning task. We developed a dedicated expert MDP environment that uses a personalized evaluation mechanism within the response model. This enables experts to evaluate videos and, at the same time, to capture the main patterns of quality and engagement features delivered during this process (which could help to efficiently learn the reward function). However, the expert demonstrations cannot always be optimal (i.e., some demonstrations may include bad content selections, and lead to a low average quality), which may scramble the learning process and make it difficult to converge to a stable reward function. To overcome this ambiguity, we use the Maximum Entropy (MaxEnt) principle with IRL [34] to select the best distribution over trajectories, leading to a maximum reward value. Finally, we use the value iteration [31] algorithm to optimize the policy under the recovered reward function, which will be exploited by a classifier to build our recommender agent for the end-user MDP environment. It is crucial to mention that our approach recommends items to the user according to the policy action (slate) learned on a similar expert state. This expert state is selected (based on a classification algorithm) for its similarity with the user state. 
Also, we do not discuss the problem of disagreement between experts, since they all maintain their own evaluations and the system converges to a unique optimal (expert) policy. Organization of the paper. In Section 2, we present related works and the background. In Section 3, we state preliminary definitions. In Section 4, we describe our proposed framework. In Section 5, we describe the design and setting of our experimentation. In Section 6, we describe and comment our experimental results. We conclude and discuss future works in Section 7. In this section, we discuss some important works around AL and IRL, and how they are (or can be) related to RS. Many works on AL have been applied in the robotic domain, and most of them consider some supervised learning techniques to learn a mapping from the states to the action space. However, such methods are not effective in highly dynamic environments, and can even lead the agent (apprentice) to choose catastrophic actions. Abbeel and Ng. [1] provided a short overview of this literature. In their paper, they supposed that an expert is trying to optimize for an unknown reward function, that can be expressed as a linear combination of known features in a finite MDP environment, in order to find a policy that performs as well as the expert. However, assuming that the reward function is a linear combination of hand-selected features may be unrealistic. Neu et al. [25] proposed another approach based on a gradient algorithm that combines the supervised learning versions of AL with Abbeel and Ng's method to learn a cost function by tuning the reward parameters, and using RL to find the optimal policy. This approach is demonstrated for unknown reward features, but both of them use IRL, which regularly calls RL algorithms to solve MDP problems. They further suppose finite MDP settings, which is far from convenient in large interactive systems, like many of today's RS. A few works studied the issue of learning a non-linear reward function using some probabilistic reasoning about stochastic expert behaviors with Gaussian processes. They managed to recover both the reward and the hyper-parameters of a kernel function that describes the structure of the reward [20] . In addition, many policies can be a solution for a given reward, and many reward functions can model a given set of demonstrations. To tackle this problem in small dynamic spaces, Ziebart et al. [34] proposed a maximum entropy probabilistic framework using IRL, which aims to choose a trajectory distribution with maximum entropy that matches the expected expert reward. This approach is further extended by Wulfmeier et al. [32] using a deep neural network to learn the unknown reward features. Here, the maximum entropy principle is used to optimize the weights of the neural network. IRL algorithms use the entire trajectories of the learned task instead of an independent state-action operation to learn a similar expert policy by maximizing an unknown reward function. Contrary to the approach used by Zhao et al. [33] , which considers a recommendation session as multiple independent state-action pairs, Chen et al. [7] used a Generative Adversarial Imitation Learning approach [13] to learn a user behavior model and the corresponding reward function, which are used to develop a combinatorial recommendation policy using RL. They considered that a sequence of state-action pairs is a whole trajectory, such that posterior actions could be influenced by prior actions. 
However, both approaches succeed in improving the user's long-term engagement, regardless of the quality of the content they provide. With an approach closer to our contribution, Massimo and Ricci [24] proposed a tourism recommender system that builds a user behavioral model based on the policies learned from clustering the users' trajectories, then solved an IRL problem to derive the reward function of each cluster. Their clustering approach could be seen as a projection of ours in a specific application domain using other tools, where the experts are exactly the same users that build "correct" trajectories by highlighting relevant points of interest from their own visits. Existing works so far rely on the behavior of regular users, or try to inject some "expert knowledge" (e.g., guidelines, good practices) into the system to improve the accuracy of predictions. For instance, some works [3, 8] used a small dataset of expert ratings collected from a reduced number of experts (identified by specific techniques like a "domain authority" reputation-like score) to effectively predict the ratings of a large population. These approaches, based on the principle of the "wisdom of the few", only ensure that recommendations are relevant to user preferences. To the best of our knowledge, this paper is the first to leverage the behavior of a set of selected experts to deliver beneficial, high-quality and personalized content.

RL definitions. An MDP is defined in forward reinforcement learning (RL) by the tuple $(S, A, T, D, R, \gamma)$, where $S$ is the state space and $A$ is the action space. $T : S \times A \times S \mapsto [0, 1]$ is the state transition function. $D$ is the initial-state distribution, from which the initial state $s_0 \in S$ is drawn. $R : S \times A \times S \mapsto \mathbb{R}$ is the reward function. $\gamma$ is the discount rate. A stationary stochastic policy (we simply say "policy" from now on) $\pi : S \times A \mapsto [0, 1]$ corresponds to a recommendation strategy which returns the probability of taking an action $a \in A$ given a user state $s \in S$ at timestamp $t$. A policy is called deterministic if it results in a single action for any state of $S$. Thus, for any policy $\pi$, the value of a state $s \in S$, with the initial state $s_0 \sim D$, is given by $V^{\pi}(s) = \mathbb{E}\big[\sum_{t \geq 0} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0 = s, \pi\big]$, where $(s_t, a_t)_{t \geq 0}$ is the sequence of state-action pairs generated by executing policy $\pi$. A policy $\pi^{\star}$ maximizing the value function, such that $V^{\pi^{\star}}(s) = \max_{\pi} V^{\pi}(s)$ for each $s \in S$, is called an optimal policy. In general, a typical interactive RL-based RS executes an action $A_t = \{i \in I\}$ by recommending a slate of items (e.g., videos, commercial products) to a user who provides a feedback $f_t \in F$ (e.g., skipping, clicking and other reactions) at the $t$-th interaction. Then, the RS recommends the next slate $A_{t+1}$, and so on until the user leaves the platform (end of session). Here, $I$ (resp. $F$) is the set of candidate items for recommendation (resp. the set of possible feedbacks). We model a user MDP state by $s_t = (s_0, A_1, f_1, \ldots, A_{t-1}, f_{t-1})$; then, a trajectory is represented by $\tau = (s_0, A_0, r_0, \ldots, s_t, A_t, r_t, \ldots)$, where $r_t \in \mathbb{R}$ is a reward associated with the user's feedback. In the case of forward RL, this reward is a function to be maximized by the recommender agent, which derives its optimal policy from it.

AL/IRL definitions. Algorithms for AL problems take as input a Markov decision process with an unknown reward function (MDP\R). In an expert MDP\R environment, we observe an expert trajectory as a sequence of state-action pairs $\tau_E = (s_0, a_0, \ldots, s_T, a_T)$. Here, the expert feedback set would contain particular reactions (evaluation features).
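To make the forward-RL notation above concrete, the following toy loop (a minimal sketch, not the paper's RecSim environment) generates one session trajectory of (state, slate, feedback, reward) tuples of the kind that would later serve as demonstrations in the MDP\R setting. All names and function bodies here are illustrative stand-ins.

```python
# Toy sketch of the slate-recommendation MDP loop: one session produces a
# trajectory of (state, slate, feedback, reward) tuples.
import random

NUM_ITEMS = 20      # candidate item ids: 0..19
SLATE_SIZE = 2      # size of each recommended slate
SESSION_LEN = 5     # number of interactions before the user leaves

def recommend_slate(state):
    """Stand-in policy pi(a|s): here, a uniformly random slate."""
    return random.sample(range(NUM_ITEMS), SLATE_SIZE)

def user_feedback(state, slate):
    """Stand-in user choice model: click one item of the slate or skip (None)."""
    return random.choice(slate + [None])

def step_reward(state, slate, feedback):
    """Stand-in reward, e.g. 1 for a click and 0 for a skip."""
    return 0.0 if feedback is None else 1.0

state = []                      # s_t summarizes past slates and feedbacks
trajectory = []                 # tau = (s_0, A_0, r_0, ..., s_t, A_t, r_t, ...)
for t in range(SESSION_LEN):
    slate = recommend_slate(state)          # action A_t
    fb = user_feedback(state, slate)        # feedback f_t
    r = step_reward(state, slate, fb)       # reward r_t
    trajectory.append((list(state), slate, fb, r))
    state = state + [(slate, fb)]           # next state s_{t+1}

print(trajectory)
```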
We assume that the expert behaves according to an optimal policy $\pi_E$, which is assumed to maximize an unknown reward function $R = R^*$ (typically, it is common to make some assumptions on the structure of the reward function, such as assuming a linear model). As discussed in the related works, there are many IRL algorithms and approaches [26, 34] that could be used to find the optimized reward function. This function is recovered by a joint iterative improvement and evaluation process using its associated policy, which is derived using some RL method.

As shown in Figure 1, the framework is composed of the AL/IRL component and the recommendation component. This framework is a centralized version of our approach, where the experts collaborate to learn the same policy. Furthermore, we suppose that the framework operates sequentially through three stages. The first stage concerns the trajectories generated by all experts participating in the system. The second stage concerns the learning of the expert policy by the MaxEnt model. The third stage concerns the recommendation process, using the expert classification model in a convenient user environment. In this section, we describe our framework in the context of video recommendation systems, but the approach remains valid for other types of instructive and informative recommended content.

Expert MDP\R environment. Following the AL/IRL component in Figure 1, the main objective of the expert environment is to capture the experts' behavior through their watching and evaluation process. The experts are assumed to be reliable and to act in a professional and ethical way. In this case, we developed a finite-state MDP\R environment to model the system behavior in response to experts' demonstrations. Then, for an expert session, the recommendation task is defined as the sequential interactions between a recommender system (agent) and the expert environment. We model it with an MDP\R environment $(S, A, T, D, \gamma)$, where:
• $S$ is the expert state space, which refers to her interest distribution over topics and the information about her watching history (e.g., watched videos and the interactions they involved: clicks, ratings using evaluation features, engagement rate, etc.). In our model framework, we propose a configurable evaluation model (see details in 5.4) in which we can define the quality metrics used to evaluate the total quality of each video. This model can be specified w.r.t. the nature of the service and the constraints of the system.
• $A$ is the set of actions, which contains all possible recommendation slates of size $k$. Basically, each action is a slate of videos that the system recommends to the expert. In a simple version, this slate can contain at least one item, or the null item $\bot$ when nothing is selected. Note that having only one recommended item per slate may not allow a useful expert interaction, since we rely on the expert's ability to distinguish relevant content among recommended items.
• $R$ is the reward function when the system takes an action $a$ from a state $s$. Basically, this reward is unknown, and it is influenced by (1) the expert choice model (when she selects a video from the slate) and (2) the feedback that she gives on this video during watch time.
The set of state transition probabilities $T$ and the initial-state distribution $D$ are determined by the environment models. Finally, let $\gamma \in [0, 1)$ be the discount factor.
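As a minimal sketch of how one step of such an expert session could be logged, consider the following; the field names and the [0, 1] scale of the evaluation features are assumptions of this sketch, not the exact FEBR data model.

```python
# Sketch of one expert interaction step in the expert MDP\R environment: the
# state keeps an interest distribution and a watch history, the action is a
# slate, and the response carries the evaluation features.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExpertResponse:
    clicked_video: Optional[int]     # id of the selected video, or None
    watch_time: float                # seconds watched
    pedagogy: float                  # evaluation features, assumed in [0, 1]
    accuracy: float
    importance: float
    entertainment: float
    engagement_rate: float

@dataclass
class ExpertState:
    interest: List[float]            # distribution over topics (sums to 1)
    history: List[ExpertResponse] = field(default_factory=list)

def step(state: ExpertState, slate: List[int], response: ExpertResponse) -> ExpertState:
    """Deterministic part of the transition: append the observed response.
    The reward of this step is *not* computed here; it is the unknown R*
    that the AL/IRL component recovers from such logged demonstrations."""
    return ExpertState(interest=state.interest, history=state.history + [response])
```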
The true reward function $R^*$ is unknown, but we assume that we can linearly define its structure as $R^*(s) = \theta^{\top} \phi(s)$, where $\phi : S \mapsto \mathbb{R}^d$ is a vector of features over the set of states, describing the expert's utility for visiting each state. This function is parameterized by some reward weights $\theta$. Then, given the path feature counts $f_\tau = \sum_{s \in \tau} \phi(s)$, the reward value of a trajectory $\tau$ is:
$$R^*(\tau) = \theta^{\top} f_\tau = \sum_{s \in \tau} \theta^{\top} \phi(s). \quad (1)$$
Given $m$ trajectories $\tilde{\tau}_i$ extracted from the expert's behavior, the expected empirical feature count is $\tilde{f} = \frac{1}{m} \sum_{i=1}^{m} f_{\tilde{\tau}_i}$. The idea here is to find a probability distribution $P$ over the entire class of possible trajectories that ensures the following equality:
$$\sum_{\tau} P(\tau) \, f_\tau = \tilde{f}. \quad (2)$$
Abbeel and Ng [1] demonstrated that matching the feature expectations (Equation 2) between an observed policy (of the expert) and a learner's behavior (of the RL agent) is both necessary and sufficient to achieve the same performance as this policy, when the expert is solving an MDP with a reward function linear in those features (Equation 1). However, the matching problem of Equation 2 can result in many reward functions that correspond to the same optimal policy, and many policies can lead to the same empirical feature counts. Therefore, this approach may be ambiguous, especially in the case of sub-optimal demonstrations, as in our case, where the expert's behavior is often not a perfect one (since she is performing an evaluation task where optimal performance is not always expected). In our case, a mixture of policies is required to satisfy Equation 2. In the context of our contribution, we would like to fix a distribution that results in the minimum bias over its paths. In other terms, we look for a stochastic policy that only considers the constrained feature matching property (Equation 2) and does not add any additional preferences on the set of paths (other than the ones implied by this constraint). Thus, Ziebart et al. [34] proposed to use the Maximum Entropy principle to resolve this problem. Under the selected distribution, trajectories yielding a higher total reward are exponentially more preferable, and the distribution is parameterized by the reward weights $\theta$:
$$P(\tau \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big(\theta^{\top} f_\tau\big).$$
It then turns out to be an optimization problem: we maximize the likelihood of observing the expert demonstrations under the maximum entropy distribution $P$,
$$\theta^* = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log P(\tilde{\tau}_i \mid \theta),$$
which can be solved by a gradient optimization method:
$$\nabla_{\theta} L(\theta) = \tilde{f} - \sum_{\tau} P(\tau \mid \theta) \, f_\tau = \tilde{f} - \sum_{s} D(s)\, \phi(s),$$
where $D(s)$ is called the expected state visitation frequency of state $s$, and represents the probability of being in such a state. Ziebart et al. [34] presented a dynamic programming algorithm to compute $D(s)$. Within the gradient optimization loop, we use a value iteration [31] RL algorithm to derive the expert policy whose reward function is learned and optimized by the IRL-MaxEnt model (a compact sketch of this loop is given at the end of this section). After the convergence of the IRL-MaxEnt algorithm, we obtain the expert optimal policy $\pi_E^*$.

The recommendation component presented in Figure 1 is designed to host the end-user environment. Basically, it exploits and evaluates the output of the AL/IRL component. We feed this component with the state dataset $D_E$ (presented in 5.5) and $\pi_E^*$ (learned by AL/IRL). The key point here is that we use $D_E$ and $\pi_E^*$ to build a classification model (introduced in 5.6), to be used as a RL agent in a standard MDP recommendation (user) environment with a utility function $R(s)$ (e.g., measures of quality, watch time, user long-term engagement, etc.). The choice and the design of the classification model is a parameter of the system, and could be further studied as an independent problem.
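The AL/IRL loop described above (a MaxEnt gradient over the reward weights, with value iteration as the inner RL solver) can be sketched as follows for a small tabular environment with known transitions. This is a hedged illustration rather than the FEBR implementation: the transition tensor and feature map are assumed given, and the classic MaxEnt formulation would use a stochastic (soft) policy for the visitation frequencies, whereas here a greedy value-iteration policy is plugged in, following the description above.

```python
# Hedged sketch of tabular MaxEnt IRL with a value-iteration inner loop.
import numpy as np

def value_iteration(P, r, gamma=0.5, eps=1e-4):
    """P: (S, A, S) transition tensor, r: (S,) state reward R*(s) = theta.phi(s).
    Returns a deterministic greedy policy as an (S,) array of action indices."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r[:, None] + gamma * P.dot(V)          # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return Q.argmax(axis=1)
        V = V_new

def expected_svf(P, policy, D0, T):
    """Expected state visitation frequencies D(s) over a horizon of T steps,
    starting from the initial-state distribution D0 and following `policy`."""
    S = P.shape[0]
    mu = np.zeros((T, S))
    mu[0] = D0
    for t in range(1, T):
        for s in range(S):
            mu[t] += mu[t - 1, s] * P[s, policy[s]]
    return mu.sum(axis=0)

def maxent_irl(P, phi, demos, D0, T, gamma=0.5, lr=0.05, iters=200):
    """phi: (S, d) state features; demos: list of state-index sequences.
    Returns the learned reward weights theta and the induced greedy policy."""
    S, d = phi.shape
    # Empirical feature expectations of the expert demonstrations (Eq. 2).
    f_emp = np.mean([phi[traj].sum(axis=0) for traj in demos], axis=0)
    theta = np.random.uniform(-0.1, 0.1, d)
    for _ in range(iters):
        r = phi.dot(theta)                          # R*(s) = theta . phi(s)  (Eq. 1)
        policy = value_iteration(P, r, gamma)       # inner RL solver
        D = expected_svf(P, policy, D0, T)          # expected state visitation freq.
        grad = f_emp - phi.T.dot(D)                 # gradient of the log-likelihood
        theta += lr * grad
    return theta, value_iteration(P, phi.dot(theta), gamma)
```

In the actual framework, the dynamics are given by the simulation environment and the feature map would be built from the expert state and response features, so this sketch only conveys the shape of the optimization.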
Note that, in this component, we use a known utility (reward) function, to avoid confusion with the AL/IRL methodology. In the proposed framework, the document model of such an environment must include a quality computation mechanism to rank items (videos) in the document corpus of each state (see 5.3 for more details). Furthermore, the rewards are used to construct a user history dataset, which can be exploited by the AL/IRL component to improve the evaluation process of videos. They can also be used by the user environment for content personalization, and to evaluate the performance of the recommendation process by building potential new metrics.

For the evaluation of our framework, we chose to perform simulated experiments (as previously done for several RL-based RS [6, 10, 23]), since most public datasets are irrelevant and not designed for evaluating multi-step user-recommender interactions (static datasets). Furthermore, live experiments would be very expensive. Thus, we used the RecSim [16] platform to develop the simulation environment of our framework, in the case of a video recommendation platform. RecSim provides the necessary tools and models to create RL simulation components for RS. We note that it is possible to re-implement our solution using another RL recommendation tool that respects the structure and the framework design of RecSim. In the following, we present an overview of the models used in our simulation, but we recommend referring to the implementation project and to the RecSim paper [16] for more details about each model (response, choice, transition, etc.) and the hypotheses that have been considered. Our source code is available here: https://github.com/FEBR-rec/ExpertDrivenRec

Expert environment. In the AL/IRL component of our framework, we developed a POMDP\R expert environment with an IRL agent to model the regular behavior of experts (selecting, watching and evaluating videos). This environment is used to generate trajectories with a variable number of steps (defined by the consumption of the session's time budget). At each time step $t$, the IRL agent observes the expert state and chooses the slate to recommend using a (possibly stochastic) policy, according to the unknown reward function. For generality, we do not specify the RL algorithm used by this agent to learn its policy, but we assume that it executes a stationary policy. Besides, we used a document model that implements a simple evaluation mechanism based on the quality features of a video: pedagogy, accuracy, importance and entertainment (see 5.4 for more details).

User environment. To evaluate the performance of $\pi_E^*$ in the recommendation component, we used the interest evolution environment provided by RecSim, which mainly implements the user simulation environment described in [15]. For this environment, we have to specify a reward function to be optimized by the agent. We specify this reward according to the evaluation metrics (defined in 5.8) chosen for the evaluation of our framework, which are quality and watch time. In addition, we propose a simple classification model (see 5.6) to recommend, for each user state, the best slate specified by the learned $\pi_E^*$ based on a similar expert state. Ideally, to build this classification model, we could have used a ML classification algorithm trained on a large $D_E$, but the size of our generated dataset $D_E$ is not sufficient to learn an accurate model.
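For reference, a minimal sketch of how the RecSim interest_evolution user environment can be instantiated and stepped is given below. The configuration keys follow the public RecSim examples and may differ across versions, and the fixed slate is a placeholder for the slate that FEBR's classifier would return.

```python
# Hedged sketch: set up RecSim's interest_evolution user environment and run
# one simulated session with placeholder slates.
from recsim.environments import interest_evolution

env_config = {
    'num_candidates': 5,       # size of the per-state candidate corpus
    'slate_size': 2,           # size of each recommended slate
    'resample_documents': True,
    'seed': 42,
}
env = interest_evolution.create_environment(env_config)

observation = env.reset()
total_watch_time, done = 0.0, False
while not done:
    slate = [0, 1]             # placeholder; FEBR would query its classifier here
    observation, reward, done, _ = env.step(slate)
    total_watch_time += reward # the default reward is an engagement (watch-time) signal
print(total_watch_time)
```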
For both the expert and user environments, we used the choice and quality models introduced in 5.2 and 5.3, respectively. In our experiments, we use the conditional logit model [22] as the main choice model for both the expert (AL/IRL component) and end-user (recommendation component) simulation environments. We use conditional choice functions of the form $p(i \mid A) = v(s, i) / \sum_{j \in A} v(s, j)$, in which an individual selects item $i$ from slate $A$ with an unnormalized probability $v(s, i)$, where $v$ is a function of the user-item feature vector $x_{s,i}$. The conditional logit (and multinomial logit) [22] model is a common instance of this general format, which effectively captures an individual's behavior within her interaction environments. Cascade choice models [17] are effective at capturing position bias through ordered lists of recommendations. Thus, cascade models could be interesting for experiments with a large slate size and a large video corpus. Overall, it is highly recommended to use the same choice model for both user and expert environments, to ensure the consistency of the system requests and the accuracy of the results.

We recall that our main objective is to build a system that recommends beneficial content. In general, we call $q(s)$ the quality of the clicked (and probably watched) videos related to a given state $s$. The closer $q(s)$ is to 1 (resp. $-1$), the better (resp. worse) the quality; 0 corresponds to a neutral quality. Typically, as for many practical RS, FEBR scores/ranks candidate videos using a DNN with both user/expert context and video features as input, by optimizing a combination of several objectives (e.g., clicks, expected engagement, satisfaction and other factors). These scores are often considered to describe the quality of videos in the RS. In the FEBR RS, we generally use the same technique to update the value of $q(s)$, which is initially set to follow a uniform distribution $U(-1, 1)$. However, $q(s)$ may be updated differently inside (1) the expert AL/IRL environment, when an expert provides evaluations after watching a video (see 5.4 for more details), and (2) the user environment of the recommendation component, when the user clicks a video that has already been evaluated by an expert. In the second case, $q(s)$ takes the value from this evaluation.

Similarly to an information retrieval system, a RS should help users to make quality searches in order to obtain relevant, high-quality recommendations. However, constructing efficient quality metrics for the evaluation of RS is still a very challenging problem. Within the expert environment, we propose a simple model to evaluate the quality of videos based on four quality criteria: Eval = {pedagogy, accuracy, importance, entertainment}. This is a non-exhaustive list, which could be modified and adapted depending on the evaluation procedure of the studied problem. These features are initially set to 0, and $q(s) \sim U(-1, 1)$ ($q(s)$ is the state quality introduced in 5.3). Then, when a video is evaluated by an expert (for instance, by adjusting the features related to a session's state with a dedicated expert Web interface), an average quality score $\bar{q}$ is calculated over the Eval features, and the video is marked as evaluated by this expert. In this case, we update $q(s)$ by adding $\alpha_e \times \bar{q}$, where $\alpha_e \in [0, 1]$ is the expert quality factor that represents her amount of expertise in the topic of the video (it is a metadata of the expert's profile).
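The general conditional choice format and the expert quality update described above can be sketched as follows. The exponentiated-utility form of the conditional logit, the arithmetic mean over the four Eval criteria, and the clipping of $q(s)$ to $[-1, 1]$ are assumptions of this sketch.

```python
import numpy as np

def conditional_choice(v):
    """P(i | slate) = v(s, i) / sum_j v(s, j) for unnormalized scores v > 0."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def conditional_logit(u):
    """Conditional/multinomial logit: the instance with v(s, i) = exp(u(s, i)),
    where u is a utility computed from the user-item feature vector."""
    return conditional_choice(np.exp(np.asarray(u, dtype=float)))

def expert_quality_update(q_s, eval_features, alpha_e):
    """Update the state quality q(s) after an expert evaluation:
    q(s) <- q(s) + alpha_e * mean(Eval), with alpha_e in [0, 1] the expert
    quality factor. Clipping to [-1, 1] keeps q(s) in its stated range."""
    q_bar = float(np.mean(eval_features))
    return float(np.clip(q_s + alpha_e * q_bar, -1.0, 1.0))

# Example: choice probabilities over a 3-item slate, then one expert update.
print(conditional_logit([0.6, -0.1, 0.3]))
q0 = float(np.random.uniform(-1, 1))               # q(s) ~ U(-1, 1) initially
print(expert_quality_update(q0, [0.8, 0.9, 0.7, 0.5], alpha_e=0.6))
```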
$D_E$ is constructed from expert trajectories, where each entry contains a full description of the interaction information generated by the expert environment. These entries have the following state-model form: [expert ID, expert state, response state, video state], where (1) the expert state contains the interest distribution vector over topics $u_e \in \mathbb{R}^n$, where $n$ is the number of topics; (2) the response state contains the expert behavior on the recommended slate (clicked video, watch time, values of the evaluation metrics, engagement rate, new estimated quality); and (3) the video state contains the videos of the corpus sampled for this expert environment state. For each video, we store its topic, length and quality.

Basically, the version of this classification model implemented in FEBR is a similarity search algorithm (Algorithm 1) using the Euclidean distance $d(x, y) = \lVert x - y \rVert$. This algorithm tries to match a user state model to her closest expert state model $e \in D_E$, based on the similarity between user/expert and corpus features. The user state is defined by the inputs $u$ and $C_u$, such that the vector $u$ describes the user interest distribution over topics and $C_u$ contains video descriptors (topic, length, score or latent quality). Then, for each $e$, we extract the same vectors as $u$ and $C_u$ for the case of experts, which are the variables $u_e$ and $C_e$. The inputs $th_1$ and $th_2$ are respectively the interest margin and the corpus margin. The values chosen for $th_1$ and $th_2$ depend on the size of $D_E$ (smaller values make the classifier more accurate, but highly selective). When an expert state $e \in D_E$ is determined to match the user inputs, the underlying action (the output slate) learned by the expert policy $\pi_E^*$ is proposed; a sketch of this matching procedure is given at the end of the experimental setup below. Otherwise, a random slate from the current user corpus is proposed.

To evaluate the performance of FEBR, we consider the quality $q(s)$ of recommended videos that have been chosen by users. As explained in 5.3, the recommendation quality of most (if not all) existing approaches is essentially measured by how much the user appreciates their recommendations, and how successful they are in keeping the user engaged with the content of their system. On the contrary, with our solution, we first try to ensure the recommendation of correct, "beneficial" content, and then to ensure that this content is as personalized as possible. To better compare our approach (that we now call RecFEBR) with other methods, we make the following (conservative) assumption.

Assumption. We consider that the notion of quality used by the other baseline methods (which do not use expert evaluations, namely RecFSQ, RecPCTR and RecBandit) is the same as for RecFEBR, even if their measuring models are totally different.

We compare RecFEBR to:
• RecFSQ, a standard, full-slate, Q-learning based approach whose RL algorithm recommends slates using a deep Q-network (DQN) agent. This algorithm converges quickly, in systems of small dimensions, to a policy that offers a high average quality of recommendations compared to the SlateQ decomposed method [15].
• RecPCTR, which implements a myopic greedy agent that recommends slates with the highest-pCTR (predicted click-through rate) items. Basically, this agent receives observations of the true user and document states, because it assumes knowledge of the true underlying choice model.
• RecBandit, which uses a bandit algorithm to recommend items with the highest UCBs of topic affinities [4]. This agent exploits observations of the user's past responses for each topic, without assuming any knowledge of the related user affinities. Within the same best topic, the agent picks documents with high quality scores.
• RecNaive, which is based on a random agent [15] that recommends random videos, from the corpus of videos with high expert ratings, that best match the current user context.

We emphasize that the approaches RecFSQ, RecPCTR and RecBandit do not consider quality values assigned by experts. Instead, they consider an inherent quality feature that represents the topic-independent attractiveness to the average user. They thus reflect the general performance of popular RL-based RS using different techniques and powerful RL algorithms. They essentially try to maximize a reward function (using quality and watch time as engagement metrics), which leads to optimized learned actions. RecNaive is useful to test the efficiency of our approach for personalizing high-quality content for each user. Note that all approaches, including ours (RecFEBR), use the same user environment, which is introduced in Section 5.1.

We first investigate the efficiency of our solution for recommending positive content. Therefore, we analyze the quality of watched videos that are recommended based on the learned expert policy $\pi_E^*$. We define $S_E$ as the set of user states in which the system recommends a slate following $\pi_E^*$ (states that are successfully matched by the classification model). For the comparison with baseline approaches, we propose to evaluate the performance of FEBR by measuring the average quality of the watched videos, as well as the total watch time, as an engagement metric showing how much the user was interested in the proposed content. Then, for a full user session $S$ (where $S_E \subseteq S$), we use the following evaluation metrics:
• Average expert-based quality $Q_E = \frac{1}{|S|} \sum_{s \in S_E} q(s)$: the contribution of the expert-driven recommendations (states of $S_E$) to the average quality $Q$ of the videos watched during the session, so that $Q = Q_E + Q'$, where $Q'$ is the contribution of the remaining (arbitrarily recommended) states.
• Total watch time $W$: the cumulative time the user spends watching the videos she clicks during the session.

We consider a global system of 8 categories of video topics. We assume that each video belongs to a single category (note that this is a simplification for the experiments: it is still possible to consider the assignment to many categories). The simulation is performed upon a large set of videos (order of magnitude $10^5$); for each state, the system retrieves a small candidate corpus of size 5 that best matches the user context, then recommends a slate of size 2 (small values are chosen because of technical constraints).

AL/IRL training. We consider a sub-system of 10 experts. For each expert, we generate 100 trajectories, each one with at most 20 steps. We fix $\gamma = 0.5$ for the value iteration RL algorithm. In such a complex system, determining the best value of $\gamma$ is not obvious, since it is further constrained by the performance of a learned policy based on unknown (and insignificant) rewards. Basically, values close to 1 would result in better learning, reducing the number of iterations required to optimize sub-optimal policies; but they would also result in a much higher convergence time. For the IRL-MaxEnt algorithm, we set the number of training iterations to 10000. This simulation runs on CPU, using a server machine equipped with an Intel Xeon Gold 6152 (2.1 GHz) and 384 GB of DDR4 RAM.

Recommendation component. Following the description of the classification model in 5.6, we set $th_1 = 0.5$ and $th_2 = 0.1$. We recommend this setting for a generated dataset $D_E$ of less than 50000 states (the larger it gets, the better it is to use small values). We simulate 3000 independent user sessions. We run this simulation on a desktop computer equipped with an Intel i7-8550U CPU, a GTX 720ti GPU, and 8 GB of DDR4 RAM.
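A minimal sketch of the similarity-search classifier (Algorithm 1) with the setting above is given below. The way the two distances are combined to break ties between matching expert states, and the exact layout of the entries of $D_E$, are assumptions of this sketch.

```python
# Minimal sketch of the similarity-search classifier (Algorithm 1), using the
# Euclidean distance and the margins th1 (interest) and th2 (corpus).
import numpy as np

def match_expert_state(u, C_u, expert_dataset, pi_star, th1=0.5, th2=0.1):
    """Return the slate learned by pi_E^* for the closest matching expert state,
    or None if no expert state lies within both margins.

    u:   user interest distribution over topics, shape (n_topics,)
    C_u: user corpus features (e.g., topic/length/quality per video), shape (m, k)
    expert_dataset: iterable of dicts with keys 'u_e', 'C_e', 'state_id' (D_E)
    pi_star: mapping from an expert state_id to its learned slate (action)
    """
    best_id, best_dist = None, np.inf
    for entry in expert_dataset:
        d_interest = np.linalg.norm(np.asarray(u) - np.asarray(entry['u_e']))
        d_corpus = np.linalg.norm(np.asarray(C_u) - np.asarray(entry['C_e']))
        if d_interest <= th1 and d_corpus <= th2:
            # Tie-breaking rule (assumed): keep the smallest combined distance.
            if d_interest + d_corpus < best_dist:
                best_id, best_dist = entry['state_id'], d_interest + d_corpus
    return None if best_id is None else pi_star[best_id]
```

When the function returns None, the recommendation component falls back to a random slate from the current user corpus, as described in 5.6.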
In this section, we discuss the efficiency of our approach by analyzing the quality metrics measured by the user simulation environment (recommendation component). Then, we evaluate our framework and compare it to the baseline approaches introduced in Section 5.7.

Figure 2 shows that FEBR efficiently leads to positive quality consumption ($Q$) when following $\pi_E^*$. More precisely, the plot presents the average total quality $Q$ achieved over all watched videos for a given session. If the classification model succeeds in matching a user's state to its closest expert's state (i.e., $s \in S_E$), then the recommended slate will be the underlying expert slate fixed by $\pi_E^*$. This results in the average expert-based quality $Q_E$. Otherwise, for unsuccessful matching operations, the framework recommends arbitrary slates (with random quality/score) from the current user corpus. Therefore, we can write $Q = Q_E + Q'$, such that $Q'$ is the quality achieved through those arbitrary recommendations. We notice that $Q_E$ always results in increasing positive values. Thus, arbitrary recommendations are likely responsible for providing content of low quality ($Q' < 0$) that decreases the value of $Q$. The values of $Q_E$ show that this simulation setting indeed produced a good expert-like policy $\pi_E^*$. On the other hand, the values of $Q'$ reflect some weakness of the recommendation component. This weakness could be caused by many factors, such as the small size of $D_E$, an inaccurate classification model, and the cold start problem. Nevertheless, these problems remain inevitable in large and dynamic systems.

In Figure 4, we notice that our approach provides the highest value of $Q$ compared to the baseline approaches, which achieve lower quality values. In addition, Figure 3 shows that this significant gain in quality does not come at the cost of losing user engagement, since RecFEBR reaches almost the same total watch time as the best baselines (RecFSQ and RecBandit). Thus, we can confirm the three following points.

First, FEBR can effectively guide users towards positive quality content by proposing relevant and beneficial recommendations matching their preferences and tastes. The approach generally succeeds (through its AL/IRL component) in learning an optimal expert policy $\pi_E^*$. It also manages (through its recommendation strategy) to take advantage of $\pi_E^*$, and to associate a considerable number of user states to their closest similar expert states, based on their common observable features, using the proposed classification model. Thus, regardless of the values of $Q$, which highly depend on the models used by the simulation environments, we can conclude that this simulation instance proves the efficiency of our approach in adapting IRL to capture an expert behavior (despite the complexity and the ambiguity it involves) for personalized recommendations, ensuring a significant (positive) impact on $Q$. Moreover, we highlight the important role of the classification model: an accurate model trained on large datasets (built by a large number of experts and containing a high number of states) could be more efficient and significantly improve performance.

Second, although RecNaive recommends evaluated videos, this approach would require an important number of evaluated videos (at the scale of the system size) to be injected in order to (hopefully) improve the quality of watched videos.
However, even if that happens (which is very unlikely in large systems like YouTube), the random agent may fail (or likely take a long time) to learn a policy capable of delivering high-quality and personalized content. Moreover, this naive exploitation of these videos, with continuous exploration by the agent (because of its random actions), will often recommend irrelevant content. Thus, regardless of the higher quality that these recommendations can ensure in some successful scenarios, users will often skip recommended videos and interrupt watching, as shown by the value of $W$ in Figure 3.

Third, as we can see in Figure 4, optimizing the quality using RecFSQ, RecPCTR and RecBandit cannot ensure a sufficient exposure to content of high quality. The low values of $Q$ achieved by these RL methods (compared to our approach) can be explained by the tendency of the RS to act excessively under user engagement objectives. This behavior (contrary to what may seem desirable) could result in unprofitable actions (or even worse) by the RL agent, again and again. This may happen because the state quality is measured based on system engagement metrics, which incentivizes agents to prefer personalization over a high quality of recommendation. In this context, Figure 4 shows the good personalization results of these methods, except for RecPCTR, which produces a relatively low value; this result could be explained by its myopic nature.

In this paper, we proposed FEBR, a configurable AL/IRL-based framework for quality content recommendations leveraging expert evaluations. We developed a MDP environment that exploits the experts' knowledge and their personal preferences to derive an optimal policy that maximizes an unknown reward function, using the MaxEnt IRL model. We then used this policy to build our recommender agent (the classification model), which matches user and expert states and recommends the best related action. Experiments on video recommendation (using a simulated RL-based RS) show that expert-driven recommendations using our approach could be an efficient solution to control the quality of the recommended content, while keeping a high user engagement rate.

For future work, an important challenge would be to generalize this approach to systems of large dimension. Besides, one could extend this work to study the case of unreliable experts who tend to manipulate the system for malicious or personal purposes. It would also be interesting to mitigate the effects of the cold start problem in the end-user recommendation process, which indirectly affects recommendations in situations of poor state matching by the classifier (which may result in beneficial but non-personalized recommendations). It may also be interesting to develop a decentralized version of the framework, using deep neural network approaches to better learn the expert policy (with fewer constraints on the reward function), enabling a better scalability for systems of high dimensions. This may also bring more insights for running similar experiments in real-life conditions.
References
Apprenticeship learning via inverse reinforcement learning
Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions
The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web (SIGIR '09)
Finite-Time Analysis of the Multiarmed Bandit Problem
The Four Rs of Responsibility, Part 2: Raising authoritative content and reducing borderline content and harmful misinformation
When Policies Are Better than Plans: Decision-Theoretic Planning of Recommendation Sequences (IUI '01)
Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
Collaborative Filtering Using Dual Information Sources
Towards Conversational Recommender Systems
An Interactive Recommender System Based on Reinforcement Learning for Improving Emotional Competences in Educational Groups
Deep Natural Language Processing for Search and Recommender Systems
Context Adaptation in Interactive Recommender Systems
Generative Adversarial Imitation Learning
SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
RecSim: A Configurable Simulation Platform for Recommender Systems
Cascading Bandits: Learning to Rank in the Cascade Model
WeBuildAI: Participatory Framework for Algorithmic Governance (Siheon Lee, Alexandros Psomas, and Ariel D. Procaccia, 2019; Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 181)
Conversational Recommendation: Formulation, Methods, and Evaluation
Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Towards Deep Conversational Recommendations
Stated Choice Methods: Analysis and Application
Learning and Adaptivity in Interactive Recommender Systems
Harnessing a Generalised User Behaviour Model for Next-POI Recommendation
Apprenticeship Learning Using Inverse Reinforcement Learning and Gradient Methods (UAI '07)
Algorithms for Inverse Reinforcement Learning
Preference Elicitation Strategy for Conversational Recommender System
Recommender systems: An overview of different approaches to recommendations
Why Am I Seeing This? How Video and E-Commerce Platforms Use Recommendation Systems to Shape User Experiences (case study)
Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions
Reinforcement Learning: An Introduction
Deep Inverse Reinforcement Learning
Toward Simulating Environments in Reinforcement Learning Based Recommendations
Maximum Entropy Inverse Reinforcement Learning
Reinforcement Learning to Optimize Long-Term User Engagement in Recommender Systems (KDD '19)
Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation