key: cord-0163194-at8j71rb
authors: Antaris, Stefanos; Rafailidis, Dimitrios; Arriaza, Romina
title: Multi-Task Learning for User Engagement and Adoption in Live Video Streaming Events
date: 2021-06-18
journal: nan
DOI: nan
sha: 8da29d273f9e9d605d3c68d21e03ebb42846d116
doc_id: 163194
cord_uid: at8j71rb

Nowadays, live video streaming events have become a mainstay of viewers' communication in large international enterprises. Provided that viewers are distributed worldwide, the main challenge lies in how to schedule the optimal time of an event so as to improve both the viewers' engagement and adoption. In this paper we present a multi-task deep reinforcement learning model to select the time of a live video streaming event, aiming to optimize the viewers' engagement and adoption at the same time. We consider the engagement and adoption of the viewers as independent tasks and formulate a unified loss function to learn a common policy. In addition, we account for the fact that each task might have a different contribution to the training strategy of the agent. Therefore, to determine the contribution of each task to the agent's training, we design a Transformer's architecture for the state-action transitions of each task. We evaluate our proposed model on four real-world datasets, generated by the live video streaming events of four large enterprises spanning from January 2019 until March 2021. Our experiments demonstrate the effectiveness of the proposed model when compared with several state-of-the-art strategies. For reproduction purposes, our evaluation datasets and implementation are publicly available at https://github.com/stefanosantaris/merlin.

Large enterprises rely on distributed live video streaming solutions to avoid network congestion and distribute the streaming video to viewers [4]. Although distributed solutions ensure that every viewer can attend the event, an erroneously scheduled time of an event negatively affects the viewer's engagement, that is, the percentage of the event's duration that a viewer attends [1]. In practice, viewers attend only part of an event when it is scheduled at a non-preferred time, e.g., day and hour, resulting in low viewer engagement. Moreover, an erroneously scheduled time reduces the number of the enterprise's events that each viewer participates in, which is reflected in the viewer's adoption. In particular, viewers spread across several time zones exhibit low adoption when the events are organized without accounting for the viewers' availability. Instead of manually organizing the events, it is therefore important for enterprises to develop a mechanism that learns how to schedule an event on the day and hour that optimizes both the viewer's engagement and adoption. To organize an event, enterprises interact with a centralized agent that is located in a company offering the live video streaming solution. However, current streaming solutions do not account for the optimal selection of the time of the next event.

To overcome the shortcomings of current live video streaming solutions, in this study we follow a reinforcement learning strategy and design an agent that receives the viewer's engagement and adoption as two different reward signals for the selection of the event's time. Reinforcement learning has proven to be an efficient means of optimizing a reward signal in various domains such as robotics [18, 28], games [19, 27], recommendation systems [14, 26], and so on. However, such approaches train an agent on a single task, where the learned policy maximizes a single cumulative reward.
Nonetheless, the goal of the agent in our event time selection problem is to optimize both the viewer's engagement and adoption rewards. Recently, multi-task reinforcement learning approaches have been proposed to generate a single agent that learns a policy which optimizes multiple tasks, with each task corresponding to a different reward signal [8, 11, 23]. State-of-the-art approaches train an agent by sharing knowledge among similar tasks [25]. For example, the attentive multi-task deep reinforcement learning (AMT) model [5] exploits a soft-attention mechanism to train a single agent on tasks that follow different distributions in the reward signal. However, AMT transfers knowledge among similar tasks, while isolating dissimilar tasks during the agent's training. This means that AMT achieves sub-optimal performance when the tasks have completely different characteristics, as is the case for live video streaming events. For instance, as we demonstrate in Section 2, the viewers exhibit low engagement over time, whereas the viewers' adoption increases across consecutive events. In addition, to efficiently select the event's time, the agent has to capture the evolution of the viewer's engagement and adoption. Towards this aim, the Transformer's architecture has emerged as a state-of-the-art learning model across a wide variety of evolving tasks [24]. For example, in [17] the Transformer's architecture has been exploited in a reinforcement learning strategy to provide memory to the agent by preserving the sequence of past observations. However, baseline approaches based on the Transformer's architecture have not been studied for multi-task reinforcement learning problems.

To address the shortcomings of state-of-the-art strategies, in this study we propose a Multi-task lEaRning model for user engagement and adoption in Live vIdeo streamiNg events (MERLIN), making the following contributions:

- We formulate the viewer's engagement and adoption tasks as different Markov Decision Processes (MDPs) and propose a multi-task reinforcement learning strategy to train an agent that selects the optimal time, that is, the day and hour of the enterprise's next event, aiming to maximize both tasks.
- We design a Transformer's architecture to weigh the importance of each task during the training of the agent, that is, to determine the contribution of each task to the learning strategy of the agent's policy.
- We transfer knowledge among tasks through a joint loss function in a multi-task learner component and compute a common policy that optimizes both the viewer's engagement and adoption in a live video streaming event.

Our experimental evaluation on four real-world datasets with live video streaming events shows the superiority of the proposed MERLIN model over baseline multi-task reinforcement learning strategies. The remainder of this paper is organized as follows. In Section 2 we present the main characteristics of the live video streaming events as well as the evolution of the viewer's engagement and adoption. In Section 3 we formally define the multi-task problem of scheduling live video streaming events and detail the proposed MERLIN model. Then, in Section 4 we present the experimental evaluation of our model against baseline strategies, and we conclude the study in Section 5.

We collected four real-world datasets with all the events that occurred in four large enterprises worldwide from January 2019 until March 2021.
The video streaming solution of the events was supported by our company. We monitored a set E of live video streaming events, where for each event e_t ∈ E on date t the viewers reported their time zones, as well as their joining and leaving times during the event, to a backend server of our company. The datasets were anonymized and made publicly available.

In Table 1 we summarize the statistics of the four evaluation datasets. Each enterprise has a different number of viewers, located in several countries around the world with different time zones. We observe that the viewers in Enterprise 1 are distributed across fewer time zones than those of the other enterprises, whereas Enterprise 4 hosts the largest number of live video streaming events with approximately 0.5M viewers in total.

In Figure 1 we present the average viewer engagement in the live video streaming events over the examined time span. We define the average engagement u_t of the viewers that participated in the event e_t ∈ E on the date t as follows:

$$ u_t = \frac{1}{n} \sum_{i=1}^{n} \frac{k_i}{m}, \quad (1) $$

where n is the number of viewers that participated in the event e_t, k_i is each viewer's attendance time and m is the duration of the event. In all enterprises the viewers have low engagement, that is, the viewers attended less than half of the duration of each live video streaming event, with average viewer engagement u_t < 0.5 (Table 1). In addition, the average viewer adoption expresses how many events the viewers attended until a date t, where large adoption scores indicate that viewers were willing to participate in the enterprise's previous events. We formally define the average adoption v_t as follows:

$$ v_t = \frac{1}{n} \sum_{i=1}^{n} c_i, \quad (2) $$

where c_i is the number of events that each viewer i attended prior to the event e_t. We observe that the viewers in Enterprise 1 adopted fewer events than those of the other enterprises, with average adoption v_t = 1.275. On one of the last dates Enterprise 1 organized an all-hands event to which all the viewers were invited, which explains the peak of the adoption score for Enterprise 1 in Figure 1. The adoption scores for Enterprises 2, 3 and 4 increase over time in the last year, as the enterprises started to organize more events than in previous years for viewers, most of whom worked from home due to the COVID-19 pandemic.
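To make the two statistics concrete, the snippet below is a minimal sketch of how the average engagement and adoption of Equations 1 and 2 can be computed from per-viewer attendance logs; the record fields and function names are illustrative and are not taken from the released implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ViewerRecord:
    # Hypothetical per-viewer log entry for a single event e_t.
    attendance_minutes: float  # k_i: time the viewer attended the event
    past_events: int           # c_i: events the viewer attended before e_t

def average_engagement(viewers: List[ViewerRecord], duration_minutes: float) -> float:
    """Equation 1: u_t = (1/n) * sum_i k_i / m."""
    return sum(v.attendance_minutes / duration_minutes for v in viewers) / len(viewers)

def average_adoption(viewers: List[ViewerRecord]) -> float:
    """Equation 2: v_t = (1/n) * sum_i c_i."""
    return sum(v.past_events for v in viewers) / len(viewers)

# Example: three viewers of a 60-minute event.
viewers = [ViewerRecord(30, 1), ViewerRecord(45, 3), ViewerRecord(10, 0)]
u_t = average_engagement(viewers, duration_minutes=60)  # ~0.47, i.e., low engagement
v_t = average_adoption(viewers)                         # ~1.33 previously attended events
```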
An enterprise organizes T = |E| events, where each event on a date/step t is defined as e_t = (h, n, m, u_t, v_t, z), with h being a timestamp that corresponds to the event's day and hour. Notice that a date/step t has 24 different timestamps h, and an event e_t has a duration of m minutes with n viewers. The viewers attend the event from different time zones, which are represented as a one-hot vector z ∈ R^{d_z}, where d_z is the number of different time zones of the viewers. The goal of the enterprise is to organize each event e_t ∈ E at the timestamp h that maximizes the average engagement u_t and adoption v_t of the viewers.

We formulate the scheduling of the next event as a Markov Decision Process (MDP), where the agent interacts with the environment/enterprise by selecting the timestamp h of the next event e_{t+1} and maximizing the cumulative rewards. In particular, we define the MDP of the live video streaming event as follows [21]:

Definition 1 (Live Video Streaming Event MDP). At each step t = 1, ..., T, the agent interacts with the environment and selects an action a_t ∈ A. An action a_t corresponds to the selection of the timestamp h of the next event e_{t+1} based on the state s_t ∈ S of the enterprise. We define the state s_t of the enterprise as the sequence of the l previous events, s_t = {e_{t-l}, ..., e_t}. The agent receives a reward r(s_t, a_t, s_{t+1}) ∈ R for selecting the action a_t ∈ A in state s_t ∈ S, considering that the enterprise transitions to state s_{t+1} with probability p(s_{t+1}|s_t, a_t) ∈ P. The goal of the agent is to find the optimal policy π_θ : S × A → R, where θ is the set of policy parameters, assigning a probability π_θ(a_t|s_t) of selecting an action a_t ∈ A given a state s_t ∈ S. Having computed the policy π_θ, the agent maximizes the expected discounted cumulative reward $\max \mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t, s_{t+1}) \mid \pi_\theta\big]$, with γ ∈ [0, 1] being the discount factor.

In our model, we focus on training a common agent that optimizes both the viewer's engagement u_t and adoption v_t. As mentioned in Section 2, the viewers' engagement and adoption behaviors vary over time. Therefore, we first consider the viewer's engagement and adoption as independent tasks, and then train a common agent to optimize the cumulative rewards of both tasks at the same time. We define the multi-task Reinforcement Learning (RL) problem in live video streaming events as follows [5, 6, 8, 11]:

Multi-Task RL in Live Video Streaming. In the multi-task RL problem for live video streaming events, we consider a set of tasks T, that is, the engagement and adoption tasks with |T| = 2. We formulate each task τ ∈ T as a different MDP, where the tasks share the same state space S and action space A but have different sets of rewards R. For the engagement task we compute the reward r(s_t, a_t) as the average engagement u_t in Equation 1, and for the adoption task the reward corresponds to the average adoption v_t in Equation 2 at the t-th step. The goal of the agent is to learn a common policy π_θ that solves each task τ ∈ T by maximizing the expected return $\max \mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r(s_t^\tau, a_t^\tau) \mid \pi_\theta\big]$ for both tasks, where s_t^τ is the state of the agent and a_t^τ is the action taken by the agent for the task τ at the t-th step.

As illustrated in Figure 2, the proposed MERLIN model consists of three main components: the policy, task importance and multi-task learner components. The goal of MERLIN is to compute a common policy π_θ that maximizes the future rewards of the viewer's engagement and adoption tasks τ ∈ T.

- Policy Component. The role of the policy component is to compute the action a_t^τ of both tasks. During training, the agent interacts with two environments in the enterprise, that is, the two different tasks τ ∈ T. The input of the policy component is the l previous events {e_{t-l}^τ, ..., e_t^τ} of each task. We implement a shared state representation module to compute the state s_t^τ of task τ. In our architecture, we design two respective actors to generate the actions a_t^τ for the engagement and adoption tasks [8]. Then, the state-action transitions generated by both actors are stored in a replay buffer of size l_b to train the common agent.

- Task Importance Component. The task importance component determines the contribution of each task to the learning process of the agent. Notice that state-of-the-art RL strategies are designed to learn a policy of a single agent that optimizes similar tasks, ignoring the information of each task's state-action transitions [25]. Instead, in the proposed MERLIN model, to account for the impact of each state-action transition on the policy π_θ, we consider the encoder model of the Transformer's architecture for the state-action transition sequences.
In doing so, we capture the information of the state-action transitions of both the engagement and adoption actors over time [15, 17]. In addition, the task importance component computes a weight matrix M ∈ R^{l_b × |T|}, which reflects the contribution of each actor to the learning process of the policy π_θ.

- Multi-Task Learner Component. The role of the multi-task learner component is to optimize the policy π_θ based on the l_b state-action transitions stored in the replay buffer. Provided the stored state-action transitions in the replay buffer and the weight matrix M of the task importance component, the multi-task learner updates the policy parameters through a joint loss function L_policy, and the parameters of the task importance component via the loss function L_learner, following the temporal-difference learning strategy [21]. In particular, matrix M first weighs the state-action transitions in the replay buffer, and then the multi-task learner optimizes the joint loss function L_policy to compute the parameters of the policy component. In addition, the multi-task learner learns its own parameters via the joint loss function L_learner, and updates the parameters of the task importance component accordingly.

At each step t = 1, ..., T, the policy component takes as input the l previous events {e_{t-l}^τ, ..., e_t^τ} of each task τ ∈ T. The goal of the policy component is to learn a policy π_θ that solves each task τ. Provided that the engagement and adoption tasks have the same state space S and action space A, the policy component consists of a shared state representation module and two actors, that is, the engagement and adoption actors.

- State Representation Module. The state representation module takes as input the l previous events {e_{t-l}^τ, ..., e_t^τ} and generates the state s_t^τ of each task τ at the t-th step. We represent each event e_t^τ as a d_x-dimensional vector x_t^τ ∈ R^{d_x} by concatenating the event's features, x_t^τ = Concat(h, n, m, u_t, v_t, z). Given the representations {x_{t-l}^τ, ..., x_t^τ} of the l previous events, we compute the d_s-dimensional state representation vector s_t^τ ∈ R^{d_s} with a Time-LSTM [29, 30] as follows:

$$ s_t^\tau = \xi(x_{t-l}^\tau, \ldots, x_t^\tau; w), \quad (3) $$

where w are the trainable parameters of the Time-LSTM function ξ(·) [29]. Notice that Time-LSTM models the time difference ∆(t) between the event e_t^τ and the previous event e_{t-1}^τ as follows:

$$
\begin{aligned}
g_t &= \sigma\big(W_g x_t^\tau + W_{\Delta}\,\Delta(t) + b_g\big),\\
q_t &= f_t \odot q_{t-1} + i_t \odot g_t \odot \tanh\big(W_q x_t^\tau + b_q\big),\\
s_t^\tau &= o_t \odot \tanh(q_t),
\end{aligned} \qquad (4)
$$

where g_t is the time-dependent gate influencing the memory cell and the output gate o_t, q_t is the memory cell of the LSTM, and f_t and i_t are the forget and input gates, respectively [10, 30]. The symbol ⊙ represents the Hadamard element-wise product and σ(·) is the sigmoid function. The different weight matrices W_* in Equation 4 transform the event embedding x_t^τ and the time difference ∆(t) to the d_s-dimensional latent space, and b_* are the respective bias terms. Notice that the time difference ∆(t) is important to capture the similarity among consecutive events in the state s_t^τ. Provided that the engagement and adoption of the viewers vary over time, our goal is to capture the most recent viewer behaviour in the state s_t^τ. Therefore, the Time-LSTM in Equation 4 tends to forget events with a high time difference and focuses on the most recent events.
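To illustrate the state representation module, the following is a simplified time-gated LSTM cell in PyTorch, in the spirit of Time-LSTM [29] as described above. It is a sketch rather than the exact formulation of [29] or of MERLIN; for instance, the output gate here ignores ∆(t), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TimeGatedLSTMCell(nn.Module):
    """Simplified time-aware LSTM cell: a time gate g_t, driven by the gap
    delta_t between consecutive events, controls how much of the new event
    enters the memory cell, so stale events are forgotten faster."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.x2h = nn.Linear(input_dim, 4 * hidden_dim)
        self.h2h = nn.Linear(hidden_dim, 4 * hidden_dim)
        self.time_gate = nn.Linear(1, hidden_dim)  # maps delta_t to a gate

    def forward(self, x_t, delta_t, h_prev, q_prev):
        i, f, o, c_hat = (self.x2h(x_t) + self.h2h(h_prev)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.sigmoid(self.time_gate(delta_t))  # time-dependent gate g_t
        q = f * q_prev + i * g * torch.tanh(c_hat)  # memory cell q_t
        h = o * torch.tanh(q)                       # state representation s_t
        return h, q

# Usage: fold the l previous event embeddings and their time gaps into s_t.
cell = TimeGatedLSTMCell(input_dim=32, hidden_dim=128)
h = q = torch.zeros(1, 128)
for x_t, dt in zip(torch.randn(10, 1, 32), torch.rand(10, 1, 1)):
    h, q = cell(x_t, dt, h, q)
state = h  # s_t, shared by the engagement and adoption actors
```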
- Engagement and Adoption Actors. The engagement and adoption actors take as input the state s_t^τ of each task τ ∈ T. The state representation s_t^τ captures the evolution of the enterprise over time. Given the state s_t^τ and the policy π_θ, each actor computes a d_a-dimensional action vector a_t^τ ∈ R^{d_a}, where d_a is the number of all possible timestamps. Each dimension of the action vector a_t^τ corresponds to the probability of selecting the respective timestamp h for the next event e_{t+1}. We implement a two-layer perceptron (MLP) to transform the state vector s_t^τ ∈ R^{d_s} into the action vector a_t^τ ∈ R^{d_a} as follows:

$$ a_t^\tau = \mathrm{MLP}(s_t^\tau; \theta), \quad (5) $$

where θ are the trainable parameters of the MLP, that is, the policy parameters of the agent. Given the action vector a_t^τ of each actor, we normalize it with the softmax function and select the action with the highest value, using the ε-greedy exploration technique [21]. The generated state-action transitions are stored in the replay buffer to learn the optimal policy π_θ based on the past experiences of each task.

The goal of the task importance component is to determine the contribution of each task to the learning strategy of the policy π_θ. The input of the task importance component is the set of state-action transitions stored in the replay buffer by the engagement and adoption actors. At each step t = 1, ..., T, the engagement and adoption actors store in the replay buffer the respective state-action transition (s_t^τ, a_t^τ) of the task τ ∈ T. Having stored the l_b state-action transitions of each task τ in the replay buffer, the task importance component computes the similarity among the tasks. As the replay buffer contains a sequence of state-action transitions, we employ the encoder of the Transformer model to map the l_b states of each task into d_y-dimensional vectors Y^τ ∈ R^{l_b × d_y} [24]. To overcome any stability problems that might occur at the early stages of training, we adopt the Gated Transformer-XL (GTrXL) variant of the Transformer's architecture as follows [17]:

$$ Y^\tau = \psi(s_{t-l_b}^\tau, \ldots, s_t^\tau; \eta), \quad (6) $$

where {s_{t-l_b}^τ, ..., s_t^τ} is the sequence of states of the task τ stored in the replay buffer, and η denotes the trainable weights of the GTrXL function ψ(·) [17]. By computing the d_y-dimensional vectors, that is, the rows of the matrix Y^τ of each task τ, we deduce the importance of each state s_t^τ in the actions selected by the actor over time for task τ. Therefore, we can compute a weight matrix M ∈ R^{l_b × |T|} that weighs each state s_t^τ during the training of the agent's policy π_θ. To calculate the weight matrix M, we employ a two-layer MLP λ(·) with softmax activation:

$$ M = \lambda\big(Y^{\tau_1}, \ldots, Y^{\tau_{|T|}}; \omega\big), \quad (7) $$

where ω are the parameters of the MLP transformation function λ(·). Intuitively, we give stronger preference to the states s_t^τ that contribute more to the learning strategy of the agent than the rest of the states. This means that our agent learns the policy π_θ based on the most important states s_t^τ.
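The sketch below outlines the task importance component, using PyTorch's standard nn.TransformerEncoder as a stand-in for the GTrXL encoder ψ(·) of Equation 6. The way the per-task encodings Y^τ are concatenated before the two-layer MLP of Equation 7, the softmax over the task dimension, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TaskImportance(nn.Module):
    """Encodes the buffered state sequence of each task and maps the encodings
    to a weight matrix M of shape (l_b, |T|); the softmax over the task
    dimension makes the per-transition weights of the tasks sum to one."""

    def __init__(self, d_s: int, d_y: int, n_tasks: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_s, d_y)
        layer = nn.TransformerEncoderLayer(d_model=d_y, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for GTrXL
        self.mlp = nn.Sequential(nn.Linear(n_tasks * d_y, d_y), nn.ReLU(),
                                 nn.Linear(d_y, n_tasks))

    def forward(self, states_per_task):
        # states_per_task: list of |T| tensors, each of shape (l_b, d_s),
        # holding the buffered states of the engagement and adoption tasks.
        encoded = [self.encoder(self.proj(s).unsqueeze(0)).squeeze(0)  # Y^tau: (l_b, d_y)
                   for s in states_per_task]
        joint = torch.cat(encoded, dim=-1)             # (l_b, |T| * d_y)
        return torch.softmax(self.mlp(joint), dim=-1)  # M: (l_b, |T|)

# Example with a replay buffer of l_b = 128 transitions per task.
M = TaskImportance(d_s=128, d_y=64)([torch.randn(128, 128), torch.randn(128, 128)])
```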
According to our architecture in Section 3.1, the multi-task learner optimizes the joint loss function L_policy to compute the parameters w and θ of the policy component in Equations 3 and 5. In addition, based on the joint loss function L_learner, we calculate the parameters ζ of the multi-task learner component and update the parameters η and ω of the task importance component in Equations 6 and 7. The input of the multi-task learner component is the l_b state-action transitions of each task τ stored in the replay buffer, together with the weight matrix M generated by the task importance component. The multi-task learner component calculates the state-action value Q(s_k^τ, a_k^τ) of the k-th stored transition and updates it as

$$ Q(s_k^\tau, a_k^\tau) \leftarrow Q(s_k^\tau, a_k^\tau) + \alpha \big[ r(s_k^\tau, a_k^\tau) - M_{\tau,k}\, Q(s_k^\tau, a_k^\tau) \big], \quad (8) $$

where α is the learning rate. The term [r(s_k^τ, a_k^τ) − M_{τ,k} Q(s_k^τ, a_k^τ)] corresponds to the benefit of taking the action a_k^τ given the state s_k^τ. The expected value Q(s_k^τ, a_k^τ) is weighted by M_{τ,k} so as to strengthen or weaken the contribution of the state s_k^τ when learning the policy π_θ, accordingly. The joint loss function L_learner is formulated as a mean squared error over this term, minimized with respect to the parameters η, ω and ζ:

$$ \mathcal{L}_{learner} = \sum_{\tau \in \mathcal{T}} \sum_{k=1}^{l_b} \big[ r(s_k^\tau, a_k^\tau) - M_{\tau,k}\, Q(s_k^\tau, a_k^\tau) \big]^2. \quad (10) $$

Overall, to train our model we consider that the agent interacts with the environment in an episodic manner [21]. This means that the agent interacts with the environment within a finite horizon of T interactions/events. We train our model for multiple episodes and optimize the joint loss functions L_policy and L_learner in Equations 9 and 10 with respect to the parameters w, θ, η, ω and ζ through backpropagation with the Adam optimizer [12].

- Environment. In our experiments, we evaluate the performance of the proposed model in selecting the timestamp h of each event that maximizes the viewer's engagement u_t and adoption v_t. For each dataset we order the events according to their timestamps, and consider the first 70% of the events as the training set E_train, 10% for validation E_val and 20% for testing E_test. The agent interacts with an emulated environment which models the behavioural policy π_β of the events of each dataset. Following [7, 9, 30], to emulate the behavioural policy π_β we train a multi-head neural network on each dataset, which takes as input a sequence of events and outputs the average engagement and adoption of the next event. During the agent's training, we initialize the reinforcement learning environment with the events of the training set E_train. To initialize the state s_t^τ of the agent, we randomly select an event e_t ∈ E_train from the training set. At each step t = 1, ..., T, the agent takes an action a_t^τ for each task τ. Then, the agent receives the average engagement u_t and adoption v_t generated by the behavioural policy π_β as the reward of each task. To evaluate the learned policy π_θ, we initialize the reinforcement learning environment with the events of the test set E_test. Similar to the training strategy, the state s_t^τ of the agent is initialized by randomly selecting an event e_t ∈ E_test from the test set. The agent takes an action a_t^τ and receives the reward from the multi-head network which models the behavioural policy π_β of the test set E_test.

- Evaluation Metrics. We evaluate the performance of our proposed model in terms of the step-wise variant of Normalized Capped Importance Sampling (NCIS) for each task [22, 30]:

$$ \mathrm{NCIS} = \frac{\sum_{t} \bar{\rho}\, r(s_t^\tau, a_t^\tau)}{\sum_{t} \bar{\rho}}, \qquad \bar{\rho} = \min\!\left(\delta, \frac{\pi_\theta(a_t^\tau \mid s_t^\tau)}{\pi_\beta(a_t^\tau \mid s_t^\tau)}\right), \quad (11) $$

where ρ̄ is the capped importance ratio and δ is a threshold that ensures small variance and controls the bias of the policy π_θ towards the behavioural policy π_β. The term ρ̄ r(s_t^τ, a_t^τ) is the capped importance-weighted reward of a task τ. Intuitively, by adopting different rewards in the term ρ̄ r(s_t^τ, a_t^τ), we can measure how well the policy π_θ approximates the behavioural policy π_β. By setting the reward r(s_t^τ, a_t^τ) equal to the viewer's engagement and adoption as in Section 3, we evaluate the performance of the proposed model with the respective metrics Eng. NCIS and Ad. NCIS for the two tasks. As the emulated environment is initialized randomly, we repeated our experiments five times and report the average Eng. NCIS and Ad. NCIS.
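The following is a minimal sketch of the step-wise capped, self-normalized estimator of Equation 11; the capping threshold and the exact normalization are assumptions consistent with the description above, not the paper's released evaluation code.

```python
import numpy as np

def ncis(rewards, pi_theta, pi_beta, delta=10.0):
    """Cap the importance ratio pi_theta/pi_beta at delta and return the
    self-normalized, importance-weighted average reward (Equation 11)."""
    rewards = np.asarray(rewards, dtype=float)
    ratio = np.asarray(pi_theta) / np.asarray(pi_beta)  # rho_t
    capped = np.minimum(ratio, delta)                   # capped ratio
    return (capped * rewards).sum() / capped.sum()

# Example: probabilities of the chosen timestamp under the learned and
# behavioural policies, paired with the observed engagement rewards u_t.
eng_ncis = ncis(rewards=[0.42, 0.37, 0.55],
                pi_theta=[0.30, 0.22, 0.41],
                pi_beta=[0.25, 0.20, 0.35])
```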
- Baselines. We compare the proposed MERLIN model against the following strategies: FeedRec [30], AMT [5], IMPALA [8] and PopART [11]. As there are no publicly available implementations of FeedRec and PopART, we implemented both from scratch and published our source code.

- Parameter Configuration. For each examined model, we tuned the hyperparameters on the validation set, following a grid-selection strategy. In FeedRec, we set the state representation dimensionality to d_s = 256 for Enterprises 1 and 3, and to d_s = 128 for Enterprises 2 and 4. At the t-th step, the FeedRec model takes as input all the events that occurred prior to the current step, with l = 0. In AMT we fix a d_s = 128 dimensional state representation for all datasets, with a time window of l = 30 previous events. In IMPALA and PopART the state representation dimensionality is fixed to d_s = 64 for all enterprises, and the window length l is set to 20 and 23, respectively. In the proposed MERLIN model we use a d_s = 128 dimensional state representation for Enterprises 1 and 4, and 256 and 64 for Enterprises 2 and 3, respectively. The window length l is fixed to 10 for Enterprise 1, and to 15 for Enterprises 2, 3 and 4. In addition, the size of the replay buffer l_b is set to 128 for all enterprises. In all the examined models, we follow an ε-greedy exploration-exploitation strategy and set ε = 0.1. The discount factor γ is fixed to 0.92 and the learning rate is set to α = 0.001. In the emulated environment, we set the number of interactions/events to 200 and the number of episodes to 300.

All our experiments were conducted on a single server with an Intel Xeon Bronze 3106 1.70GHz CPU, running Ubuntu 18.04 LTS. We accelerated the training of the models with a GeForce RTX 2080 Ti graphics card. The proposed MERLIN model was implemented in PyTorch 1.7.1, and we created the reinforcement learning environment with the OpenAI Gym 0.17.3 library.
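For reference, the reported hyperparameters can be gathered into a single configuration; the dictionary layout below is purely illustrative and is not taken from the released code.

```python
# Hyperparameters of MERLIN as reported in the text.
merlin_config = {
    "state_dim": {"Enterprise 1": 128, "Enterprise 2": 256,    # d_s
                  "Enterprise 3": 64, "Enterprise 4": 128},
    "window_length": {"Enterprise 1": 10, "Enterprise 2": 15,  # l
                      "Enterprise 3": 15, "Enterprise 4": 15},
    "replay_buffer_size": 128,   # l_b
    "epsilon": 0.1,              # epsilon-greedy exploration
    "gamma": 0.92,               # discount factor
    "learning_rate": 1e-3,       # Adam
    "interactions_per_episode": 200,
    "episodes": 300,
}
```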
In Table 2 we evaluate the performance of the examined models in terms of the average Eng. NCIS and Ad. NCIS over the five trials in the emulated environment for the engagement and adoption tasks, respectively. The proposed MERLIN model significantly outperforms the baselines in all datasets. This indicates that MERLIN can efficiently learn a common policy π_θ that optimizes both tasks concurrently. Compared with the second best method FeedRec, MERLIN achieves relative improvements of 15.76% and 15.96% in terms of Eng. NCIS and Ad. NCIS, respectively. FeedRec performs better than the other baseline approaches because it formulates a joint loss function for training the agent on the different tasks. However, each task in FeedRec contributes equally when learning the policy π_θ, and therefore the agent ignores the evolutionary patterns and the importance of the state-action transitions of each task. The proposed MERLIN model overcomes this problem by integrating the training parameters of the task importance component in the common learning strategy of the policy and multi-task learner components. In doing so, MERLIN balances the contribution of each task to the generated policy.

In Figure 3 we report the Eng. reward and Ad. reward, based on Equations 1 and 2 for the engagement and adoption tasks respectively, as the interactions/events evolve in the emulated environment. We observe that MERLIN consistently achieves higher rewards than the other baseline approaches from the first interactions. This demonstrates the effectiveness of MERLIN in weighing the importance of each task during training and learning a policy that optimizes both tasks. In addition, we observe that the Ad. reward of MERLIN in the adoption task converges faster in Enterprises 2, 3 and 4 than in Enterprise 1. As discussed in Section 2, the viewer's adoption in Enterprises 2, 3 and 4 increases over time. Therefore, the task importance component promotes the adoption task during the training of the policy, thus achieving a high reward in Enterprises 2, 3 and 4 at the beginning of the interactions.

In the next set of experiments we compare the proposed MERLIN model with its variant MERLIN-S. In particular, the agent of the variant MERLIN-S is trained on a single task, ignoring the multi-task learning strategy of MERLIN. In Figure 4 we examine the performance of MERLIN and MERLIN-S when varying the window length l of past events in the four enterprises. We observe that MERLIN consistently outperforms the single-task variant MERLIN-S. Notice that MERLIN-S achieves its best performance when the window length l is set to 15 past events for Enterprise 1, and 20 for Enterprises 2, 3 and 4. Therefore, MERLIN-S requires a larger window length l than MERLIN in all enterprises, as MERLIN-S omits the auxiliary information of the other task when training the agent.

In this study, we presented a multi-task reinforcement learning strategy to train an agent to select the optimal time of a live video streaming event in large enterprises, aiming to improve the viewer's engagement and adoption. In the proposed MERLIN model, we formulate the engagement and adoption tasks as different MDPs and design a joint loss function to extract knowledge from both tasks. To determine the contribution of each task to the training strategy of the agent, we implement a task importance component that extracts the most important information, that is, the most important state-action transitions, from the replay buffer based on the Transformer's architecture. Having weighted the transitions, the agent of MERLIN learns a common policy for both tasks. Our experiments with four real-world datasets demonstrate the superiority of our model against several baseline approaches in terms of the viewer's engagement and adoption.

The proposed MERLIN model can significantly help enterprises in selecting the optimal time of an event. Provided that nowadays the majority of the events are online, the enterprises want to ensure that their employees/viewers adopt the video streaming events with high engagement. This means that, with the help of MERLIN in scheduling the live video streaming events, the enterprises can communicate with their employees efficiently, which in turn translates into significant productivity gains [2]. An interesting future direction is to study the influence of distillation strategies on the proposed MERLIN model [16].

References
- Gauging demand for enterprise streaming - 2020 - investment trends in times of global change
- Using video for internal corporate communications, training & compliance
- VStreamDRLS: Dynamic graph representation learning with self-attention for enterprise distributed video streaming solutions
- Attentive multi-task deep reinforcement learning
- Sparse multi-task reinforcement learning
- Top-k off-policy correction for a REINFORCE recommender system
- IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures
- Offline A/B testing for recommender systems
- Generating sequences with recurrent neural networks
- Multi-task deep reinforcement learning with PopArt. In: AAAI
- Adam: A method for stochastic optimization
- Continuous control with deep reinforcement learning
- End-to-end deep reinforcement learning based recommendation with supervised embedding
- Working memory graphs
- Efficient transformers in reinforcement learning using actor-learner distillation
- Stabilizing transformers for reinforcement learning
- Survey of model-based reinforcement learning: Applications on robotics
- Mastering chess and shogi by self-play with a general reinforcement learning algorithm
- Deterministic policy gradient algorithms
- Reinforcement learning: An introduction
- The self-normalized estimator for counterfactual learning
- Distral: Robust multitask reinforcement learning
- Attention is all you need
- A survey of multi-task deep reinforcement learning
- Self-supervised reinforcement learning for recommender systems
- Mastering complex control in MOBA games with deep reinforcement learning
- The ingredients of real-world robotic reinforcement learning
- What to do next: Modeling user behaviors by Time-LSTM
- Reinforcement learning to optimize long-term user engagement in recommender systems