key: cord-0164210-98wkhzgl authors: Deliu, Nina; Williams, Joseph Jay; Chakraborty, Bibhas title: Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions date: 2022-03-04 journal: nan DOI: nan sha: 85c0720d1e03ecc8e5f5119508cc5bcedb748244 doc_id: 164210 cord_uid: 98wkhzgl Reinforcement learning (RL) is acquiring a key role in the space of adaptive interventions (AIs), attracting a substantial interest within methodological and theoretical literature and becoming increasingly popular within health sciences. Despite potential benefits, its application in real life is still limited due to several operational and statistical challenges--in addition to ethical and cost issues among others--that remain open in part due to poor communication and synergy between methodological and applied scientists. In this work, we aim to bridge the different domains that contribute to and may benefit from RL, under a unique framework that intersects the areas of RL, causal inference, and AIs, among others. We provide the first unified instructive survey on RL methods for building AIs, encompassing both dynamic treatment regimes (DTRs) and just-in-time adaptive interventions in mobile health (mHealth). We outline similarities and differences between the two areas, and discuss their implications for using RL. We combine our relevant methodological knowledge with motivating studies in both DTRs and mHealth to illustrate the tremendous collaboration opportunities between statistical, RL, and healthcare researchers in the space of AIs. In the era of big data and digital innovation, healthcare is undergoing a process of rapid and dramatic change, transitioning from the one-size-fits-all standards to the tailored approach of precision medicine or personalized medicine (Kosorok and Laber, 2019) . This paradigm intersects a broad range of scientific domains from genomics to causal inference, all in support of evidencebased, i.e., data-driven, decision-making. Conceptually, under this framework, "disease treatment and prevention takes into account individual variability in genes, environment, and lifestyle for each person" (Precision Medicine Initiative; Collins and Varmus, 2015) , allowing for more accurate prediction on which treatments and prevention strategies may benefit each group of patients affected by a particular disease. One of the key methodological lines of research within the domain of personalized medicine is the development of evidence-based adaptive interventions (AIs; Almirall et al., 2014; Collins et al., 2004) . The fundamental problem of AIs is to operationalize sequential clinical decision-making by tailoring interventions to individuals with the aim of optimizing their individual outcomes. In clinical practice, a typical problem is represented by the ubiquitous situation in which a doctor needs to define a set of treatment rules (i.e., a treatment regime) that dictate how to personalize treatments to patients based on their individual characteristics. Tailoring information accounts for both baseline information, e.g., demographics or pretreatment clinical conditions, and each uniquely evolving (time-varying, dynamic) health status, e.g., previous responses to treatments. Thus, such a treatment regime is "dynamic" within a person, due to their changing disease status. 
To the patient, this sequence of treatments seems like standard treatment; to the clinician, it represents a series of decisions made according to information from previous patients with similar treatment history, characteristics, and behaviors; and to the statistician, this constitutes a dynamic treatment regime or regimen (DTR; Murphy, 2003; Lavori and Dawson, 2004; Chakraborty and Murphy, 2014) . Conceptually, DTRs can be viewed as a decision support system of a clinician (or more generally, a decision maker). A more recent, but rapidly expanding healthcare domain, in which the vehicular characteristic of AIs represents a powerful resource, is mobile health (mHealth; Istepanian et al., 2007; Kumar et al., 2013; Rehg et al., 2017) . mHealth refers to the use of mobile (or wearable) technologies for health promotion-often targeting behavioral aspects-in both clinical and non-clinical populations. A high-level goal in mHealth is to deliver efficacious just-in-time interventions, in response to rapid changes in individual's circumstances. A major challenge is thus not only to handle "the right individual with the right intervention", but also "at the right time", avoiding over-treatment and its consequences on user engagement (e.g., low adherence to recommendations or discontinued usage of the app). The time component is a key element of mHealth interventions, and it is often part of the interventions set defining such real-time AI rules, known as justin-time adaptive interventions (JITAIs; Tewari and Murphy, 2017; Nahum-Shani et al., 2018) . Notably, despite their relatively recent uptake compared to DTRs, JITAIs are currently registering an increasing interest within the AI domain, with a trend that suggests a dominating relevance. We support our beliefs with a literature search on Google Scholar, quantifying the evolving literature interest in the two areas from 2000 to 2021 (see FIG. 1) . Notice that, both DTRs and JITAIs terminologies have been originated after 2000, specifically 2003 for DTRs (Murphy, 2003) and 2013 for JITAIs (Timms et al., 2013) . If on one side the increasing technological sophistication has led to new biomedical data sources-e.g., mobile/wearable devices and electronic health records (EHRs)-that offer a powerful resource for improving healthcare, on the other side it poses some unique new challenges, ranging from ethical and societal concerns, to transparency and data analysis. Among the various existing challenges (see Zicari, 2014 , for a comprehensive survey), within statistics, the ability to process and meaningfully analyze such complex data dominates over other interests (Sivarajah et al., 2017) , continuously calling for new statistical tools. Machine learning (ML) algorithms, which directly learn from the observations and can automatically improve through experience, could complement classical statistical tools, and support clinicians in their decisionmaking (see e.g., Deo, 2015; Rajkomar et al., 2019 , for an overview of successful applications). By matching characteristics of an individual patient to a computerized clinical knowledge base, such algorithms can suggest patient-specific assessments or (treatment) recommendations, and aid complex decisions (Sutton et al., 2020) . There is no bright line between ML models and traditional statistical models (Beam and Kohane, 2018) . 
However, it is well documented that sophisticated ML models (e.g., deep learning models; Goodfellow et al., 2016) are well suited to learn from high-dimensional and heterogeneous data generated from modern clinical care, often with improved performances compared to standard statistical models. Among the existing ML paradigms (Bishop, 2006; Mohri et al., 2018) , reinforcement learning (RL; Sutton and Barto, 2018; Bertsekas, 2019; Sugiyama, 2015) , offers an ideal framework for sequential decision-making problems. In RL, at each time step of a sequential process, an agent interacts with an unknown environment, in which it takes action(s). Based on the feedback or reward received from the environment for the selected action(s), the agent learns, by trial-and-error, on how to make better actions in order to optimize the cumulative feedback over time. The RL framework is abstract and flexible and can be applied in a variety of (healthcare) domains where the problem has a sequential nature (Chakraborty and Moodie, 2013; Gottesman et al., 2019) , by specifically characterizing the environment (or domain) dynamics. Well suited to the problem at hand, RL has been introduced in the clinical arena as a tool for data analysis for discovering optimal DTRs in life-threatening diseases such as cancer (Zhao et al., 2009; Goldberg and Kosorok, 2012) , as well as broader healthcare and behavioral areas, going from physical activity (Yom-Tov et al., 2017) and weight loss management (Forman et al., 2019; Pfammatter et al., 2019) to substance use (Murphy et al., 2007b) . Real-world examples, included the PROJECT QUIT -FOREVER FREE (Chakraborty and Moodie, 2013) smoking-cessation study we participated in, will be discussed in Sections 2 and 6.1. Similar healthcare domains are currently employing RL for deploying JITAIs to promote health behavior changes. Such domains typically reflect underlying behavioral theories (see e.g., Heron and Smyth, 2010) , based on which interventions are expected to have an in-the-moment effect on a (proximal, short-term) outcome. Moreover, being delivered in dynamic environments where context and options can change rapidly, JITAIs require a continuous and computationally feasible learning. These arguments motivate assumptions on RL dynamics that do not take into account states progression (states are assumed to be independent), and the RL problem boils down to solving a one-state Markov decision process, or a stochastic multiarmed bandit (MAB) problem (Sutton and Barto, 2018; Lattimore and Szepesvári, 2020) . As we will thoroughly discuss in Section 3.2.2, such MAB problems can be seen as the simplest form of RL, in which the agent is stateless (Bouneffouf et al., 2020) . A concrete application of this framework will be illustrated in Sections 2 and 6.2, where we will discuss the real-world implementation of the DIAMANTE Figueroa et al., 2020) mHealth study on physical activity. Given the increasing number of mHealth studies, and in tandem the increasing interest among methodologists in addressing challenges arising from that field (see also FIG. 1) , in this work, we combine our knowledge with the development process of motivating studies, to provide an extensive overview of the state-of-the-art of RL in mHealth. 
To the best of our knowledge, this represents the first comprehensive survey on methods, applications, and challenges in developing JITAIs in mHealth, to add to the theoretical and applied literature, and to the extensively surveyed DTR problems (see e.g., Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014; Tsiatis et al., 2019) . Indeed, we discuss this emerging field in parallel with the more developed DTR domain, reporting similarities and divergences between the two types of AIs. This will serve as an update of the recent RL progress in DTRs, and will provide the research community with a unified overview that brings together the two applied areas of DTRs and JITAIs and their formalization under the RL framework. Besides offering such a unified view of the problem, we also provide researchers engaged in data-driven health decision support with: 1) a comprehensive illustration of the state-of-the-art of RL methodologies for developing optimal AIs; 2) examples of real-life RL-based AIs studies; and 3) current challenges and open problems in the field. We believe that there is ample scope for important practical advances in these areas, and with this survey we aim to make it easier for theoretical disciplines to join forces to assist healthcare discoveries by developing the next generation of methods for AIs in healthcare. Considering the extent of our general goal, we summarize the key contributions of this work-in relation to its structure-as follows: • Section 2: We provide a general panorama on the problem of interest, i.e., AIs, unifying both DTRs and JI-TAIs under this common framework, and discussing similarities and differences between them. • Section 3: We offer a mathematical formalization of the RL framework, and its subclasses, and augment it with the causal inference framework for building an umbrella under which AIs can be developed. We translate the key terminologies between the RL, statistics, and AI domains, assimilating also the different existing notations into a coherent body of work. This offers a foundation to more easily conduct research in both theoretical and applied sciences, enhancing communication and synergy between the different areas. • Section 4-5: We illustrate the main RL-based approaches for developing i) DTRs (both finite-time and indefinite-time horizon settings), with a focus on novel proposals, and ii) JITAIs for mHealth, highlighting theoretical and applied contributions. • Section 6: Using as motivating examples the PROJECT QUIT -FOREVER FREE and the DIAMANTE studies, we discuss the applicability of RL in DTRs and JITAIs in mHealth, and the potential challenges that may arise in their development, highlighting the crucial role of the biostatistical community. In medical research, AIs offer a vehicle to operationalize a sequential decision-making process over the course of a disease or a program, with the aim of optimizing individual health-related outcomes. Technically speaking, AIs are defined via explicit sequences of decision rules that pre-specify how the type, intensity, and delivery of intervention options should be adjusted over time in response to individuals' progress Nahum-Shani et al., 2018) . The pre-specified nature of AIs increases their replicability in research, enhancing the assessment of their effectiveness (Nahum-Shani and Almirall, 2019) . 
Existing frameworks for formalizing AIs (Collins et al., 2004; Almirall et al., 2014) are primarily founded on four key components: (i) the decision points specifying the time points at which a decision concerning intervention has to be made; here we assume a number of finite or indefinite discrete times t = 0, 1, . . . ; (ii) the decisions or intervention options at each time t, that may correspond to different types, dosages (duration, frequency or amount; Voils et al., 2012) , or delivery mode, as well as various tactical options (e.g., augment, switch, maintain); we denote them with A t ∈ A t , where A t is the decisions space, generally discrete; (iii) the tailoring variable(s) at each time t, say X t ∈ X t , with X t ⊆ R p , capturing individuals' baseline and time-varying information for personalizing decision-making; (iv) the decision rules d = {d t } t≥0 , that, at each time t, link the tailoring variable(s) to specific decisions or interventions. A common and useful illustrative way of describing an AI is through schematics such as the one shown in FIG 2, where we also specify each of the four components. "Ifthen" statements make it clear how the decision rule prespecify the intervention option under various conditions. However, since the adaptation performed by AIs are aimed for optimizing individual health-related outcomes, two additional components have a determinant role in their definition (Nahum-Shani and Almirall, 2019), i.e.: (v) the proximal outcome(s), say {Y t } t>0 , directly targeted by an intervention, easily observable and expected to influence a longer-term outcome of interest, according to some mediation theory (MacKinnon et al., 2007) ; (vi) the distal outcome(s), representing the long-term health outcome of interest and ultimate goal of the overall AI. Note that proximal outcomes can also be used as tailoring variables for guiding later-stage decisions. In FIG 2, for instance, the response status at week 2, 4 and 8 can represent both the proximal outcome targeted by each of the four-stage decision rules and a tailoring variable used in subsequent decision rules but first. The development and establishment of AIs or decision rules is guided by domain theory, clinical knowledge and empirical evidence on the relationship between tailoring, proximal and distal outcome(s) and intervention options. Determination of optimized decision rules also involves more sophisticated statistical and ML tools, with RL representing a state-of-the art solution. AIs are known by a variety of different names, such as adaptive treatment strategies (Murphy, 2005a; Murphy et al., 2007a) and treatment policies (Lunceford et al., 2002; Wahed and Tsiatis, 2006; Dawson and Lavori, 2012) , and are often used as synonyms of dynamic treatment regimes or regimens (Murphy, 2003; Lavori and Dawson, 2004; Laber et al., 2010; Chakraborty and Moodie, 2013; Laber et al., 2014c) . However, given their more generic nature, we use the term AIs to refer to a general framework for personalizing interventions sequentially based on an individual's time-varying characteristics. This broader definition embraces a considerate number of applications, including non-healthcare (e.g., education Nahum-Shani and Almirall, 2019) and healthcare domains, with the development of DTRs and JITAIs. In medical research, DTRs define a sequence of treatments individually tailored to each patient based on their baseline and time-varying (dynamic) state. 
Based on the disease type, we can distinguish two main domains for building DTRs: chronic diseases and critical care. The former refer to long-lasting conditions and include the leading causes of death and disability (e.g., cancer, diabetes, mental illness, obesity), while in critical care, patients need urgent medical treatment. Due to the demographic shift in most high-income populations, the management of chronic conditions has seen increasing interest in personalized evidence-based strategies, moving away from single-stage and average-based protocols (Wagner et al., 2001). In contrast with traditional single-stage treatments in which individuals are assigned the same type of treatment, DTRs explicitly incorporate the heterogeneity in treatment across individuals and the heterogeneity in treatment across time within an individual (Murphy, 2003), providing an attractive framework for personalized treatment in longitudinal settings. In addition, by treating only subjects who show a need for treatment, DTRs hold the promise of reducing non-compliance by subjects due to overtreatment or undertreatment (Lavori and Dawson, 2000; Collins et al., 2001). At the same time, they are attractive to public policy makers, allowing a better allocation of public and private funds for more intensive treatment of the needy (Murphy, 2003). According to the AI framework's components, a DTR is characterized by a set of decision points at which treatment decisions are made through a set of decision rules, here named DTR. In this specific setting, the decision points generally correspond to two (t = 0, 1) or three (t = 0, 1, 2) stages of intervention. At each stage t, a treatment A_t ∈ A_t is selected based on the patient's covariates X_t, which may include an intermediate outcome assessed after the preceding stage. Notably, decision rules must target the overall (long-term) outcome rather than each stage-specific outcome in isolation, because what appears to be the best initial treatment may not yield the best overall outcome. Beyond personalizing or tailoring treatment, DTRs can identify and evaluate delayed effects, i.e., effects that do not occur immediately after treatment but may affect a person or their disease later in time or set the stage for subsequent treatment. For developing DTRs, main sources of data include longitudinal observational studies and sequential multiple assignment randomized trials (SMARTs; Lavori and Dawson, 2000; Dawson and Lavori, 2012; Murphy, 2005a). While observational studies are much more common, SMARTs are experiencing a period of growth and are the current gold standard for developing DTRs (Lei et al., 2012; Kasari et al., 2014). A SMART design is characterized by multiple stages of treatment, each stage corresponding to one of the critical decision time points. A concrete example, summarizing the SMART Weight Loss Management study (Pfammatter et al., 2019), is provided in FIG 3. Here, at study entry, all individuals are uniformly randomized to one of two first-stage interventions: either mobile app alone (App) or App combined with Coaching. Starting from week 2, those achieving < 0.5 lb weight loss on average per week are classified as non-responders and are re-randomized to one of two second-stage augmentation tactics: either modest augmentation-consisting of adding a supportive text message (TXT)-or vigorous augmentation, i.e., TXT combined with Coaching or meal replacement (MR). Responders continue the initial treatment option.
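To make the notion of a decision rule concrete, the following sketch encodes the second-stage rule of the SMART Weight Loss Management example just described. It is purely illustrative: the function and variable names are ours, and only the 0.5 lb/week response threshold and the intervention options come from the study description above.

```python
# Illustrative encoding of the second-stage decision rule embedded in the
# SMART Weight Loss Management example (FIG 3). This is a sketch with
# invented function and variable names, not the study's actual software.

def second_stage_rule(first_stage: str,
                      avg_weekly_weight_loss_lb: float,
                      augmentation_arm: str) -> str:
    """Map the tailoring variable (response status) to a second-stage option."""
    responder = avg_weekly_weight_loss_lb >= 0.5   # response status at week 2
    if responder:
        return first_stage                          # continue initial option
    # Non-responders are re-randomized to one of two augmentation tactics.
    if augmentation_arm == "modest":
        return first_stage + " + TXT"
    return first_stage + " + TXT + Coaching/MR"     # vigorous augmentation

# A non-responder on App alone, re-randomized to modest augmentation:
print(second_stage_rule("App", 0.3, "modest"))      # -> "App + TXT"
```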
Because different subsequent intervention options are considered for responders (continue) and non-responders (modest or vigorous augmentation), the response status is embedded as a tailoring variable. Such multistage restricted randomizations give rise to several DTRs that are embedded in the SMART (see Chakraborty and Moodie, 2013, for more details on embedded regimes). The continuous improvement and use of mobile technologies has driven the development of a new area for health promotion, known as mHealth (Istepanian et al., 2007), involving both clinical and non-clinical populations. A key objective in mHealth is to deliver efficacious interventions in response to rapid changes in an individual's circumstances, while avoiding over-treatment as it leads to user disengagement (e.g., recommendations are ignored, the app is deleted). This can be efficiently achieved by real-time AIs, termed JITAIs (Nahum-Shani et al., 2018). In mHealth, JITAIs refer to a sequence of decision rules that use data collected continuously through mobile technologies (e.g., wearable devices, accelerometers, or smartphones) to adapt intervention components in real-time for supporting behavior change and promoting health. The peculiarity of JITAIs is that they deliver interventions according to the user's in-the-moment context or needs, e.g., time, location, or current activity, including considerations on whether and when the intervention is needed. Unlike DTRs, the number of decision points in JITAIs can be hundreds or even thousands, and the intervention can be delivered each minute, hour, or day. This distinctive feature has, at least partially, contributed to their increasing popularity in a variety of behavioral domains for improving, e.g., physical activity (Consolvo et al., 2008; Van Dantzig et al., 2013; Hardeman et al., 2019), illness management support (Ben-Zeev et al., 2014), addictive disorders in alcohol and drug use (Gustafson et al., 2020), smoking cessation (Naughton, 2017) and obesity/weight management (Patrick et al., 2009; Aswani et al., 2019). Aligned with the definition of AIs, JITAIs are defined according to: (i) the distal outcome(s), i.e., the ultimate goal (typically clinical) of the study, and (ii) the proximal outcome(s), which define intermediate measures of the distal outcome(s), upon which intervention(s) act. However, unlike DTRs-which target the distal outcome and may or may not have an intermediate (proximal) outcome-in JITAIs, proximal outcomes represent the direct and in-the-moment target of the intervention. The distal outcome is expected to improve only based on domain knowledge of its relationship with the proximal outcomes, but it is not formally included in an optimization problem when optimizing JITAIs. A detailed comparison between DTRs and JITAIs is given in TABLE 1. Typical experimental designs for building JITAIs are represented by factorial experiments (Collins et al., 2009) or, most notably, micro-randomized trials (MRTs; Klasnja et al., 2015). In MRTs, individuals are randomized hundreds or thousands of times over the course of the study, and, in a typical multicomponent intervention study, the multiple components can be randomized concurrently, making micro-randomization a form of a sequential factorial design.
The goal of these trials is to optimize mHealth interventions by assessing the causal effects of each randomized intervention component and evaluating whether the intervention effects vary with time or the individuals' current context. To better understand the characteristics and value of MRTs, let us now consider the DIAMANTE real-life study involving JITAIs to promote physical activity, with design illustrated in FIG 4. In this study, the intervention options (i.e., factor levels) include whether or not to send a text message, which type of message to deliver, and at which time; the proximal outcome is the change in the number of steps the person walked today from yesterday; and the context is given by a set of user's individual variables such as baseline health status or study day. In order to develop optimized JITAIs, in addition to assessing the causal role of each intervention on the outcome, users are assigned to different study groups (see FIG 4), including a static (non-optimized) group and an RL-based adaptive group. In both groups, users are randomized every day to receive a combination of the different factors' levels, delivered within different time-frames. The adaptive RL-based optimized group strategy will be illustrated in Section 6.2, after introducing the RL and the causal inference frameworks in Section 3.

TABLE 1. Comparison between DTRs and JITAIs.
• Number of decision points. DTRs: generally small (typically two or three stages of intervention). JITAIs: generally very large (for a single unit, hundreds or even thousands of decision points) and can be fixed or random (e.g., when a user makes a request).
• Distance between decision points. DTRs: sufficiently long according to the expected time to capture a potential effect (including a delayed effect) of the intervention on the primary outcome of interest or a strong surrogate. JITAIs: generally quite short according to the expected "in-the-moment" effect of the intervention on the proximal outcome (e.g., every few minutes, hours or daily); in JITAIs, the time between decision points is often too short to capture the (distal) outcome of interest, and they rely on a weak surrogate, i.e., the proximal outcome.
• Type of intervention. DTRs: drugs or behavioral interventions. JITAIs: generally behavioral interventions (e.g., motivational/feedback messages, coaching, reminders) with few exceptions (e.g., insulin adjustments).
• Intervention delivery. DTRs: assigned by the care provider during an appointment or through digital devices. JITAIs: generally digital and/or mobile devices according to the mHealth domain.
• Tailoring variable(s). DTRs: can include the full or partial history of baseline and time-varying patient's information; external context can also be considered, but it generally occupies a secondary relevance. JITAIs: current user's information, and any type of variable related to their momentary context (e.g., availability, weather); momentary context occupies a primary role and can be very granular.
• Outcome variable(s) directly targeted by the intervention. DTRs: primarily a distal outcome (long-term goal), but intermediate outcomes (short-term goals expected to impact the long-term outcome) are often part of the optimization. JITAIs: proximal outcomes, i.e., short-term goals or mechanisms of change that are expected to impact a distal outcome; changes in the long-term outcome are secondary or a by-product.
• Data sources for constructing AIs. DTRs: SMARTs, longitudinal observational data including EHRs, dynamical systems models. JITAIs: MRTs, factorial designs, single-case experimental designs, randomized controlled trials (RCTs).
• Primary RL methods for optimizing interventions. DTRs: primarily offline methods for finite-horizon decision problems; in general, all the individual's history (all the observed covariates, interventions, responses) is considered, with no simplifying (e.g., Markov) assumptions on the structure; for EHR-based DTRs, indefinite-horizon methods may be considered. JITAIs: generally online methods with an infinite horizon; considering the "in-the-moment" effect of the intervention, typically only the current or last observed information is accounted for; Markov, partially observed Markov, or other simpler structures are used.

Generally speaking, RL is an ML area concerned with determining optimal action selection policies in sequential decision-making problems (Sutton and Barto, 2018; Bertsekas, 2019). This framework is based on continuous interactions between a decision maker or learning agent and the environment it wants to learn about. We now characterize this interaction with the appropriate notation and formalize the general problem. In RL, at each stage or time step t (here we assume a discrete time space, with t ∈ N = {0, 1, . . .}), an agent faces a decision-making problem in an unknown environment. After receiving some representation of the environment's state or context, say X_t ∈ X_t, it selects an action, denoted by A_t, from a set of admissible actions A_t. As a result, one time step later, the environment responds to the agent's action by making a transition into a new state X_{t+1} ∈ X_{t+1} and (typically) providing a reward Y_{t+1} ∈ Y_{t+1} ⊂ R. By repeating this process over time t ∈ N, the result is a single trajectory i, i = 1, . . . , N, of states visited, actions pursued, and rewards received. In a medical context, this trajectory can be viewed as the collection of information (e.g., covariates, treatments, and responses to treatments) of a single patient i over the course of a disease. Note that in some settings there may be only one terminal reward (or a final outcome, e.g., overall survival or school performance at the end of the study, Pelham et al., 2002); in this case, rewards at all previous stages are taken to be 0. In other settings (e.g., multi-armed bandits, as we will see later in Section 3.2.2), the state is not considered, leading thus to a trajectory of actions and rewards only. Upper and lower case letters denote random variables and their particular realizations, respectively: we write X_t, A_t and Y_t, and similarly x_t, a_t and y_t. Also define the history H_t (also known as the filtration F_t) as all the information available at time t prior to the agent's decision, i.e., H_t ≐ (X_0, A_0, Y_1, X_1, A_1, Y_2, . . . , A_{t−1}, Y_t, X_t); similarly h_t. The space of histories at stage t, denoted by H_t, is therefore the product of the spaces of the individual elements of H_t, i.e., H_t ≐ X_0 × A_0 × Y_1 × · · · × A_{t−1} × Y_t × X_t. Note that, by definition, H_0 = X_0. We assume that these longitudinal histories are sampled independently according to a distribution P^{Full-RL}_π, with superscript clarified later in Section 3.2, given by:

P^{Full-RL}_π(H) = p_0(X_0) ∏_{t≥0} π_t(A_t | H_t) p_{t+1}(X_{t+1}, Y_{t+1} | H_t, A_t),   (1)

where:
• p_0 is the initial probability distribution specifying the initial state X_0.
• π ≐ {π_t}_{t≥0} represents the exploration policy and it determines the sequence of actions generated throughout the decision-making process. More specifically, π_t maps histories of length t, h_t, to a probability distribution over the action space A_t, i.e., π_t(·|h_t). The conditioning symbol "|" in π_t(·|h_t) reminds us that the exploration policy defines a probability distribution over A_t for each h_t ∈ H_t.
Sometimes, the action A_t to take at each time step t is uniquely determined by the history; the policy is then simply a function of the form π_t : H_t → A_t, or equivalently π_t(h_t) = a_t. We call it a deterministic policy, in contrast with stochastic policies that determine actions probabilistically.
• {p_t}_{t≥1} are the unknown transition probability distributions and they completely characterize the dynamics of the environment. At each time t ∈ N, the transition probability p_t assigns to each state-action-reward sequence (x_0, a_0, y_1, . . . , y_{t−1}, x_{t−1}, a_{t−1}) = (h_{t−1}, a_{t−1}) of the trajectory up to time t − 1 a probability measure over X_t × Y_t.
At each time t, the transition probability distribution p_{t+1}(x_{t+1}, y_{t+1}|h_t, a_t) gives rise to: (i) the state-transition probability distribution p_{t+1}(x_{t+1}|h_t, a_t, y_{t+1}), i.e., the probability of moving to state x_{t+1} conditioning on the observed history h_t, the currently selected action a_t and the reward received y_{t+1}; and (ii) the immediate reward distribution p_{t+1}(y_{t+1}|h_t, a_t, x_{t+1}), which specifies the reward Y_{t+1} after transitioning to x_{t+1} with action a_t. Generally, in DTRs, the immediate reward Y_{t+1} is conceptualized as a known function (rather than a distribution) of the history H_t, the currently selected action A_t and the new state X_{t+1}; we thus adapt our notation to Y_{t+1} = Y_{t+1}(H_t, A_t, X_{t+1}). To give a concrete example, one can think of a dose-finding trial, where the level of toxicity is one of the covariates (or state variables), among others. In this setting, at each time t, the immediate reward Y_{t+1} of a patient with history H_t and administered dose A_t could be defined as a binary variable assuming value −1 if a toxicity level (X_{t+1}) higher than a certain value α is observed, and 0 otherwise. The cumulative sum (often time-discounted) of immediate rewards is known as the return, say R_t, and is given by

R_t ≐ Σ_{τ≥t} γ^{τ−t} Y_{τ+1},   for t ∈ N.   (2)

The discount rate γ ∈ [0, 1] determines the current value of future rewards: a reward received τ time steps in the future is worth only γ^τ times what it would be worth if it were received immediately. If γ < 1, the potentially infinite sum in Equation (2) has a finite value as long as the reward sequence {Y_{τ+1}}_{τ≥t} is bounded. If γ = 0, the agent is myopic in being concerned only with maximizing immediate rewards, i.e., R_t = Y_{t+1}; this is often the case of MABs (see Section 3.2.2). If γ = 1, the return is undiscounted and it is well defined (finite) as long as the time horizon is finite, i.e., t ∈ [0, T], with T < ∞ (Sutton and Barto, 2018). If T is fixed and known in advance, e.g., in clinical trials, the agent faces a finite-horizon problem; if T is not pre-specified and can be arbitrarily large (the typical case of EHRs), but finite, we call it an indefinite-horizon problem; finally, we use the term infinite-horizon problem when T = ∞. In this case, we need γ ∈ (0, 1) to ensure a well defined return. Solving an RL task means, roughly, learning an optimal way of choosing the set of actions, or learning an optimal policy, so as to maximize the expected future return. However, in many decision problems, the target policy or estimation policy we want to learn about, say d, might be different from the exploration policy π that generated the data.
This may happen when we want to estimate an optimal policy without interacting with the environment but using some already collected data (e.g., observational EHR data), for which a certain exploration policy, often unknown, was used. We refer to this as offline RL, as opposed to online RL, where the agent interacts with the environment to collect the samples and iteratively improve the policy. Taking into account this potential policy change, the RL problem at any time t ∈ N is to find an optimal policy d* such that

d* ∈ argmax_d E_d[R_t],   (3)

where the expectation is taken with respect to a trajectory distribution analogous to (1), say P_d, in which the fixed exploration policy π that generated the data is replaced by an arbitrary policy d that we wish to evaluate. For ease of notation, we write E_d for E_{d_t}: the time index is already incorporated in the argument. For estimating optimal policies, various methods have been developed so far in the RL literature: see Sutton and Barto (2018) and Sugiyama (2015) for an overview. A traditional approach is through value functions, which define a partial ordering over policies, with insightful information on the optimal ones. In fact, optimal policies share the same (optimal) value function. For this reason, efficiently estimating the value function is one of the most important components of almost all RL algorithms, and it occupies a central place in the medical decision-making paradigm. In DTRs, for example, evaluating the value function of a treatment regime is equivalent to evaluating the average outcome if the estimated treatment rule were to be applied to a population with the same characteristics (state or history) in the future (Zhu et al., 2019). Comparing estimated value functions of different candidate treatment strategies offers a way to understand which strategy may offer the greatest expected outcome. There are two kinds of value functions: i) state-value or simply value functions, representing how good it is for an agent to be in a given state, and ii) action-value functions, indicating how good it is for the agent to perform a given action in a given state. More specifically, the stage-t state-value function of policy d_t gives us the expected return starting from history h_t at stage t and following policy d_t afterward. Formally, we denote it by V_t ≐ V^{d_t} : H_t → R and define it as

V_t(h_t) ≐ E_{d_t}[R_t | H_t = h_t],   (4)

while for the terminal stage, if any, the state-value function is taken to be 0. Similarly, the stage-t action-value function for policy d_t is the expected return when starting from history h_t at stage t, taking an action a_t and following the policy d_t thereafter. Denoting it by Q_t ≐ Q^{d_t} : H_t × A_t → R, where "Q" stands for "Quality" and introduces the names Q-value and Q-function, we have

Q_t(h_t, a_t) ≐ E_{d_t}[R_t | H_t = h_t, A_t = a_t].   (5)

The optimal state-value function V*_t ≐ V^{d*_t} yields the largest expected return for each history over all policies d_t, and the optimal Q-function Q*_t ≐ Q^{d*_t} yields the largest expected return for each history-action pair over all policies d_t, i.e.,

V*_t(h_t) ≐ max_{d_t} V^{d_t}(h_t),   ∀h_t ∈ H_t,   (6)
Q*_t(h_t, a_t) ≐ max_{d_t} Q^{d_t}(h_t, a_t),   ∀h_t ∈ H_t, ∀a_t ∈ A_t.   (7)

Because an optimal state-value function is optimal for any fixed h_t ∈ H_t, it follows that the optimal policy at time t must satisfy

d*_t(h_t) ∈ argmax_{a_t ∈ A_t} Q*_t(h_t, a_t),   ∀h_t ∈ H_t.   (8)

A fundamental property of value functions used throughout RL is that they satisfy particular recursive relationships, known as Bellman equations. For any policy d, the following consistency condition, expressing the relationship between the value of a state and the values of successor states, holds:

V^{d}_t(h_t) = E_d[Y_{t+1} + γ V^{d}_{t+1}(H_{t+1}) | H_t = h_t],   ∀h_t ∈ H_t, ∀t ∈ N.
Based on this property and (6)-(7), at each time t, ∀h_t ∈ H_t and ∀a_t ∈ A_t, with discrete state and action spaces, the following rules, known as the Bellman optimality equations (Bellman, 1957), are satisfied:

V*_t(h_t) = max_{a_t ∈ A_t} E[Y_{t+1} + γ V*_{t+1}(H_{t+1}) | H_t = h_t, A_t = a_t],
Q*_t(h_t, a_t) = E[Y_{t+1} + γ max_{a_{t+1} ∈ A_{t+1}} Q*_{t+1}(H_{t+1}, a_{t+1}) | H_t = h_t, A_t = a_t].

Here, the expectation is taken with respect to the transition distribution p_{t+1} only, which does not depend on the policy, thus the subscript d can be omitted. This property allows estimation of (optimal) value functions recursively, from T backwards in time. In finite-horizon dynamic programming (DP), this technique is known as backward induction, and represents one of the main methods for solving the Bellman equation, also referred to as the DP equation or optimality equation. In infinite- and indefinite-horizon problems, using traditional backward induction is impossible, given the impossibility of extrapolating beyond the time horizon in the observed data. To overcome this issue, alternative methods and additional assumptions (e.g., discounting and boundedness of rewards) are typically taken into account. Under these assumptions, authors have proposed, for example, to focus on time-homogeneous Markov processes for eliminating the dependence of value functions on t (see e.g., Ertefaie and Strawderman, 2018; Luckett et al., 2020), or to revisit the Bellman optimality equation. The RL problem can be posed in a variety of different ways, depending on the assumptions about the level of knowledge initially available to the agent. The framework is abstract and flexible and can be applied to many different (sequential) problems, by specifically characterizing the state and action spaces, the reward function, and other general domain (or environment) aspects, such as the time horizon or the dynamics of the process. The general framework introduced in Section 3.1 does not make any simplifying assumptions on the dependency between rewards, actions and states: by carrying over all the available history from the starting time, it considers a full dependency between them. We name this framework full reinforcement learning (full-RL). Often, specific domains of application may have an underlying theory about the potential relationships between the key elements of an RL problem. To illustrate, consider a hospital admission scheduling problem (Kolesar, 1970), in which the decision (or the action) is represented by the number of daily admissions. In order to determine the optimal action, one may need to know the current (or at a certain time, e.g., daily) number of beds occupied, but neither the number of beds occupied at all the previous decision points nor the set of all the previous actions. In other words, one may ignore the overall history and consider only the current state in the decision-making process. Alternatively, in some applied problems (e.g., indefinite-horizon problems), a full-RL formalization may be infeasible and/or intractable for both optimization and inference purposes, thus requiring some form of simplification in the distribution of the longitudinal histories. In JITAIs, for instance, the "just-in-time" nature of decision-making requires that the underlying decision rule be applied in the moment, without any significant computational time costs. Common examples of specific formalizations of an RL problem include Markov decision processes (MDPs) and multi-armed bandit (MAB) or contextual MAB problems.
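To make the backward-induction recursion above concrete, the following minimal sketch solves a small synthetic finite-horizon problem with discrete states and actions and a known transition model. Everything here (the toy dynamics, the tabular representation) is an illustrative assumption: in practice the transition distributions p_{t+1} are unknown and must be estimated from data, which is precisely what the methods reviewed later address.

```python
import numpy as np

# Toy finite-horizon problem: S discrete states, K actions, horizon T.
# P[t, s, a, s'] is a transition probability and R[t, s, a] an expected
# reward; both are synthetic stand-ins for the unknown p_{t+1} in the text.
rng = np.random.default_rng(0)
T, S, K, gamma = 3, 4, 2, 1.0                     # undiscounted, finite horizon
P = rng.dirichlet(np.ones(S), size=(T, S, K))     # each row sums to 1
R = rng.normal(size=(T, S, K))

V = np.zeros((T + 1, S))                          # terminal value is 0
Q = np.zeros((T, S, K))
policy = np.zeros((T, S), dtype=int)

# Bellman optimality recursion, solved backwards from t = T-1 to t = 0.
for t in reversed(range(T)):
    Q[t] = R[t] + gamma * P[t] @ V[t + 1]         # Q*_t(s, a)
    V[t] = Q[t].max(axis=1)                       # V*_t(s) = max_a Q*_t(s, a)
    policy[t] = Q[t].argmax(axis=1)               # d*_t(s), as in (8)

print(policy)  # optimal action for each (stage, state) pair
```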
While we discuss the MAB problem as a subclass of, or a specific way of formalizing, the RL problem (as in Sutton and Barto, 2018), we want to point out that key researchers in the domain (see e.g., Lattimore and Szepesvári, 2020) distinguish between the two; according to them, RL is mostly associated with ML, while MABs are associated with mathematics. One driver of this choice may be related to the major focus and attention on theoretical guarantees, e.g., optimal regret bounds, that MAB algorithms are expected to satisfy. In what follows, we illustrate these two specific formalizations in more depth, starting with MDPs, the main framework in indefinite-horizon DTR problems. A graphical illustration of the different settings is given in FIG. 5. [FIG. 5. Schematic comparison of the frameworks: in full-RL, X_t ∼ p_t(X_t | A_{t−1}, H_{t−1}), A_t ∼ π_t(A_t | H_t) and Y_{t+1} = Y_{t+1}(A_t, H_t, X_{t+1}); in MDP-RL, X_t ∼ p_t(X_t | X_{t−1}, A_{t−1}), A_t ∼ π_t(A_t | X_t) and Y_{t+1} = Y_{t+1}(A_t, X_t, X_{t+1}); in MABs, X_t ∼ p_t(X_t) and A_t ∼ π_t(A_t | X_t).] An MDP is a stochastic control process used to define an environment's dynamics and to model the interaction between the agent and the controlled environment. It provides a mathematical framework for modeling decision-making in situations where rewards are partly random and partly under the control of a decision-maker (Puterman, 2014), and it is the most common setting assumed in RL (Van Otterlo and Wiering, 2012). What distinguishes MDP-based RL (MDP-RL) from the full-RL framework is the environment's memory-less characteristic, which informs the agent about its transition probabilities and guides the decision-making process. More specifically, assuming that the current state X_t contains all the information from the past history H_{t−1} (including also the current reward Y_t) that is meaningful to predict the future allows ignoring all the past histories when modelling next states and rewards. This property, known as the Markov property, leads to a finite-size representation of the past, simplifying the trajectory distribution in (1) as follows:

P^{MDP-RL}_π(H) = p_0(X_0) ∏_{t≥0} π_t(A_t | X_t) p_{t+1}(X_{t+1}, Y_{t+1} | X_t, A_t),

and simplifying, thus, the entire optimization procedure required for computing the optimal policy as reported in (3) or (8). Note that, under the Markov property, the agent's decisions can be entirely determined based on the current information only, as the latter fully determines the environment's transition-probability distribution. When the transition probabilities {p_{t+1}}_{t≥0} are also time independent, i.e., p_{t+1} = p, ∀t ≥ 0, the process is called a time-homogeneous or stationary MDP. In light of this additional assumption, states, rewards and actions are now time independent, given the previous stage information. In the context of DTRs as well as JITAIs, time-homogeneous MDPs were proposed in indefinite-time horizon settings, as they simplify the problem by working with time-independent quantities, which do not require a backward induction strategy (see Section 4.2). While both full-RL and MDP-RL are typically formulated as problems with states, actions, rewards, and transition rules that depend on previous states, an exception is made for MABs, whose original formulation can be viewed as a stateless variant of RL (Bouneffouf et al., 2020). In a typical MAB problem, actions and rewards are not associated with states, or they are assumed to depend only on the current state, enabling in this way faster online learning.
MAB problems, often identified as a special subclass of RL (Sutton and Barto, 2018; Bouneffouf et al., 2020), have a long history in statistics. They were introduced in biostatistics by Thompson (1933) and extensively studied under the heading of sequential design of experiments (Robbins, 1952; Berry and Fristedt, 1985; Lai, 1987). Generally speaking, the MAB problem (also called the K-armed bandit problem) is a problem in which a limited set of resources (e.g., a sample of individuals) must be allocated between competing choices in order to maximize the total expected reward over time. Each of the K choices (i.e., arms or actions) provides a different reward, whose probability distribution is specific to that choice. If one knew the expected reward (or value) of each action, then it would be trivial to solve the bandit problem: one would always select the action with the highest value. However, as this information is only partially gained for the selected actions, at each decision time t the agent must trade off between optimizing its decisions based on the knowledge acquired up to time t (exploitation) and acquiring new knowledge about the expected payoffs of the other actions (exploration). MAB strategies were originally proposed for solving stateless problems, in which the reward depends uniquely on actions. Subsequently, a "stateful" variant of MABs, named the contextual MAB (C-MAB), in which actions are associated with some signal, or context, was introduced. However, compared to full-RL and MDP-RL, in contextual MABs, actions do not have any effect on the next states. In addition, generally, there are no transition rules from one state to another in subsequent times, implying that states, actions, and rewards can be treated as a set of separate events within time. The most typical assumption is that contexts {X_t}_{t∈N} are independent and identically distributed (i.i.d.) with some fixed, but unknown, distribution. This means that action A_t at time t has an in-the-moment effect on the proximal reward Y_{t+1} at time t + 1, but not on the distribution of future rewards {Y_τ}_{τ≥t+2}, for which the i.i.d. property holds as well. Under this assumption, one can be completely myopic and ignore the effect of an action on the distant future in searching for a good policy. This problem is better known as the stochastic MAB, in contrast with the adversarial MAB, in which no independence assumptions on the sequence of rewards are made. In stochastic contextual MABs, the trajectory distribution is simplified as follows:

P^{C-MAB}_π(H) = ∏_{t≥0} p(X_t) π_t(A_t | X_t) p(Y_{t+1} | X_t, A_t),

with a further reduction in the context-free MAB problem as follows:

P^{MAB}_π = ∏_{t≥0} π_t(A_t) p(Y_{t+1} | A_t).

Note that, since the effect of an action in the stochastic MAB is in-the-moment, the bandit problem is formally equivalent to a one-step/state MDP, in which the states' progression is not taken into account. MABs, thus, provide a simplified way (compared to both MDP-RL and full-RL) of formalizing the relationships between RL's components in time. A graphical summary of the different RL frameworks discussed here is given in FIG. 5. As in the general RL problem, the goal in a MAB problem is to select the optimal arm at each time t so as to maximize the expected return, alternatively (and with a slightly different nuance) expressed in the bandit literature in terms of minimizing the total regret. Indeed, in (online) real-world problems, until we can identify the best (unique) arm, we need to make repeated trials by pulling the different arms.
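The exploration-exploitation trade-off just described can be illustrated with a small simulation. The sketch below runs Thompson sampling (Thompson, 1933) on a synthetic context-free Bernoulli bandit with invented arm means, and tracks the gap between the chosen and the (here known, in practice unknown) best arm; this accumulated gap is exactly the regret discussed next.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.30, 0.45, 0.60])   # invented Bernoulli arm means
K, T = len(true_means), 2000
alpha, beta = np.ones(K), np.ones(K)        # Beta(1, 1) prior for each arm
regret = 0.0

for t in range(T):
    # Exploration/exploitation via posterior sampling (Thompson sampling):
    # draw one sample per arm from its Beta posterior and play the argmax.
    a = int(np.argmax(rng.beta(alpha, beta)))
    y = rng.binomial(1, true_means[a])      # observed proximal reward
    alpha[a] += y                           # conjugate posterior update
    beta[a] += 1 - y
    regret += true_means.max() - true_means[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
print("posterior means:", alpha / (alpha + beta))
```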
The loss that we incur during this learning phase (time/rounds spent learning the best arm) represents what is called regret, i.e., how much we regret not picking the best arm. More formally, denoting with A*_t the optimal arm at round t, we define the immediate regret Δ(A_t) of action A_t as the gap between the expected reward of the optimal arm A*_t and the expected reward of the ultimately chosen arm A_t, i.e.,

Δ(A_t) ≐ E[Y_{t+1} | A*_t] − E[Y_{t+1} | A_t].

Given a (random) horizon T, the goal of the learner is to minimize the total regret given by

Reg(T) ≐ Σ_{t=0}^{T−1} Δ(A_t).

Note that the agent does not know ahead of time how many rounds T are to be played. The goal is thus to perform well not only at the final stage T, but also during the learning phase. For example, in a dose-finding problem such as the one mentioned in Section 3.2.1, the aim may be not only to minimize the sum of toxicities over time, but also to ensure that, at each time t, these toxicities have a proper upper bound that limits extremely harmful adverse events. For this reason, as we will see later in Section 5, theoretical work on regret bounds occupies a central place in the bandit literature. So far, we have introduced RL as a general framework for sequential decision-making problems, and discussed its applicability and characterization in illustrative examples of interest within the biomedical AIs area. Before diving deep into the rich literature of existing RL methods for building (optimized) AIs, we provide the reader with a joint overview of the different problems, which notably share the same key elements (see TABLE 2) and a common optimization objective. As such, they can be unified under a unique formal framework and solved with techniques developed under the RL paradigm. Note that this does not imply that RL is the unique option for building AIs-as we will see in Section 4, a variety of other traditional statistical solutions exist-or that the RL framework suffices for ensuring valid solutions. Indeed, for developing AIs, one often needs to assess the causal relationship between an intervention and the outcome, thus requiring an adequate causal inference framework, which will be discussed shortly. TABLE 2 serves as a table of equivalence between the different terminologies of reference in each setting, with a unified notation adopted from the general RL framework. Note that, while we report only the most common terminology employed in each setting, lexical borrowing is widely used across the different theoretical and applied domains. To illustrate, the term "treatment policy", or just "policy", is often used in place of "treatment regime" in the DTR literature, and "arm" is a common term in both DTRs and JITAIs. Also note that, in general, the terminology adopted in a specific application is guided by the RL method and framework used in that application; see e.g., the similarity between terms used in JITAIs and MABs such as "contextual variables" and "context" (i.e., the state of the environment). Both contextual and tailoring variables represent the set of baseline and time-varying information that is used to personalize the decision-making. Alternative terms such as covariates or features (that we use with a slightly different meaning, as we discuss in Section 5.1) are also common.
We anticipate that most (if not all) of the methods for constructing JITAIs would generally belong to the MAB class, although the applied literature commonly refers to them with the generic "reinforcement learning" name (see e.g., Yom-Tov et al., 2017; Liao et al., 2020; Figueroa et al., 2020). In DTRs, the prevalent class of methods is full-RL, followed by MDP-RL, proposed specifically for indefinite-horizon (e.g., EHR-based) DTR problems. In fact, the underlying theory of DTRs, characterized by potential delayed and/or carry-over effects of treatment over time, and the importance of the evolving history of a patient for predicting future outcomes, requires an accurate consideration of information from previous stages. Generally, the meaningful relationship between the different variables of a patient's history does not allow simplifying or ignoring the (state-)transition rules, making full-RL (and exceptionally MDP-RL) the ideal option. On the other hand, the behavioral theory of a momentary effect of an intervention on the proximal outcome (underlying mHealth applications) makes MAB a more suitable framework compared to full-RL and MDP-RL in this setting. In addition, the reduced computational burden (MAB strategies do not carry through all the historical information) allows them to be applied on a continuous-time basis, e.g., every hour, to efficiently construct JITAIs. In order to allow quantification of treatment effects, Neyman (1923) and Rubin (1974) laid the foundations of modern causal inference, based on the so-called potential outcomes or counterfactual framework. This framework provides a set of sufficient conditions for defining a quantitative causal effect, i.e., a causal estimand, and validating its estimation. Briefly, with potential outcomes we refer to the set of all possible values of a status or outcome variable that would be achieved, if perhaps contrary to fact, the patient had been assigned to different treatments. In a simple one-stage randomized controlled trial (RCT) in which subjects can receive either treatment a or a', the set of (unobserved) potential outcomes for an individual with baseline information X_0 is given by {Y^a_1, Y^{a'}_1}. In order to define what we mean by a causal effect, for each individual we assume the existence of the potential outcomes, Y^a_1 and Y^{a'}_1, corresponding to what value the outcome would take if we did assign a or a', respectively. Potential outcomes can be compared to understand the causal effect, or individual-level causal parameter, defined as Y^a_1 − Y^{a'}_1, and to find the regime that leads to the highest expected outcome. However, since we cannot observe all the potential outcomes on the same patient, typically population-level causal parameters (such as E[Y^a_1 − Y^{a'}_1]) are considered instead. In order to connect the potential outcomes with the observed data, and ensure that Ê[Y_1 | A = a] is an unbiased estimate of E[Y^a_1], the following assumptions about the assignment mechanism must hold.
1. Stable unit treatment value assumption (SUTVA), which assumes that each participant's potential outcome is not influenced by the treatment applied to other participants (Rubin, 1978, 1980). This assumption connects the potential outcomes to the observed data such that, for each t, Y^{ā_t}_{t+1} = Y_{t+1} when regime ā_t (the treatment sequence up to time t) is actually followed. This agreement between the potential outcomes under the observed treatment and the observed data is also known as the axiom of consistency.
2.
No unmeasured confounders (NUC), which states that, conditional on the patient's history H_t up to time t, the treatment assignment A_t at time t is independent of the future potential outcomes of the individual (Robins, 1997). That is, for any possible regime ā,

A_t ⊥ {X^{ā}_{t+1}, Y^{ā}_{t+1}, . . . , X^{ā}_T, Y^{ā}_{T+1}} | H_t,   ∀t.

This assumption always holds under either complete or sequential randomization, including SMART designs, but must be evaluated on subject-matter grounds in observational studies.
3. Positivity, which defines the set of feasible regimes so that, for every covariate-treatment history up to time t that has a positive probability of being observed, there must be a positive probability that the corresponding treatment dictated by the treatment regime will be observed (Robins, 1994). Formally, if we denote with π the probability distribution of actions given the history, positivity requires that

π_t(d_t(h_t) | h_t) > 0 whenever P(H_t = h_t) > 0,   ∀t.

That is, feasibility requires some subjects to follow regime d to guarantee non-parametric estimation of its performance. Please note that the notation "π" is not arbitrary: it translates the notion of "exploration policy" used for generating the action process, and in the case of a randomized trial it consists of the randomization probabilities.
Under these assumptions, the conditional distributions of the observed data are the same as the conditional distributions of the potential outcomes. It follows that an optimal AI may be obtained using the observed data. An alternative theoretical framework for estimating causal effects is represented by causal graphical models (CGMs; Spirtes et al., 2000; Pearl, 2009) that are used to encode researchers' a priori assumptions about the data-generating structure. This causal structure can be visually represented by directed acyclic graphs (DAGs), i.e., mathematical objects that consist of nodes (variables) and directed edges (functions) with precise causal meanings. To keep matters at their simplest, FIG. 6 depicts a DAG and its corresponding structural causal model (SCM), consisting of a set of exogenous variables (ε_A, ε_M, ε_Y) and a set of endogenous variables (A, M, Y), with relationships reflected in the edges. More specifically, it encodes the conditional (in)dependence structure among the set of variables of interest: A "directly causes" M, M "directly causes" Y, while A is considered an "indirect cause" of Y. The node M lies on the causal pathway between A and Y and is considered a mediator variable on the directed path, determining conditional independence between A and Y given M. For example, the DIAMANTE Study discussed in this work is based on a mediational theory according to which the intervention A may impact the clinical (depression) outcome Y indirectly through a proximal outcome M, which acts as a mediator. Also notice that the graphical representation given in FIG. 5 resembles that of a DAG. Beyond the alternative terminology and representation, Pearl (2009) shows that the fundamental concepts underlying the graphical perspective and the potential outcome perspective are equivalent, primarily because they both encode counterfactual causal states to define causality. In summary, causal inference provides a set of tools and principles that allows one to combine data and causal assumptions about the environment to reason about questions of a counterfactual nature. On a different tangent, RL is concerned with efficiently finding a policy that optimizes an objective function (e.g., reward or regret) in interactive and uncertain environments.
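Before turning to how the two frameworks can be combined, note that the SCM of FIG. 6 can be read generatively: each endogenous variable is a function of its parents and an exogenous noise term. The sketch below simulates such a mediation structure (intervention A, proximal mediator M, distal outcome Y); the functional forms and coefficients are arbitrary illustrative choices, not those of the DIAMANTE study.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Structural causal model A -> M -> Y with exogenous noises (eps_M, eps_Y).
A = rng.binomial(1, 0.5, size=n)        # randomized intervention
eps_M = rng.normal(size=n)
M = 0.8 * A + eps_M                     # proximal outcome (mediator)
eps_Y = rng.normal(size=n)
Y = 0.5 * M + eps_Y                     # distal outcome; no direct A -> Y edge

# A affects Y only through M: the indirect effect is 0.8 * 0.5 = 0.4,
# which the randomized difference in means recovers (up to sampling error).
print("E[Y | A=1] - E[Y | A=0] ≈", Y[A == 1].mean() - Y[A == 0].mean())
```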
While these two areas have evolved independently, each addressing different aspects of the same building block and with little interaction between them, disciplines such as AIs can be developed only under an integrated conceptual and theoretical umbrella. Recent attempts in the ML community have also worked in this direction, trying to introduce a unified framework called causal RL, which combines the causal graphical approach with the sample efficiency of RL (Zhang and Bareinboim, 2019). Methodology for constructing optimal DTRs, i.e., the ones that, if followed, yield the most favourable (typically long-term) mean outcome, is of considerable interest within the domain of precision medicine, and comprises a large body of research within theoretical and applied sciences (Chakraborty and Moodie, 2013; Laber et al., 2014c; Kosorok and Laber, 2019). While its relevance has been long documented within statistics and causal inference, recently it has been attracting a vast literature from computer science and engineering, due to the similarity between the mathematical formalization of a DTR and the RL framework. Perhaps due to the need to identify causal relationships, the study of DTRs originated in causal inference, with the pioneering works of Robins (1986, 1994, 1997). Over an extended period of time, the author introduced three basic approaches for finding optimal time-varying regimes in the presence of confounding variables: the parametric G-formula or G-computation (Robins, 1986), structural nested mean models (SNMMs) with the associated method of G-estimation (Robins, 1989, 1992, 1994), and marginal structural models (MSMs) with the associated method of inverse probability of treatment weighting (IPTW; Robins, 2000). In spite of their advantages, SNMMs and G-estimation have not become as popular as MSMs and IPTW methods. Possible reasons are discussed in Vansteelandt et al. (2014), who use the appellation "partially realized promise" in referring to SNMMs and G-estimation. A number of methods have then been proposed within statistics, including frequentist and Bayesian approaches (Thall et al., 2000, 2002, 2007; Lavori and Dawson, 2000). However, they all estimate the optimal DTR based on inferred distributions of the data-generation process via parametric models, and, as such, can easily suffer from model misspecification. The first semi-parametric method for DTRs was proposed by Murphy (2003), immediately followed by Robins (2004), who introduced two alternative approaches based on G-estimation and SNMMs. These methods use approximate dynamic programming (ADP), where "approximate" refers to the use of an approximation of the value or Q-function introduced in (5). Thus, they can be considered as the first prototypes of RL-based approaches in the DTR literature. RL methods represent an alternative approach to estimating DTRs that has gained popularity due to its success in addressing challenging sequential decision-making problems without the need to fully model the underlying generative distribution. The connection between statistics and RL, previously confined to the computer science and control theory literature, was bridged by Murphy (2005b), who proposed to estimate optimal DTRs with Q-learning (Watkins, 1989; Sutton and Barto, 2018).
Promptly, a large body of research embraced the use of Q-learning, combined with various parametric, semi-parametric, and non-parametric strategies (Murphy, 2005b; Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014; Laber et al., 2014a) for modelling the Q-function. Q-learning and the semi-parametric strategies of Murphy (2003) and Robins (2004) are indirect methods: optimal DTRs are indirectly obtained by first estimating an optimal objective function (typically, an expectation of the outcome variable such as the Q-function), and then deriving the associated (optimal) policy. In contrast, IPTW-based strategies (Robins, 2000; Murphy et al., 2001; Wang et al., 2012) seek optimal policies by directly searching, within a class of policies, for the one that maximises an objective (e.g., the expected return), without postulating an outcome model; they are regarded as direct methods. In the RL literature, direct and indirect methods are sometimes referred to as model-free and model-based algorithms (Atkeson and Santamaria, 1997; Sutton and Barto, 2018). However, more subtle classifications (see e.g., Guan et al., 2019; Sugiyama, 2015) tend to make a clearer separation between the two categories, in the sense that direct/indirect refer to the learning process, while model-free/model-based refer to the modelling assumptions for the environment. In what follows, we review existing RL techniques for developing DTRs, covering both finite- and indefinite-horizon settings, and adopting the direct vs indirect taxonomy, in line with the current DTR literature (Chakraborty and Murphy, 2014). We emphasize that most of the current work in DTRs deals with finite-horizon problems and offline learning procedures that assume access to a collection of observed trajectories. Only recently has the indefinite-horizon setting, particularly suitable for chronic diseases where the number of stages can be arbitrarily large, been addressed in the DTR literature, and it remains relatively understudied. We note that in this work, we use the term "indefinite", as opposed to "infinite", to account for the finite life expectancy of an individual. Finite-horizon DTR problems are designed to identify optimal treatment policies d* = {d_t*}_{t=0,...,T} over a fixed period of time T, with T < ∞ and known. Learning methods are typically based on finite observed data trajectories from a sample of, say, N patients, together with causal assumptions about the data. Recall that each trajectory i = 1, ..., N has the form (X_0, A_0, Y_1, ..., X_T, A_T, Y_{T+1}), with X_0 the pre-treatment information, X_1, ..., X_T the evolving information, A_0, ..., A_T the assigned treatments, and Y_1, ..., Y_{T+1} the intermediate and final outcomes. The problem conforms to what is known as offline RL in computer science. Throughout this section, we consider deterministic policies, which map histories h directly into actions or decisions, i.e., d(h) = a. As discussed before, indirect methods involve learning intermediate objective functions, typically value functions, that lead to optimal policies. In finite-horizon problems, such methods are mainly based on DP or ADP iterative procedures, which include Q-learning (Murphy, 2005b), with the Q-function as objective, and A-learning (Robins, 2004; Murphy, 2003; Blatt et al., 2004), which focuses on contrasts of conditional mean outcomes. We discuss them later in this section.
Traditional statistical likelihood-based methods (Thall et al., 2000, 2002), including the parametric G-computation (Robins, 1986) and Bayesian methods (Thall et al., 2007; Arjas and Saarela, 2010; Zajonc, 2012; Xu et al., 2016), also fall into this category. We point to Tsiatis et al. (2019) and Vansteelandt et al. (2014) for readers interested in these approaches, and to Murray et al. (2018) for an augmentation of the Bayesian setting with the flexibility of novel ADP approaches that overcome some of the drawbacks of likelihood-based methods (e.g., the practical limitation of modelling the full joint data trajectory distribution). Q-learning with function approximation. In Section 3 we introduced value functions (see (4)-(5)), and showed that optimal values can be obtained by iteratively solving the Bellman optimality equations in (10)-(11) that they satisfy. In finite-horizon DP problems the procedure is also known as backward induction. However, the iterative process may be memory and computationally intensive, especially for large state and action spaces. Furthermore, traditional DP procedures assume an underlying model for the environment, which is often practically unknown due to unknown transition probability distributions. Q-learning (Watkins, 1989), either tabular (see Supplementary Material B.1) or, more relevantly here, with function approximation (FA), offers a powerful and scalable tool to overcome both the modelling requirements and the computational burden of traditional DP-based RL methods, and constitutes the core of modern RL. Q-learning with FA involves, first, assuming an approximation space for each stage-t Q-function in (5), e.g., Q_t ≐ {Q_t(h_t, a_t; θ_t): θ_t ∈ Θ_t}, with parameter space Θ_t a subset of a Euclidean space, and then estimating the optimal stage-t Q-functions Q_t* backwards in time for t = T, T − 1, ..., 0 (Bather, 2000). According to (8), an estimate of the optimal regime d̂* = (d̂_0*(x_0), d̂_1*(h_1), ..., d̂_T*(h_T)) is finally obtained from the associated optimal Q-function estimates as d̂_t*(h_t) = arg max_{a_t ∈ A_t} Q_t(h_t, a_t; θ̂_t), for t = 0, ..., T. The general iterative procedure is illustrated in Algorithm 1, while a more specific implementation with linear regression is reported in Supplementary material B.2. Notice that, since the Q-function is a conditional expectation, linear regression represents a natural modeling approach. However, it is important to recognize that the estimated regime d̂* may not be a consistent estimator of the true optimal regime d* unless all the models for the Q-functions are correctly specified. Given that a linear regression model may be quite simple (due to, e.g., its linearity) and prone to misspecification, more sophisticated FA alternatives have been considered, such as support vector regression (SVR) and extremely randomized trees (ERT; Zhao et al., 2009), or deep neural networks (DNNs; see e.g., Atan et al., 2018). Another strategy that may offer robustness to Q-function misspecification is A-learning (Robins, 2004; Murphy, 2003; Blatt et al., 2004), discussed next.
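Before turning to A-learning, the following is a minimal sketch of the backward-recursion Q-learning procedure just described, using a linear working model with a treatment-covariate interaction on simulated two-stage data. The data-generating process, variable names, and model form are hypothetical and chosen purely for illustration; they are not taken from any of the cited works.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
N = 5000

# Hypothetical two-stage trajectories (X0, A0, X1, A1, Y2) with binary actions.
X0 = rng.normal(size=N)
A0 = rng.binomial(1, 0.5, size=N)
X1 = 0.5 * X0 + 0.3 * A0 + rng.normal(size=N)
A1 = rng.binomial(1, 0.5, size=N)
# Final outcome: A1 = 1 helps only when X1 is high, A0 = 1 only when X0 is high.
Y2 = X1 + A1 * (X1 - 0.2) + A0 * (X0 - 0.1) + rng.normal(size=N)

def design(x, a):
    """Linear working model: intercept, covariate, treatment, and interaction."""
    return np.column_stack([np.ones_like(x), x, a, a * x])

# Stage 1: regress Y2 on (H1, A1); here H1 is summarized by X1.
q1 = LinearRegression(fit_intercept=False).fit(design(X1, A1), Y2)
# Pseudo-outcome: predicted value of acting optimally at stage 1.
v1 = np.maximum(q1.predict(design(X1, np.zeros(N))),
                q1.predict(design(X1, np.ones(N))))

# Stage 0: regress the stage-1 pseudo-outcome on (X0, A0).
q0 = LinearRegression(fit_intercept=False).fit(design(X0, A0), v1)

def d1_opt(x1):  # estimated stage-1 rule: treat if Q(.,1) > Q(.,0)
    return (q1.predict(design(x1, np.ones_like(x1)))
            > q1.predict(design(x1, np.zeros_like(x1)))).astype(int)

def d0_opt(x0):  # estimated stage-0 rule
    return (q0.predict(design(x0, np.ones_like(x0)))
            > q0.predict(design(x0, np.zeros_like(x0)))).astype(int)

print("estimated stage-1 rule treats when X1 exceeds about",
      X1[d1_opt(X1) == 1].min().round(2))
print("estimated stage-0 rule treats when X0 exceeds about",
      X0[d0_opt(X0) == 1].min().round(2))
```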
A-learning with function approximation. A-learning, where "A" stands for the "advantage" incurred if the optimal treatment were given relative to that actually received, describes a class of alternative methods to Q-learning, predicated on the fact that it is not necessary to specify the entire Q-function to estimate an optimal regime. Models can be posited only for the parts of the expectation involving contrasts among treatments, as opposed to modeling the conditional expectation itself as in Q-learning. Recalling that d* ≐ {d_t*}_{t=0,...,T} denotes the optimal DTR, and denoting with d_t* ≐ {d_τ*}_{τ=t,...,T} the optimal policy from t onwards, with d^ref ≐ {d_t^ref}_{t=0,...,T} a regime of reference against which comparisons are made, and with 0 the standard or placebo treatment, examples of contrasts include the optimal blip-to-reference, γ_t(h_t, a_t) = E[Y^{(ā_{t−1}, a_t, d_{t+1}*)} − Y^{(ā_{t−1}, d_t^ref, d_{t+1}*)} | H_t = h_t] (16), the optimal blip-to-zero, γ_t(h_t, a_t) = E[Y^{(ā_{t−1}, a_t, d_{t+1}*)} − Y^{(ā_{t−1}, 0, d_{t+1}*)} | H_t = h_t] (17), and the regret, μ_t(h_t, a_t) = E[Y^{(ā_{t−1}, d_t*)} − Y^{(ā_{t−1}, a_t, d_{t+1}*)} | H_t = h_t] (18), with Y^a the potential outcome associated with policy a. The optimal blip-to-reference in (16) and the optimal blip-to-zero in (17) evaluate the effect of removing an amount ("blip") of treatment at stage t on the subsequent mean outcome, when the optimal DTR d_{t+1}* is followed from t + 1 onwards: the "blips" are represented by the reference treatment d_t^ref and the 0 treatment, respectively. The regret in (18) evaluates the increase in the benefit-to-go we forego by selecting a_t rather than the optimal action d_t* at time t. While Robins (2004) advocates optimal blip functions and Murphy (2003) regrets, Moodie et al. (2007) demonstrated that they are mathematically equivalent. In addition, they both propose a SNMM parametrization for each of the t ∈ [0, T] conditional intermediate causal effects, or contrasts. Without loss of generality, models are of the form γ_t(h_t, a_t; ψ_t), with γ_t a known (T − t + 1)-dimensional function smooth in ψ_t, and such that γ_t(h_t, 0; ψ_t) = 0 and γ_t(h_t, a_t; 0) = 0, so that ψ_t = 0 encodes the null hypothesis of no treatment effect. However, it is important to note that the two authors propose different estimation techniques: Robins (2004) uses backward recursive G-estimation, while Murphy (2003) uses a technique known as iterative minimization of regrets (IMOR). We thus distinguish the two approaches as contrast-based A-learning and regret-based A-learning, and discuss them in more depth in Supplementary material B.3, in line with the work of Schulte et al. (2014), which we take as a guide for comparing A-learning with the widely used technique of Q-learning. Comparing A-learning and Q-learning, Schulte et al. (2014) showed that Q-learning is more efficient when all models are correctly specified and the propensity model required in A-learning is misspecified. If the Q-function is misspecified, A-learning outperforms Q-learning, while with both the propensity and Q-learning models misspecified, there is no general trend in efficiency of estimation across parameters that might recommend one method over the other. Deep Q-Network. The tremendous success achieved in recent years by RL has been largely enabled by the use of advanced FA techniques such as DNNs (Jonsson, 2019; Silver et al., 2017; Mnih et al., 2015). Enhancing Q-learning with DNNs for approximating the Q-function gives rise to the deep Q-network (DQN) algorithm (Mnih et al., 2015). Specifically, at a given time t, a DNN (see Goodfellow et al., 2016, for an overview of existing DNN architectures) is used to fit a model for the Q-function in a supervised way and estimate the optimal Q-value: histories {H_{t,i}}_{i=1,...,N} are given as input, and the predicted Q-values Q_t(H_t, a_t; Ŵ, b̂) associated with each individual action a_t ∈ A_t, say A_t = {a_1, ..., a_K}, are generated as output. W and b represent, respectively, the unknown weight and bias parameters characterizing a typical DNN (see e.g., the schematic of a feed-forward neural network (FFNN) in FIG. 8).
Once Q-value estimates are obtained with the DNN, the DQN algorithm proceeds with executing, in an emulator, an action according to an exploration scheme named ε-greedy (Sutton and Barto, 2018), which probabilistically chooses between the optimal action (i.e., the one with the highest estimated Q-value) and a random action. At the end of the execution sequence, first the Q-function is re-estimated based on the observed reward, and then the DNN parameters are updated using the newly observed Q-value estimates. A detailed explanation of the process is given in Algorithm 2. DNNs are regarded as a more flexible and scalable approach, particularly suitable for real-life complexity, high dimensionality, and high heterogeneity. Compared to their shallow counterparts, they enable automatic feature representation as well as the capture of complicated relationships. Within the DTR literature, such DQNs have been implemented with observational data for estimating optimal regimes for graft-versus-host disease and for sepsis treatment (Raghu et al., 2017a). Other recent approaches (Atan et al., 2018; Wang et al., 2020) considered more sophisticated DNN architectures able to perform well in settings that may introduce considerable biases due to sparse rewards or missing outcomes for certain treatments.
Algorithm 1: Q-learning with Function Approximation (Murphy, 2005b)
Input: time horizon T, sample of N trajectories, approximation space for the Q-functions Q_t ≐ {Q_t(h_t, a_t; θ_t): θ_t ∈ Θ_t}, for all t = 0, ..., T.
Initialization: the stage-(T+1) optimal Q-function is, for convenience, typically set to Q*_{T+1}(h_{T+1}, a_{T+1}; θ̂_{T+1}) = 0.
for t = 0, 1, 2, ..., T do
  Q-function parameter estimate: obtain the updated estimate θ̂_{T−t} backwards by minimizing a loss, e.g., the MSE (with P_N the empirical mean over the N trajectories);
  Optimal policy estimate: set the stage-(T−t) optimal regime estimate to the maximizer of the estimated stage-(T−t) optimal Q-function.
end for
FIG 8 (caption): Schematic of a feed-forward neural network (FFNN), where each neuron processes information forward from one layer to the next; information is non-linearly transformed according to unknown weight W^(l) and bias b^(l) parameters, l = 1, ..., L − 1, which are estimated through the FFNN.
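The full DQN loop is summarized in Algorithm 2 below. As a complement, the following minimal sketch isolates only the ε-greedy selection step described above, here with a generic shallow neural-network Q-function approximator from scikit-learn standing in for the deep architectures of Mnih et al. (2015); the data, feature construction, and network size are hypothetical and purely illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Fit a (shallow) neural-network Q-function on hypothetical batch data.
n, n_actions = 2000, 3
H = rng.normal(size=(n, 4))                      # history features
A = rng.integers(n_actions, size=n)              # actions taken
Y = H[:, 0] * (A == 1) + 0.5 * (A == 2) + rng.normal(scale=0.1, size=n)

X_design = np.column_stack([H, np.eye(n_actions)[A]])   # features + one-hot action
q_net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
q_net.fit(X_design, Y)

def epsilon_greedy(h, eps=0.1):
    """With probability eps pick a random action, otherwise the argmax of the fitted Q."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    q_values = [q_net.predict(np.column_stack([h[None, :],
                                               np.eye(n_actions)[a][None, :]]))[0]
                for a in range(n_actions)]
    return int(np.argmax(q_values))

h_new = rng.normal(size=4)
print("selected action for a new history:", epsilon_greedy(h_new, eps=0.1))
```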
Algorithm 2: Deep Q-Network (Mnih et al., 2015; Liu et al., 2017)
Input: pre-processed real data profiles based on random parameters (W_0, b_0). Train a DNN with labelled data D_0 and get estimates (Ŵ_0, b̂_0)
for t = 0, 1, 2, ..., T do
  ε-Greedy step: select a random action a_t with probability ε; otherwise a_t = arg max_{a∈A} Q_t(H_t, a; Ŵ_t, b̂_t);
  Execute a_t in the emulator and observe the state transition X_{t+1} and reward Y_{t+1}(H_t, a_t);
  Update the experience memory data: D_{t+1} = (D_t, Y_{t+1}, X_{t+1});
  Q-learning update: update the Q-function based on the Q-learning update.
end for
A general limitation of indirect methods such as Q- and A-learning, independently of the FA, is that the optimal DTRs are estimated in a two-step procedure: first, the Q-functions or the contrast/regret functions are estimated using the data, and then these are optimized to infer the optimal DTR. In the presence of high-dimensional information, even with flexible non-parametric techniques such as SVR and DNNs, it is possible that these conditional functions are poorly fitted, with the derived DTR far from optimal. Moreover, as demonstrated by Zhao et al. (2012), indirect approaches may not necessarily result in maximal long-term clinical benefit, motivating a shift to direct methods. Direct methods, also known in the RL literature as direct policy search methods (Ng et al., 2000), seek to maximize the return by learning the optimal policy directly, without involving estimation of intermediate quantities such as optimal Q- or C-functions. These methods typically do not assume models for conditional mean outcomes; thus, they are referred to as "non-parametric". However, they may consider a parametrization for the class of policies or regimes. In direct methods, a class of policies D, often indexed by a parameter, say ψ ∈ Ψ, is first pre-specified. Then, for each candidate regime d ∈ D, an estimate of the corresponding utility is obtained. The utility may be a summary of one outcome, such as percent days abstinent in an alcohol dependence study, or a composite outcome; for example, in Wang et al. (2012) the utility is a compound score numerically combining information on treatment efficacy, toxicity, and the risk of disease progression. Here, without loss of generality, we take the utility to be the policy's value (see (4)). The regime in D that maximizes the value function is then the estimated optimal DTR, i.e., d̂* ≐ arg max_{d∈D} V̂_d, or d̂* ≐ arg max_{ψ∈Ψ} V̂_{d_ψ} for parametric classes. A common example of a parametric class is the soft-max class D ≐ {π(a_k | x, ψ) = exp(−x^T ψ_k) / Σ_{j=1}^K exp(−x^T ψ_j): ψ ∈ Ψ, k = 1, ..., K}, where a_1, ..., a_K are the K possible treatments and ψ ≐ (ψ_1^T, ..., ψ_K^T) is the vector of parameters for the K treatments indexing the class of policies. Most of the statistical work in this area is based on the IPTW technique. It is used, for instance, in estimating MSMs (Robins, 2000; Orellana et al., 2010) or value functions (Zhang et al., 2012b, 2013); in classification-based frameworks, such as outcome weighted learning (OWL; Zhao et al., 2012, 2015; Liu et al., 2018); and in combination with ML approaches, such as decision trees (Tao et al., 2018). Inverse Probability of Treatment Weighting. In the case of randomized trials such as SMARTs, optimal DTRs could be inferred directly from observed data in the study, given their causal inference guarantees (see Section 3.3.1) and other strong statistical properties (see also Rosenberger et al., 2019). However, with observational data, to appropriately account for confounding we employ a general technique known as IPTW (Robins, 2000), which makes use of importance sampling to change the distribution under which the regime's value is computed. In doing that, assuming P_d absolutely continuous with respect to P_π (corresponding to the positivity assumption), we basically weight the outcomes according to the relative probability of occurring under the target policy d and the exploration policy π, i.e., w_{d,π} ≐ Π_{t=0}^T 1{A_t = d_t(H_t)} / π_t(A_t | H_t). To estimate V_d, the Monte Carlo (MC) estimator V̂_d = P_N[w_{d,π} Y], where P_N denotes the empirical average over the N trajectories and Y the (total) outcome, is generally employed. By the Strong Law of Large Numbers, the MC estimator is unbiased, but its variance is unbounded. To stabilize this estimator, the weights w_{d,π} are normalized by their sample mean, leading to the IPTW estimator: V̂_d^{IPTW} = P_N[w_{d,π} Y] / P_N[w_{d,π}]. The technique balances confounders across levels of treatment: the higher the probability π(A|X) of receiving a specific treatment conditional on confounders X, the lower the weight w_π = 1/π(A|X) given to the corresponding outcome Y. When π is known (e.g., in a SMART design), the IPTW estimator is consistent, but it can be highly variable due to the presence of the non-smooth indicator functions inside the weights.
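A minimal sketch of the (normalized) IPTW value estimator just described, for a single-stage regime under a known randomization policy; the data-generating process and the candidate regimes are hypothetical and chosen only to show the mechanics of the weighting.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical single-stage data from a known randomization policy pi.
X = rng.normal(size=n)
pi_A1 = np.full(n, 0.5)                     # randomization probability P(A = 1 | X)
A = rng.binomial(1, pi_A1)
Y = X + A * (X - 0.3) + rng.normal(size=n)  # treatment helps only when X > 0.3

def iptw_value(d, X, A, Y, pi_A1):
    """Normalized IPTW estimate of the value of a deterministic regime d(X)."""
    dX = d(X)
    propensity = np.where(A == 1, pi_A1, 1.0 - pi_A1)   # pi(A | X)
    w = (A == dX) / propensity                            # indicator / propensity
    return np.sum(w * Y) / np.sum(w)                      # weights normalized by their mean

print("value of 'treat if X > 0.3':",
      iptw_value(lambda x: (x > 0.3).astype(int), X, A, Y, pi_A1).round(3))
print("value of 'treat everyone'  :",
      iptw_value(lambda x: np.ones_like(x, dtype=int), X, A, Y, pi_A1).round(3))
```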
An alternative version, which integrates the properties of the IPTW estimator with those of regression (assuming models for both the propensity score and the conditional mean outcome), is the augmented inverse probability of treatment weighting (AIPTW) estimator (Zhang et al., 2012b). Its original version, reported in Supplementary material B.4, was designed for a single-stage treatment regime and thus does not make use of any RL strategy. However, it was subsequently adapted to two or more decision points (Zhang et al., 2013; Tao and Wang, 2017; Zhang and Zhang, 2018), where, with models posited for either Q-functions or C-functions, a Q-learning or contrast-based A-learning strategy was combined with the IPTW estimation, making it fully RL based. By requiring only one of the two models to be correctly specified, it enjoys a double robustness property, offering protection against model misspecification and comparable or superior performance relative to competing methods. IPTW represents a basis for other existing direct methods. For instance, it constitutes one of the most common approaches for estimating MSMs (Robins, 2000; Neugebauer et al., 2012), a powerful alternative to SNMMs for describing the causal effect of a treatment (hence "structural"), pertaining to population-average effects ("marginal" over covariates and other outcomes); see Supplementary material B.5 for further discussion. Furthermore, it plays a key role within the Outcome Weighted Learning (OWL) framework proposed by Zhang et al. (2012a) and Zhao et al. (2012), which we discuss below. Outcome Weighted Learning. As an alternative direct approach, Zhao et al. (2012) studied the DTR estimation problem as a weighted classification problem, with weights retrospectively determined from clinical outcomes (hence "Outcome Weighted"), and proposed to solve it with tools from ML (hence "Learning"). In the case of two treatments, expressed as A ∈ {−1, 1}, Qian and Murphy (2011) first showed that the problem can be formulated in terms of a weighted 0-1 loss in a weighted binary classification problem, where d* can be estimated as d̂* = arg min_d P_N[(Y / π(A|H)) 1{A ≠ d(H)}]. However, due to the discontinuous indicator function, Zhao et al. (2012) proposed to address the optimization problem with a convex surrogate loss function for the 0-1 loss, corresponding to the hinge loss in ML (Hastie et al., 2009). Considering that d(H) can be represented as sign(f(H)) for some suitable function f, the minimization problem is then expressed as min_f P_N[(Y / π(A|H)) φ(A f(H))] + λ_N ||f||², where λ_N is a tuning penalty parameter that can be chosen via cross-validation, and φ(x) ≐ max(1 − x, 0) is the hinge loss. Although the seminal work of Zhao et al. (2012) allows the use of different loss functions, the specific setting considered (non-negative rewards, single stage, binary treatments) opened a number of problems for its practical employment. Many of these issues have been addressed by the subsequent literature, which we illustrate in Supplementary material B.6.
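The weighted-classification view underlying OWL can be sketched as follows, using scikit-learn's SVC (whose linear-kernel formulation is based on the hinge loss) with outcome-based sample weights as a stand-in for the surrogate problem above; the data, propensities, and penalty choice are hypothetical, and the cross-validated tuning of λ_N is omitted.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 2000

# Hypothetical single-stage data: actions coded in {-1, +1}, randomized 50/50.
X = rng.normal(size=(n, 2))
A = rng.choice([-1, 1], size=n)
Y = 1.0 + X[:, 0] * A + rng.normal(scale=0.5, size=n)   # A = +1 helps when X[:, 0] > 0
Y = Y - Y.min() + 0.01                                   # OWL assumes non-negative rewards
pi = np.full(n, 0.5)                                     # known randomization probabilities

# Weighted classification: label = observed action, weight = reward / propensity.
owl = SVC(kernel="linear", C=1.0)
owl.fit(X, A, sample_weight=Y / pi)

d_hat = owl.predict(X)            # estimated rule, d(H) = sign(f(H))
print("agreement with the rule 'treat (+1) when X[:, 0] > 0':",
      np.mean(d_hat == np.sign(X[:, 0]).astype(int)).round(3))
```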
More recently, under both the direct weighted-classification and the indirect blip-function frameworks, Luedtke and colleagues (Luedtke and van der Laan, 2016; Montoya et al., 2021) introduced the SuperLearner ensemble method (Van der Laan et al., 2007) in the DTR arena. Rather than a priori selecting an estimation framework and algorithm, estimators from both frameworks (and a user-supplied library of candidate algorithms) are combined using a super-learning-based cross-validation selector that seeks to minimize an appropriate cross-validated risk. The full approach is described in Section 3.3 of Montoya et al. (2021). Tree-based methods. Similarly, by integrating tools from ML, first Laber and Zhao (2015), in the context of individualized (single-stage) treatment regimes, and then Tao et al. (2018) and Sun and Wang (2021) for dynamic regimes, introduced the tree-based approach (Breiman et al., 2001) for directly estimating optimal DTRs. The underlying idea of Tao et al. (2018) is, first, to define and estimate a purity (i.e., a target measure to be optimized), and then to improve the purity with a decision tree. Improvement is performed by splitting a parent node into child nodes repeatedly, and by choosing a split among all possible splits at each node so that the resulting child nodes are the purest (e.g., having the lowest misclassification rate). The mean outcome is used as the purity measure, and its estimation is carried out with the IPTW or AIPTW estimator, or alternatively with a kernel smoother in the case of continuous treatments. In contrast, Sun and Wang (2021) proposed a stochastic tree-based RL, which uses Bayesian additive regression trees and then stochastically constructs an optimal regime using a Markov chain Monte Carlo (MCMC) tree search algorithm. In the multiple-stage setting, estimation is implemented recursively using backward induction, starting from t = T + 1 and using the outcome Y_{T+1} directly. More recent attempts proposed a restricted version of tree-based RL (Speth and Wang, 2021), where the restriction refers to the set of covariates, or augmented tree-based learning with (unstructured) free-text clinical information extraction techniques, in addition to structured EHR data (Zhou et al., 2022). By combining the properties of tree-based learning (straightforward to understand and interpret, and capable of handling various types of data without distributional assumptions) with those of the AIPTW (a robust semi-parametric estimator), tree-based approaches are robust, efficient, and more interpretable and flexible than, e.g., OWL or DNNs. While a vast computer science literature exists on estimating optimal policies over an increasing time horizon (Szepesvári, 2010; Sugiyama, 2015), this scenario is rare in DTRs. By adopting backward induction, most of the existing methods cannot extrapolate beyond the time horizon in the observed data. However, for some chronic conditions, or those with very short time steps, including mHealth JITAIs (see Section 2), the time horizon is not definite, in the sense that treatment decisions are made continually throughout the life of the patient, with no fixed time point for the final treatment decision. To the best of our knowledge, only a limited number of statistical methodologies have been developed for the indefinite-horizon setting. Ertefaie and Strawderman (2018) proposed to estimate optimal regimes with a variant of Greedy Gradient Q-learning (GGQ). Luckett et al. (2020) proposed to search for an optimal policy over a pre-specified class of policies with V-learning, later extended by Xu et al. (2020) to latent space models. More recently, Zhou et al. (2021) proposed a minimax framework called proximal temporal consistency learning. We now review the first two methods, both developed under a time-homogeneous Markov behaviour; while the V-learning technique of Luckett et al. (2020) uses direct RL, the GGQ method of Ertefaie and Strawderman (2018) is indirect. Greedy Gradient Q-learning.
The first indefinite-time horizon extension in DTR estimation was introduced by Ertefaie and Strawderman (2018), under the time-homogeneous Markov assumption (see Section 3.2.1). Although not imposed by general DTR methods, this assumption simplifies inference by working with time-independent Q-functions and overcomes the need for backward induction. We adopt the notation of the previous sections, and introduce an absorbing state c representing, for instance, a death event. We assume that at each time t, covariates X_t take values in a finite state space X̃ ≐ X ∪ {c}, with X ∩ {c} = ∅. Let the action space A_x be finite and defined by the covariate information: A_x consists of 0 < m_x ≤ m treatments, with m the total number of treatments over the time horizon. For any t such that X_t = c, let A_x = A_c = {u}, where u stands for "undefined". Now, denoting with T̃ ≐ inf{t > 0 : X_t = c} a stopping time (e.g., death), individual trajectories are of the form (X_0, A_0, R_1, ..., X_{T̃−1}, A_{T̃−1}, R_{T̃}, X_{T̃}). Note that P(T̃ < ∞ | X_0, A_0) = 1, regardless of (X_0, A_0). Based on these specifications, the indefinite time-horizon stage-t action-value function for a regime π(h_t) = π(x_t) = π(x), for x ∈ X̃, is given by Q^π(x, a) ≐ E^π[Σ_{k≥0} R_{t+k+1} | X_t = x, A_t = a], a sum that is finite since the absorbing state is reached in finite time. We set Q*(c, a) = 0 as the return is 0 after an individual is lost to follow-up. For estimating an optimal DTR, Q-learning with FA (see Section 4.1.1) is proposed. Let Q(x, a; θ*) be a parametric model for Q*(x, a) indexed by θ* ∈ Θ ⊆ R^q, and suppose a linear model with interactions, i.e., Q(x, a; θ*) = θ*^T ψ(x, a), with ψ(x, a) a known feature vector summarizing the state and treatment pair. To ensure Q*(c, a) = 0, we also need ψ(c, a) = 0. Now, Bellman optimality suggests and motivates an unbiased estimating function for θ* of the temporal-difference form D(θ*) ≐ P_N[Σ_t ψ(X_t, A_t){R_{t+1} + max_{a'} Q(X_{t+1}, a'; θ*) − Q(X_t, A_t; θ*)}] (22). Note that the estimating function in (22) is a continuous, piecewise-linear function in θ* that is non-convex and not everywhere differentiable. Under regularity conditions, the authors show that any solution θ̂* can be equivalently defined as a minimizer of a quadratic form M̂(θ*) in the estimating function (where x^{⊗2} ≐ xx^T for any vector x). If θ̂* = arg min_{θ*∈Θ} M̂(θ*) is the unique solution, then Q̂*(x, a) = Q(x, a; θ̂*) and the corresponding optimal regime is given by π̂*(x) = arg max_{a∈A_x} Q(x, a; θ̂*). Under additional assumptions, the authors also proved that θ̂* is a consistent estimator of θ* and is asymptotically normally distributed. V-learning. The GGQ approach based on (22) involves a non-smooth max operator that makes estimation difficult without large amounts of data (Linn et al., 2017), and, since it depends directly on θ̂*, it requires modeling the transition probabilities. Motivated by an mHealth application, where policy estimation is continuously updated in real time as data accumulate (starting with small sample sizes), an alternative method, which directly maximizes estimated values over a class of policies, was proposed in Luckett et al. (2020). Under the same time-homogeneous MDP assumption, and provided the interchange of summation and integration is justified, the value function of a policy d is given by V^d(x) = E^d[Σ_{k≥0} γ^k Y_{t+k+1} | X_t = x], with γ ∈ [0, 1) a discount factor, and, for any function ψ defined on the state space X_t, it satisfies an importance-weighted variant of the Bellman optimality (Sutton and Barto, 2018) given by E[(d(A_t; X_t) / π(A_t; X_t)){Y_{t+1} + γ V^d(X_{t+1}) − V^d(X_t)} ψ(X_t)] = 0. Let now V(x; θ), with θ ∈ Θ ⊆ R^q, be a model for V(x). Assuming that V(x; θ) is differentiable everywhere in θ, for fixed x and d, and denoting ψ(x) ≐ ∇_θ V(x; θ), the proposed estimating function is Λ̂(θ) ≐ P_N[Σ_t (d(A_t; X_t) / π(A_t; X_t)){Y_{t+1} + γ V(X_{t+1}; θ) − V(X_t; θ)} ψ(X_t)]. Again, θ̂ can be obtained by minimizing M̂(θ)
≐ Λ̂(θ)^T Ŵ^{−1} Λ̂(θ) + λ P(θ), with Ŵ a positive definite matrix in R^{q×q}, λ a tuning parameter, and P: R^q → R^+ a penalty function. The estimated optimal regime d̂* is then the argmax of V(x; θ̂). Compared to GGQ, V-learning requires modeling the policy and the value function, but not the data-generating process. In addition, by directly maximizing the estimated value over a class of policies (see Luckett et al., 2020, for more details), it overcomes the issues of the non-smooth max operator in (22). The method is applicable over indefinite horizons and is suitable for both offline and online learning, the latter typical of mHealth studies. We conclude this major methodological review section with an illustrative summary of the aforementioned methods (see FIG. 9). The schematic is based on the general taxonomy adopted in this work of direct vs indirect methods, and on the modeling assumptions for the data trajectory (fully, semi- and non-parametric). FIG 9 (caption): Schematic of existing methods (in a temporal line) for developing DTRs in both finite and indefinite horizons; grey coloured blocks denote direct methods, based on IPTW, while white dotted blocks denote indirect approaches. JITAIs are carried out in dynamic environments where the context and needs of individual users can change rapidly (Nahum-Shani et al., 2015). Methodologies for optimizing JITAIs are required to perform almost continuous learning, with no definite time horizon, and to provide estimated optimal decisions online as data accumulate, often utilizing trajectories defined over very short time periods. Thus, existing methods for DTRs, mainly targeting a finite-time horizon problem and implemented offline with backward induction (as in Q-learning), are not directly applicable to JITAIs. Furthermore, by carrying forward an individual's entire history, they may not be feasible from a computational perspective. As discussed in Section 3, the standard approach for developing JITAIs is given by contextual MABs, an intermediate framework between MABs (Bubeck and Cesa-Bianchi, 2012; Auer et al., 2002b) and full-RL. With a few exceptions, contextual MAB algorithms applied in mHealth rely on two fundamental bandit strategies, originally applied in advertising: the Linear Upper Confidence Bound (LinUCB; Li et al., 2010; Chu et al., 2011) and the Linear Thompson Sampling (LinTS; Agrawal and Goyal, 2013). These were then transposed to the mHealth arena, motivating a number of extensions to better address the domain's needs and characteristics. Alternative methods include the Actor-Critic strategy (Lei et al., 2017) and other more full-RL oriented techniques (Zhou et al., 2018), which we discuss in Sections 5.3 and 5.4, respectively. LinUCB (Li et al., 2010; Chu et al., 2011), built on the context-free upper confidence bound (UCB; Auer et al., 2002a) method, is based on the assumption that the expected reward is a linear function of a context-action feature, say f(X_t, A_t) ∈ R^d. We consider features (constructed, e.g., via linear bases, polynomial or spline expansions; Marsh and Cormier, 2001) rather than a standard linear function as they may capture non-linearities in the data, yielding more predictive and explanatory power. The idea behind UCB and LinUCB is to perform efficient exploration by favouring arms for which a confident value has not yet been estimated, and avoiding arms which have shown a low reward with high confidence. This confidence is measured by the UCB of the expected reward value for each arm. More specifically, under the linearity assumption, i.e., E[Y_{t+1} | X_t, A_t] = f(X_t, A_t)^T μ, with μ ∈ R^d the unknown coefficient vector, the proposal is to estimate the UCB associated with arm a_t at time t as U_t(a_t) ≐ f(X_t, A_t = a_t)^T μ̂_t + α s_t(a_t), where the first part f(X_t, A_t = a_t)^T μ̂_t, with μ̂_t
≐ B_t^{−1} b_t an estimator of μ, reflects the current point estimate of the reward, while the second part s_t(a_t) ≐ √(f(X_t, A_t = a_t)^T B_t^{−1} f(X_t, A_t = a_t)) represents an indication of its uncertainty, i.e., the standard deviation. B_t^{−1} and b_t are analogous to the terms (X^T X)^{−1} and X^T Y, respectively, that constitute the OLS estimator in a standard linear regression model (E[Y | X] = X^T μ). Assuming ridge-penalized estimation with penalty parameter λ ≥ 0, these quantities are computed recursively at each time t, taking into account the previously explored arms: B_t = λ I_d + Σ_{τ=0}^{t−1} f(X_τ, ã_τ) f(X_τ, ã_τ)^T and b_t = Σ_{τ=0}^{t−1} f(X_τ, ã_τ) Y_{τ+1}, with ã_τ = arg max_{a_τ∈A} U_τ(a_τ), τ = 0, 1, ..., t−1, the estimated optimal arms of the previous rounds. Algorithm 3 provides a schematic of this approach.
Algorithm 3: LinUCB (Li et al., 2010; Chu et al., 2011)
for t = 0, 1, 2, ..., T do
  for a_t ∈ A do compute the UCB U_t(a_t) = f(X_t, A_t = a_t)^T μ̂_t + α s_t(a_t) end for
  Select arm ã_t = arg max_{a_t∈A} U_t(a_t) and get the associated reward Y_{t+1}(X_t, A_t = ã_t);
  Update B_t and b_t according to the best arm ã_t
end for
The tuning parameter α > 0 can be viewed as a generalization of the critical values typically used in confidence intervals. It controls the trade-off between exploration and exploitation: small values of α favor exploitation, while larger values of α favor exploration. Theoretical studies on LinUCB showed that it provides high-probability guarantees on the regret suffered by the learner. Several variations of LinUCB have been proposed in the bandit literature. These include: i) Linear Associative RL (LinREL) and SupLinREL (Auer, 2002), based on singular-value decomposition rather than ridge regression for obtaining an estimate of the UCB; ii) generalized linear model versions (UCB-GLM; Filippi et al., 2010; SupUCB-GLM; Li et al., 2017), which assume that the reward function can be written as a composition of a linear function and a link function; iii) non-parametric modeling of the reward function, such as Gaussian processes (GP-UCB; Srinivas et al., 2009, 2012; contextual GP-UCB; Krause and Ong, 2011) and kernel functions (SupKernelUCB; Valko et al., 2013); and iv) NeuralUCB, which leverages the representation power of DNNs and uses a neural network-based random feature mapping to construct the UCB for the reward.
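A minimal sketch of the basic ridge-regression LinUCB update described above (not of the variants just listed); the context-action feature construction, the "true" coefficients used to simulate rewards, and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, T, alpha, lam = 3, 2, 2000, 1.0, 1.0

def feature(x, a):
    """Context-action feature f(X_t, A_t): context interacted with a one-hot action."""
    return np.concatenate([x * (a == 0), x * (a == 1)])

# True (unknown) coefficient vector, used only to simulate rewards.
mu_true = np.concatenate([np.array([0.2, 0.0, 0.1]), np.array([-0.1, 0.4, 0.0])])

p = d * K
B = lam * np.eye(p)        # analogue of X^T X + lam * I (ridge)
b = np.zeros(p)            # analogue of X^T Y
cum_reward = 0.0

for t in range(T):
    x = rng.normal(size=d)
    mu_hat = np.linalg.solve(B, b)
    ucb = []
    for a in range(K):
        f = feature(x, a)
        point = f @ mu_hat                                   # current reward estimate
        width = alpha * np.sqrt(f @ np.linalg.solve(B, f))   # uncertainty term s_t(a)
        ucb.append(point + width)
    a_t = int(np.argmax(ucb))
    f_t = feature(x, a_t)
    y_t = f_t @ mu_true + rng.normal(scale=0.1)              # observed reward
    B += np.outer(f_t, f_t)                                  # recursive ridge updates
    b += y_t * f_t
    cum_reward += y_t

print("average reward over T rounds:", round(cum_reward / T, 3))
```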
More recently, in addition to the (bandit) regret-minimization goal, attention has been given to statistical objectives. To illustrate, Urteaga and Wiggins (2018) aimed to accommodate more complex models of the environment, e.g., non-linear reward functions and dynamic bandits, by integrating advances in sequential Monte Carlo methods (capable of estimating posterior densities and expectations in probabilistic models that are analytically intractable) within a Bayesian version of the LinUCB problem. In a context similar to ours, i.e., behavioural science, Dimakopoulou et al. (2019) introduced balancing methods from the causal inference literature, i.e., weighting each observation with the estimated inverse probability of a context being observed for an arm, into the regression estimation process, in order to make the bandit algorithm less prone to bias. Theoretical guarantees of their balanced UCB match state-of-the-art bounds, and it helps to reduce bias, particularly in misspecified cases, at the cost of increased variance (which can be controlled by probability clipping; Crump et al., 2009). LinUCB has been successfully applied in mHealth by Paredes et al. (2014) and Forman et al. (2019). The former developed a LinUCB-based intervention recommender system for learning how to match interventions to individuals and their temporal circumstances over time. The aim was to deliver stress management strategies (upon the user's request in the mobile app), with the goal of maximizing stress reduction. After four weeks of study, participants receiving the LinUCB-based recommendations showed a tendency towards using more constructive coping behaviors. Similarly, Forman et al. (2019), in the context of behavioural weight loss (WL), conducted a pilot study to evaluate the feasibility and acceptability of an RL-based WL intervention system, and whether it would achieve equivalent benefit at a reduced cost, compared to a non-optimized intervention system. To this purpose, participants were randomized among a non-optimized, an individually optimized (individual reward maximization), and a group-optimized (group reward maximization) condition. The study showed that the LinUCB-based optimized groups hold strong promise in terms of the outcome of interest, not only being feasible to deploy and acceptable to participants and coaches, but also achieving desirable results at roughly one-third the cost. Under the same linear reward assumption as LinUCB, Agrawal and Goyal (2013) proposed a randomized MAB algorithm, based on a generalization of the Thompson Sampling (TS) technique to stochastic contextual MAB problems. Rooted in a Bayesian framework, the idea of TS is to select, at each time t, an arm according to its posterior probability of being optimal at that time, i.e., maximizing the posterior reward distribution. More specifically, assuming a Gaussian prior for the regression coefficient vector μ, e.g., μ ∼ N(0_d, σ_μ² I_d), and a Gaussian distribution for the reward, i.e., Y_t | μ, f(X_t, A_t) ∼ N(f(X_t, A_t)^T μ, ν²), at each time t the optimal arm ã_t is the one that maximises the a-posteriori estimated expected reward, i.e., f(X_t, A_t)^T μ̃_t. The posterior nature is reflected in μ̃_t, which represents a sample from the posterior distribution, given by N(μ̂_t, ν² B_t^{−1}); here μ̂_t ≐ B_t^{−1} b_t is the posterior mean, with B_t and b_t defined in the same way as for LinUCB. The iterative process is given in Algorithm 4.
Algorithm 4: LinTS (Agrawal and Goyal, 2013)
Input: σ_μ ∈ R, ν ∈ R, T ∈ N, d ∈ N, λ ∈ R^+
Initialization: B_0 = λ I_d, b_0 = 0_d
for t = 0, 1, 2, ..., T do
  Draw μ̃_t from the posterior N(μ̂_t, ν² B_t^{−1}), with μ̂_t = B_t^{−1} b_t;
  for a_t ∈ A do observe the feature f(X_t, A_t = a_t) and compute the a-posteriori estimated expected reward f(X_t, A_t = a_t)^T μ̃_t end for
  Select arm ã_t = arg max_{a_t∈A} f(X_t, A_t = a_t)^T μ̃_t and get the associated reward Y_{t+1}(X_t, A_t = ã_t);
  Update B_{t+1} and b_{t+1} according to the best arm ã_t
end for
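A minimal sketch of the Gaussian-conjugate LinTS update of Algorithm 4, reusing the same hypothetical context-action feature construction as in the LinUCB sketch above; the prior scale, noise level, and true coefficients are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(6)
d, K, T, nu, lam = 3, 2, 2000, 0.5, 1.0

def feature(x, a):
    """Same hypothetical context-action feature construction as in the LinUCB sketch."""
    return np.concatenate([x * (a == 0), x * (a == 1)])

mu_true = np.concatenate([np.array([0.2, 0.0, 0.1]), np.array([-0.1, 0.4, 0.0])])

p = d * K
B = lam * np.eye(p)
b = np.zeros(p)

for t in range(T):
    x = rng.normal(size=d)
    B_inv = np.linalg.inv(B)
    mu_hat = B_inv @ b                                           # posterior mean
    mu_draw = rng.multivariate_normal(mu_hat, nu**2 * B_inv)     # draw from N(mu_hat, nu^2 B^-1)
    a_t = int(np.argmax([feature(x, a) @ mu_draw for a in range(K)]))
    f_t = feature(x, a_t)
    y_t = f_t @ mu_true + rng.normal(scale=nu)                   # observed reward
    B += np.outer(f_t, f_t)                                      # conjugate posterior updates
    b += y_t * f_t

print("final posterior mean of the coefficients:", np.round(np.linalg.solve(B, b), 2))
```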
Given the full data trajectory up to time t, T_{t−1} = {(X_τ, A_τ, Y_{τ+1})}_{τ=0,1,...,t−1}, and f(X_t, A_t), LinUCB is deterministic and allows exploration through the uncertainty term s_t(a_t), while TS is randomized, with exploration given by the random draws from the posterior distribution. Note that, by definition, the standard deviation s_t(a_t) characterizing LinUCB has the same order as the standard deviation of the updated posterior distribution of the reward. LinTS has been extensively studied within the theoretical bandit literature, and its theoretical guarantees are well recognized (Agrawal and Goyal, 2013). Similarly to LinUCB, several extensions, including those proposed by Dimakopoulou et al. (2019) and Urteaga and Wiggins (2018) for LinUCB, have been considered. In what follows we focus on works which have been developed within the mHealth literature, specifically addressing field-related characteristics. Bootstrap TS. Under the normal conjugate family assumed for LinTS, sampling from the posterior is straightforward. However, to be practically feasible with other reward distributions, and scalable to large T or complex likelihood functions, TS requires computationally efficient sampling from the posterior distribution of μ_t | T_{t−1}, f(X_t, A_t). Already when a logit or probit model is used for the reward, the posterior is not available in closed form, and Markov chain Monte Carlo (MCMC) methods may be very costly. Motivated by the above, and inspired by the existing relationship between bootstrap distributions (Rubin, 1981) and Bayesian posteriors (i.e., bootstrap distributions can be used to approximate posteriors; Efron, 2012; Newton and Raftery, 1994), Eckles and Kaptein (2019) formulated a Bootstrap Thompson Sampling (BTS) technique, which replaces the posterior with an online bootstrap distribution of the point estimate μ̂_t at each time t. In empirical evaluations, the authors showed that, in comparison with LinTS and other methods, BTS is more robust to model misspecification, thanks to the robustness of the bootstrap approach, and can be easily adapted to dependent observations, a common feature of the behavioral sciences (Eckles and Kaptein, 2014, 2019). Intelligent-Pooling TS. When data on individuals are limited, learning an adaptive strategy separately for each user may be very slow, particularly if data are sparse and/or noisy and the process is non-stationary. Since direct pooling of data across users can introduce bias, Tomkins et al. (2021) introduced a novel Intelligent Pooling algorithm that generalizes LinTS by using a Gaussian mixed-effects linear model for the reward. Mixed-effects models are widely used across the behavioral sciences and mHealth to model heterogeneity across individuals and within an individual over time (Raudenbush and Bryk, 2002; Laird and Ware, 1982). Empirical evaluations showed that Intelligent Pooling achieves improved regret compared to the state of the art, demonstrating the promise of personalization even with a small group of users. Action-Centered TS. Motivated by specific challenges arising in mHealth, Greenewald et al. (2017) extended the linear stationary model of LinTS to a non-stationary and non-linear version composed of two parts: a baseline reward (associated with a "do nothing" or control arm, denoted with 0) and a treatment or action effect.
Assuming K (non-control) arms in addition to the 0 (control) arm, at each time step t ∈ N the expected reward model is formalized as E[Y_{t+1}(x_t, a_t)] = g_t(x_t) + I(a_t ≠ 0) f(x_t, a_t)^T μ, with f(x_t, a_t) ∈ R^d a fixed context-action feature (with context X_t chosen by an adversary on the basis of the trajectory T_{t−1} up to time t), μ ∈ R^d the parameter vector, and g_t(x_t) a time-varying component that can vary in a way that depends on the past, but not on the current action (thus allowing for non-stationarity). The term adversarial in contextual bandits refers to the context and reward generation mechanism: when both contexts and rewards are allowed to be chosen arbitrarily by an adversary, no assumptions on the generating process are made. The indicator function I(a_t ≠ 0) specifies the additional component of the expected reward contributed by the non-control arms. To estimate the unknown parameter μ, due to the arbitrarily complex baseline reward, the authors propose to work on the differential reward, using the so-called action-centering trick to eliminate the g_t(x_t) component and derive an unbiased estimator. Furthermore, to avoid sending too few or too many interventions, and to prevent the algorithm from converging to an ineffective deterministic policy, a constraint on the size of the probabilities of delivering a non-control intervention (i.e., probability clipping) is considered. The proposed Action-Centered TS (ACTS) can be viewed as a two-step hierarchical procedure, where the first step is to estimate the arm that maximizes the reward, and the second step is to randomly select a non-control arm A_t ≠ 0 (vs the control) with probability π_t = min{π_max, max{π_min, P(f(X_t, Ā_t)^T μ̃ > 0)}}, where Ā_t denotes the selected non-control arm, μ̃ a draw from the posterior distribution, and π_min and π_max in [0, 1] constant probability-clipping constraints chosen by domain science. The algorithm was empirically evaluated on a physical activity mHealth study, HeartSteps (Liao et al., 2015; Klasnja et al., 2015), which has been of great interest in both biostatistics (Liao et al., 2016; Boruvka et al., 2018) and the RL/bandit literature (Greenewald et al., 2017; Lei et al., 2017; Liao et al., 2020). In this context, following the ACTS strategy, Liao et al. (2020), for instance, incorporated into the differential reward model an "availability" variable, indicating whether the user is available to receive an intervention. The authors showed that ACTS achieves performance guarantees similar to those of the linear reward setting, while allowing for non-linearities in the baseline reward. Theoretical improvements over ACTS are given in Krishnamurthy et al. (2018) and Kim and Paik (2019), in which a relaxation of the action-independence assumption on the component g_t(x_t) in (5.2) is considered. By allowing dependence on both time and history, i.e., E(Y_{t+1} | H_t = h_t, A_t = a_t) = f(x_t, a_t)^T μ + g_t, the baseline component of the reward model is made entirely non-parametric. For estimating the unknown parameters, Krishnamurthy et al. (2018) proposed the adversarial Bandit Orthogonalized Semiparametric Estimation (BOSE) method, based on an action-elimination strategy adapted from Even-Dar et al.
(2006), and a centering trick as in Greenewald et al. (2017) to cancel out g_t. The proposed estimator μ̂_t at time t is a ridge-type regression estimator (with ridge penalty parameter λ ≥ 0) computed on action-centered features, i.e., features minus their conditional expectation under the action-selection distribution; this centering represents the centering trick, and it is constructed so that, conditional on the history, the non-parametric component g_t cancels out. The BOSE algorithm does not require any constraint on the action choice probabilities as in (24), and it matches the best known regret bound of LinUCB for linear reward models. However, the action elimination step requires O(K²) computations at each round, and, in order to meet the regret bound, the action-selection distribution must satisfy non-trivial conditions. This difficulty is overcome in Kim and Paik (2019) with an alternative estimator that requires only O(K) computations at each t and enjoys a tighter high-probability upper bound than BOSE, matching the bound of LinTS. Specifically addressing the problem of personalized mHealth interventions, Lei (2016) proposed to use an alternative class of RL algorithms, known as actor-critic (AC) RL (Sutton and Barto, 2018; Grondman et al., 2012), in which both policies and value functions are learned. The actor is the component that learns policies, and the critic the one that learns value functions, which are then used to "criticize" and update the actor's policy. In this sense, AC architectures combine direct and indirect methods, and in specific settings they provide a framework of equivalence between the two distinct approaches (Guan et al., 2019). Considering a binary action space A = {0, 1}, and assumptions similar to those of LinTS and LinUCB (i.e., a linear reward model and bounded rewards and features), the authors formulated the problem as a stochastic contextual MAB and proposed a class of parameterized stochastic policies, with P(A = 1 | X = x) = π(1 | x; θ) = exp(g(x)^T θ) / (1 + exp(g(x)^T θ)), and g(x) a p-dimensional feature. Similarly to the ACTS strategy, the authors also considered a stochastic chance constraint of the form P_X(π_min ≤ π(1 | X; θ) ≤ 1 − π_min) ≥ 1 − α (25), with π_min ∈ (0, 0.5) and α ∈ (0, 1) constants controlling the amount of stochasticity. By improving treatment variety, this constraint may increase engagement and decrease the habituation effect (Raynor and Epstein, 2001; Epstein et al., 2009; Wilson et al., 2005), which deterministic policies can easily incur. An optimal policy is then obtained by maximizing the expected reward under the policy π(a | x; θ), i.e., V(θ) ≐ E_{π_θ}(Y), subject to the constraint in (25). To solve the non-trivial optimization problem (given the non-convex constraint on θ), first a relaxation of (25) is made, and then the Lagrangian function J_λ(θ), with λ the Lagrangian multiplier, is proposed as an alternative objective (26); J_λ(θ) combines the expected reward under π_θ, averaged over the context distribution p(x) (fixed but unknown), with a penalty term for the relaxed constraint. For a given λ, the optimal policy π* ≐ π_{θ*} is the one with θ* ≐ arg max_{θ∈Θ} J_λ(θ). However, both J_λ(θ) and E(Y | X = x, A = a) are unknown. The conditional mean (Q-function) is first estimated through a penalized (L_2-norm) linear regression; this is the critic step. Then, the estimates for each a and x are plugged into (26), and an estimated optimal actor's policy is derived based on the MC estimate of J_λ(θ), in which the expectation over the context distribution is replaced by P_T, the empirical average over T i.i.d. samples. We illustrate the full procedure in Algorithm 5. An extension of this algorithm to settings with outliers was proposed in Zhu et al. (2018).
The proposal involves the use of the capped-L_2 norm, instead of the standard L_2 norm, in the critic step, and a modification of the regularized average reward in (26) that takes into account only data trajectories whose residuals satisfy the capped condition. The procedure essentially assigns zero weight to tuples with large residuals in the critic update and excludes them from the actor step.
Algorithm 5: Actor-Critic Contextual Bandits (Lei et al., 2017)
Input: T ∈ N, λ ∈ R^+, a class of parameterized policies {π_θ: θ ∈ Θ ⊆ R^p} based on a p-dimensional policy feature g(x)
Critic Initialization: B_0 = λ I_d, b_0 = 0_d
Actor Initialization: initial policy parameter θ̂_0 based on domain theory or prior data
for t = 0, 1, 2, ..., T do
  Observe context X_t and the feature vector f(X_t, A_t);
  Draw an action ã_t according to the probability distribution π_{θ̂_t}(X_t, A_t);
  Get the associated reward Y_{t+1}(X_t, A_t = ã_t);
  Critic update: update and estimate the regression coefficient μ̂_t = B_{t+1}^{−1} b_{t+1} and the associated reward Ŷ_{t+1} = f(X_t, A_t = ã_t)^T μ̂_t;
  Actor update: estimate the unknown policy parameter θ̂_{t+1} by maximizing the estimated objective Ĵ_λ(θ)
end for
RL methods in mHealth encompass, in addition to the aforementioned methods, a few other alternatives that fall outside these popular MAB categories. These include, among others, the works of: i) Yom-Tov et al. (2017), for evaluating the effectiveness of personalized feedback in increasing the adherence of diabetic patients to recommended physical activity regimes; ii) Zhou et al. (2018), for developing a fitness app, CalFit, which automatically sets personalized, adaptive daily step goals and adopts behavior-change features such as self-monitoring; and iii) Rabbi et al. (2015), again for developing a physical activity app, MyBehavior, able to automatically learn users' physical activity and dietary behavior and strategically suggest changes to those behaviors for a healthier lifestyle, also incorporating users' preferences. While the first two works are based on a more full-RL oriented approach, the third considers an adversarial MAB approach, namely the randomized context-free exponential-weight algorithm for exploration and exploitation (EXP3; Auer et al., 2002b; Bubeck and Cesa-Bianchi, 2012). EXP3 has been shown to be able to quickly adapt to changes in reward functions: if, e.g., the user starts following new suggestions or their lifestyle changes (e.g., moving to a new location), then the underlying benefits of certain behaviors also change. In Yom-Tov et al. (2017) two policies were considered for the treatment arms: an "initial policy" based on the results of Elliot and Church (1997) to incentivise exploration, designed so that i) no message was sent on 20% of days, and ii) on the remaining days, a negative or a positive feedback might be received by the user with equal probability based on their expected fraction of activity; and a "learning policy", based on a linear regression algorithm with interactions and on Boltzmann sampling (Watkins, 1989) applied to the outputs of the learning algorithm to choose the feedback message to be given. Finally, Zhou et al. (2018) proposed a predictive quantitative model for each participant based on the historical steps and goal data for that user, as in Aswani et al. (2019).
It involves a two-stage RL for selecting the optimal interventions: in the first stage, inverse RL (Ng et al., 2000) is employed to estimate the parameters of the predictive model for each user, while in the second stage, an RL technique equivalent to a direct policy search (Sutton and Barto, 2018), with model parameters estimated in the first stage, is used. In the previous two sections we reviewed existing RL-based methodologies for constructing DTRs and JITAIs, respectively. These were introduced within the ML and statistical literature and generally evaluated through simulations. Now, we provide a more concrete idea of the application of RL in these two types of AIs in real-world settings, as found in the clinical literature. By combining our relevant methodological knowledge with motivating studies we conducted in both AI sub-areas, we illustrate the main challenges that clinical or behavioral researchers face in applying these methods in practice, and the main limitations that might impact a successful application. We also illustrate the decisions we made and the tremendous collaboration opportunities between statisticians, ML researchers, and applied practitioners in the space of AIs. Based on a two-stage SMART design, the PROJECT QUIT - FOREVER FREE study aimed to develop and compare internet-based (precursors to mobile-based) behavioral interventions for smoking cessation and relapse prevention. The primary aim, based on the six-month-long first stage of the study, i.e., PROJECT QUIT, was to find an optimal multi-factor behavioral intervention to help adult smokers quit smoking; see Strecher et al. (2008) for more details. The second stage, known as FOREVER FREE, was a six-month-long follow-on study to help PROJECT QUIT participants who quit remain non-smoking, and to offer a second chance to those who failed to give up smoking at the previous stage. These two stages were then considered together with the goal of finding an optimal DTR over a twelve-month study period; this was a secondary aim of the main study. RL was not used in the design phase; in other words, this was not an instance of online learning. The RL-type learning happened offline on completion of data collection, when Q-learning (see Algorithm 1) with a linear model and a variant (soft-thresholding) were employed. The choice of Q-learning was driven by its simplicity and interpretability. Detailed results from this secondary analysis can be found in Chakraborty et al. (2010). Here, we only summarize the main challenges faced and decisions made during this process. In PROJECT QUIT, the original plan was to randomly administer and test six behavioral intervention components (factors), each varied at two levels (highly individually tailored vs. not), according to a 32-cell fractional factorial design (FFD). However, due to a programming error, one of those factors was not properly implemented. Subsequently, utilizing the factorial structure, the design was "folded" to convert it to a 16-cell FFD and that particular factor was dropped from further consideration at the analysis stage. In the primary analysis, which was a traditional logistic regression analysis, only two out of the five factors considered came out statistically significant. Based on this finding, only these two intervention factors (each at two levels) were considered in the stage-1 Q-learning model.
Likewise, various participant-level contextual/tailoring variables were considered in the primary analysis, but only three of them (education, motivation, and self-efficacy) came out statistically significant. Again, informed by the primary analysis, only these variables were considered in the stage-1 model of Q-learning, allowing a parsimonious choice of model. In FOREVER FREE, originally there were four versions of an active behavioral intervention and a control arm, i.e., five arms in total. At the analysis stage, however, the four versions of the active intervention were found to be minimally different from each other, and hence were collapsed into a single arm. That decision resulted in only two intervention arms at the second stage of Q-learning. In addition to education, motivation, and self-efficacy, the quit status at the end of stage 1 (PROJECT QUIT) was considered as a tailoring variable in the Q-learning model corresponding to stage 2 (FOREVER FREE). Some interaction terms between the chosen intervention factors and the tailoring variables were included in the Q-learning models, informed by psycho-social scientific theories of behavior change. At the time of the above analyses, techniques for checking model adequacy in Q-learning (or other offline RL methods) were not available; some methodologies were subsequently developed. Reward function. The primary outcomes at both stages of the original study were the corresponding seven-day point prevalence of smoking (i.e., whether or not the participant smoked even a single cigarette in the last seven days at six months following the randomization), a dominant measure in the smoking cessation literature. These outcomes were considered as the stage-specific reward functions in Q-learning. However, the basic operationalization of Q-learning is for continuous outcomes, while the seven-day point prevalence outcomes were binary. An additional Q-learning analysis with a relatively more continuous reward function, the number of months not smoked over the last six months (a secondary outcome in the main study), was also conducted. Qualitatively, the results were not too different. Missing data in the reward variable. In PROJECT QUIT, 1848 participants were randomly allocated to various interventions, but only 479 of them decided to continue to FOREVER FREE; this flexibility was part of the protocol, and hence the remaining PROJECT QUIT participants were not considered to be drop-outs for the FOREVER FREE part of the study. However, only 1401 out of the 1848 stage-1 participants completed the six-month outcome survey; these 1401 participants were treated as complete cases, while the remaining 447 participants were considered drop-outs in stage 1. Similarly, 281 participants (out of 479) who completed the stage-2 six-month survey were treated as complete cases, while the remaining 198 participants were considered drop-outs in stage 2. Descriptive checks revealed that drop-out was more or less uniform across the different intervention arms at both stages. One can employ modern missing data analysis techniques, e.g., multiple imputation, before applying Q-learning or other offline RL methods to SMART data to learn about optimal DTRs (see Shortreed et al., 2014, for details). In the case of the PROJECT QUIT - FOREVER FREE data, Chakraborty et al. (2010) only presented a complete-case analysis, while Chakraborty (2009) also presented a Q-learning analysis of multiply-imputed data.
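To make the type of analysis described above concrete, the following is a minimal sketch of two-stage Q-learning with linear working models and a backward-induction (argmax) step; the simulated variables and model forms are hypothetical and do not reproduce the PROJECT QUIT - FOREVER FREE models or data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: one tailoring variable per stage, binary treatments coded in {-1, +1}.
x1 = rng.normal(size=n)                      # stage-1 tailoring variable
a1 = rng.choice([-1, 1], size=n)             # stage-1 randomized treatment
x2 = 0.5 * x1 + rng.normal(size=n)           # stage-2 tailoring variable
a2 = rng.choice([-1, 1], size=n)             # stage-2 randomized treatment
y = 1 + x1 + 0.5 * a1 * x1 + 0.8 * a2 * x2 + rng.normal(size=n)  # final outcome

def design(x, a):
    """Linear Q-model features: main effect of history plus treatment-by-history interaction."""
    return np.column_stack([np.ones_like(x), x, a, a * x])

# Stage 2: regress the outcome on stage-2 history and treatment.
theta2, *_ = np.linalg.lstsq(design(x2, a2), y, rcond=None)

def q2(x, a):
    return design(x, np.full_like(x, a, dtype=float)) @ theta2

# Pseudo-outcome: predicted outcome under the best stage-2 action (the argmax step).
y_tilde = np.maximum(q2(x2, 1), q2(x2, -1))

# Stage 1: regress the pseudo-outcome on stage-1 history and treatment.
theta1, *_ = np.linalg.lstsq(design(x1, a1), y_tilde, rcond=None)

# Estimated rules: recommend a = +1 whenever the fitted treatment contrast is positive.
d2 = lambda x: np.where(q2(x, 1) > q2(x, -1), 1, -1)
d1 = lambda x: np.where(design(x, np.ones_like(x)) @ theta1 >
                        design(x, -np.ones_like(x)) @ theta1, 1, -1)
print(d1(np.array([0.5])), d2(np.array([0.5])))
```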
The DIAMANTE trial (Avila-Garcia et al., 2019; Aguilera et al., 2020) is a recently NIH-funded, in-the-field mHealth study, which we helped design, for sending different combinations of interventions in an adaptive way. The overall aim of the study was to encourage individuals to become more physically active by sending them suitable text messages. A general overview of the study, including the MRT design, is given in FIG. 4, while analyses of preliminary data, which showed promising results about the impact of such interventions on physical activity, are presented in Figueroa et al. (2021). Here, we focus on the challenges and decisions we faced in relation to the RL-based strategy, while a broader list of field-specific problems was discussed elsewhere. Choice of the algorithm. In this trial, we employed RL only for one of the groups, regarded as the adaptive experimental group. We proposed the contextual linear TS algorithm (Algorithm 4), and decided to implement it after the first two weeks, in which text messages were sent uniformly at random (analogous to an initial "burn-in" period, or, more appropriately, an "internal pilot" for acquiring some prior data to feed into the main algorithm). The TS choice was motivated by several reasons. First, its empirical and theoretical properties have been well studied, showing strong performance (Chapelle and Li, 2011; Agrawal and Goyal, 2013). Second, it is computationally efficient, and thus particularly suitable for online learning (Russo et al., 2017). Third, it is a randomized algorithm, and as such it mitigates different forms of bias and enables causal inference (Rosenberger et al., 2019). Finally, TS has been widely applied in real-world applications, including mHealth (Liao et al., 2020), showing successful results even with small amounts of data (Agrawal and Goyal, 2013). Variable selection. Contextual MABs formalize the reward model as a function of both intervention and contextual variables, which can be used for personalization. Thus, an adequate choice of the variables to be included in the model is crucial for unbiasedly estimating the parameters of interest and the associated causal effects. As shown in FIG. 4, the DIAMANTE study results in a relatively high-dimensional action space, where each arm is a combination of the 4 × 5 × 4 factor levels. This is further complicated by the presence of a high number of both baseline and time-varying covariates that may also interact with the interventions. In the absence of reliable estimates at the start of the study (not enough data from pilot phases), we considered all arms and all baseline variables shown to be relevant in the literature, and also included action-action and action-context interactions. Given this high dimensionality, we adopted a slightly different reward model from the one proposed in Algorithm 4, which provides regularization by shrinking coefficients and helps avoid overfitting (Marquardt and Snee, 1975). More specifically, at each time t, we assumed a Gaussian distribution for the reward, and a multivariate normal inverse gamma (NIG) conjugate prior for the joint distribution of the regression coefficient vector β ∈ R^d and the variance parameter σ² > 0: (β, σ²) | μ_β, Σ_β, a, b ∼ NIG_{d+1}(μ_β, Σ_β, a, b). The hyperparameters μ_β ∈ R^{d+1}, Σ_β, and a, b ∈ R_{>0} are assumed to be fixed and known.
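As an illustration of this kind of reward model, the sketch below implements linear Thompson sampling under a normal inverse gamma conjugate prior; the feature map, the number of arms, all hyperparameter values, and the simulated step-count rewards are illustrative assumptions rather than the DIAMANTE specification.

```python
import numpy as np

class LinearTS:
    """Thompson sampling for a Bayesian linear reward model with an NIG conjugate prior.

    Prior: beta | sigma^2 ~ N(mu0, sigma^2 * inv(P0)), sigma^2 ~ InvGamma(a0, b0).
    """

    def __init__(self, dim, a0=1.0, b0=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mu0, self.P0 = np.zeros(dim), np.eye(dim)   # prior mean and precision
        self.a0, self.b0 = a0, b0
        self.XtX, self.Xty = np.zeros((dim, dim)), np.zeros(dim)
        self.yty, self.n = 0.0, 0

    def _posterior(self):
        P = self.P0 + self.XtX                           # posterior precision
        mu = np.linalg.solve(P, self.P0 @ self.mu0 + self.Xty)
        a = self.a0 + self.n / 2.0
        b = self.b0 + 0.5 * (self.yty + self.mu0 @ self.P0 @ self.mu0 - mu @ P @ mu)
        return mu, P, a, b

    def choose(self, F):
        """F: (K, dim) array of candidate-action features; returns the sampled best arm."""
        mu, P, a, b = self._posterior()
        sigma2 = 1.0 / self.rng.gamma(shape=a, scale=1.0 / b)   # sigma^2 ~ InvGamma(a, b)
        beta = self.rng.multivariate_normal(mu, sigma2 * np.linalg.inv(P))
        return int(np.argmax(F @ beta))

    def update(self, x, reward):
        self.XtX += np.outer(x, x)
        self.Xty += reward * x
        self.yty += reward ** 2
        self.n += 1

def action_features(context):
    """Hypothetical feature map: intercept, context, and indicators for arms 1 and 2."""
    F = np.zeros((3, 4))
    F[:, 0], F[:, 1] = 1.0, context
    F[1, 2], F[2, 3] = 1.0, 1.0
    return F

ts, rng = LinearTS(dim=4, seed=2), np.random.default_rng(3)
for t in range(300):
    ctx = rng.normal()
    F = action_features(ctx)
    arm = ts.choose(F)
    reward = 0.3 * ctx + [0.0, 0.5, 0.2][arm] + rng.normal()  # simulated step-count change
    ts.update(F[arm], reward)
print(ts._posterior()[0])  # posterior mean of the regression coefficients
```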
Reward variable. As in DTRs, appropriately choosing the reward variable is a major issue, and in mHealth it strongly depends on the mobile or wearable data-collection instrument. In this study, we used the daily step counts (collected by the pedometer on participants' personal phones) as a proxy outcome for measuring physical activity. However, in order to account for users' baseline walking propensity, we decided to consider the change in steps from one day to the next, starting the step count from the time an intervention message is sent. Compared to the raw step count, the change in steps more closely resembles the Gaussian shape assumed by our reward model. Missing data in the reward variable. To date, there has been a lack of mHealth studies that have addressed this problem. However, in an online experimental setting it is particularly relevant, as missing rewards may impact the subsequent selection of interventions. We tried to take a step forward by first carefully pre-processing the data, e.g., setting all zero step counts to missing; this is in line with the existing literature, which suggests that such values are typically due to technical errors of the device, or simply to users forgetting to carry their phones when walking. We then also performed multiple imputation of the missing data. However, we took an exploratory approach and performed this analysis as a sensitivity analysis only, rather than executing it online and allowing the bandit algorithm to use the imputed data. For dealing with missing reward data online, we used the last-observation-carried-forward technique (Hamer and Simpson, 2009). Speeding up learning. To speed up the learning rate of the algorithm, it is recommended to have prior knowledge to inform the prior distributions used therein. As this study was not designed with a pre-period of large-scale data collection, we started the adaptive assignment only after an initial uniform random assignment of two weeks. This approach was demonstrated, via simulation studies, to effectively speed up the TS learning, with priors informed by the acquired data. Non-stationarity and delayed reward. Different studies have noticed the potential presence of non-stationarity in mHealth data (due to the recognized phenomenon of habituation in behavioral sciences; Klasnja et al., 2008; Dimitrijević et al., 1972), as well as delayed feedback. Some alternative bandit strategies have been proposed in the theoretical literature to specifically address these issues (Pike-Burke and Grunewalder, 2019; Cella and Cesa-Bianchi, 2020). However, their effectiveness has not been assessed in health-related settings, and their implementation in online learning may be computationally expensive. We proposed to use a simplified alternative to the recovering bandit approach of Pike-Burke and Grunewalder (2019): we included a time- and action-dependent covariate in the reward model, defined as the number of days since a specific intervention was last sent, so that the expected reward of each arm can vary according to the number of rounds since the arm was last played. We also modeled the reward as a function of time, as previous research had shown that responses decrease with the time a user is in the study (Klasnja et al., 2019). Additional existing challenges in current mHealth research are reported in Liao et al. (2020); they include accommodation of model misspecification and noisy data, users' disengagement, and the possibility of conducting reliable secondary data analyses.
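To make two of the devices described above concrete, namely the last-observation-carried-forward imputation and the recovery-type covariate, the following is a minimal sketch; the encoding of arms never sent and the toy inputs are illustrative assumptions, not the DIAMANTE definitions.

```python
import numpy as np

def days_since_last_sent(history, n_arms, t):
    """For each arm, the number of days since it was last sent (t + 1 if never sent).

    `history` is the list of arms sent on days 0..t-1; the "never sent" encoding
    is an illustrative choice.
    """
    out = np.full(n_arms, t + 1, dtype=float)
    for day, arm in enumerate(history):
        out[arm] = t - day
    return out

def locf(rewards):
    """Last-observation-carried-forward imputation for a 1-D reward series with NaNs."""
    rewards = np.asarray(rewards, dtype=float)
    filled, last = rewards.copy(), np.nan
    for i, r in enumerate(rewards):
        if np.isnan(r):
            filled[i] = last      # carry the previous observed reward forward
        else:
            last = r
    return filled

print(days_since_last_sent([0, 2, 0], n_arms=3, t=3))   # arm 1 never sent -> 4
print(locf([120.0, np.nan, np.nan, -40.0, np.nan]))
```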
Despite the promising effects of ML in mHealth, the number of studies is still too small for a rigorous evaluation of ML methods. In addition, significant challenges must be overcome before RL can be effectively deployed in a mobile-based healthcare setting. This calls for the development of appropriate guidelines in this emerging and notably growing field. The development of evidence-based AIs represents a key methodological line of research within the domain of personalized medicine. Starting from the pioneering works of Murphy (2003) and Robins (2004), a growing body of literature has discussed the promising role of RL as an alternative to standard statistics, and has applied it for constructing optimal AIs in real life. While the first proposals originated within causal inference and statistics for estimating optimal DTRs offline (based on finite-trajectory samples), more recently an increasingly active interdisciplinary area of research, consisting of behavioral scientists, computer scientists, and statisticians, among others, has started to use RL for deploying AIs online. We are referring to JITAIs in mHealth, i.e., AIs in which interventions (typically behavioral) are continuously adapted according to users' in-the-moment context or needs. In this work, under a unified framework that brings together not only the two applied domains of DTRs and JITAIs but also their formalization as an RL problem, we provided a comprehensive state-of-the-art survey on RL methods, real examples, and challenges in AIs. To the best of our knowledge, this represents the first piece of work that bridges the different domains under a unique umbrella intersecting the areas of RL, causal inference, and AIs, among others. We formalized the theoretical foundations of RL from a statistical perspective (notation and terminology), augmenting it with the fundamental pillar for developing AIs, i.e., the causal inference framework. We discussed the emerging field of JITAIs in parallel with the extensively studied and surveyed DTR literature, reporting similarities and divergences between these two types of AIs. Notably, probably due to the historical origins of these two lines of research, and despite their aiming to solve the same type of healthcare problem of improving or optimizing an outcome of interest, DTRs are mostly focused on (offline) estimation and identification of causal relationships, while JITAIs focus on (online) regret performance, neglecting the problem of inference. Only recently has this area started to investigate the possibility of secondary statistical goals, such as estimation and hypothesis testing, and the validity of traditional statistical procedures in data adaptively collected online through RL algorithms. The ML community has led the way in addressing such issues, often borrowing tools from causal inference that are well studied in the DTR literature. For example, the "stabilizing policy" approach of Zhang et al. (2021) is analogous to the "stabilized weights" of the causal inference literature (Robins, 2000). Similarly, the adaptively weighted augmented inverse propensity weighted estimator in Hadad et al. (2021) is inspired by the IPTW estimator discussed in Section 4.1.1. Such tools are employed not only for inferential objectives but also for improving online learning (see, e.g., Dimakopoulou et al., 2019, 2021).
We notice that our focus in this work has been on the RL optimization problem rather than on inference, but we also emphasize that, like DTRs, JITAIs should be grounded in a valid statistical framework that allows interpretability and generalizability of the results to future individuals and populations, not only trial participants. To this end, the DTR literature offers a rich basis of statistical challenges and tools that may benefit researchers and practitioners in the JITAI area. We then provided an extensive instructive review of existing RL methods for developing AIs, discussing potential challenges that may impact their use in practice. In doing this, we built our narrative guided by two motivating real-life studies we have been involved in. From a methodological perspective, we noticed that the use of RL in DTRs has been extensively evaluated in the theoretical statistical and RL communities, and several surveys exist (Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014; Tsiatis et al., 2019). However, their real-world (clinical) application is still very limited. As we discuss in Supplementary Material C, the majority of the existing studies use real-world data as motivating or illustrative examples only. The few clinical studies mainly focus on offline learning based on observational data (e.g., EHRs) and deep learning methodologies, which may limit the interpretability of the results. We believe that the drivers of this tendency are related to: 1) the lack of existing guidelines for developing optimal DTRs that ensure robust inference and adequate power and enable generalizable conclusions; 2) the clinical setting itself, characterized by high costs, ethical concerns, and inherent complexities, which makes experimentation hard and limits DTR development to offline learning via observational datasets, which may be confounded; 3) existing challenges and open problems related to the RL process itself (see details in Section 6.1), which may require disease-specific adaptations. The RL process could and should be formalized according to suitable relationships characterizing the (often complex) clinical domain. When defining the reward function, for instance, one may take into account prior knowledge on the specific disease, multiple objectives, and the presence of unstructured data. From an implementation perspective, while several software packages exist for many of the reviewed algorithms (we report them in Supplementary Material D), these are often suitable only under specific (simplified) settings (e.g., continuous and positive rewards), and require users' knowledge of the specific software. On the other hand, we have identified an opposite tendency in the development of JITAIs in mHealth applications. Here, the majority of the surveys refer to real-world applications, generally targeting a specific problem area such as promoting physical activity (Hardeman et al., 2019), rather than providing a methodological review, which currently does not exist. In this case, we recognize that the application area, mostly related to behavioral rather than clinical aspects, might have fewer concerns in terms of treatment costs, risks, and ethics, and that the general aim is more focused on optimizing a proximal (behavioral) outcome than on allowing for generalizable conclusions.
As mentioned above, inferential aspects of existing RL-based methodologies in this field are still poorly addressed, even though recent studies have demonstrated that data adaptively collected through RL and MABs may have a remarkable negative impact on inference (Zhang et al., 2020b; Hadad et al., 2021; Deliu et al., 2021). Thus, despite the growing popularity of JITAIs, owing to several existing challenges (extensively discussed in Section 6.2) and a lack of guidance on constructing high-quality evidence-based JITAIs that allow reliable comparisons across different interventions, the field is still insufficiently mature. In summary, this manuscript offers the first survey that brings together the areas of DTRs and JITAIs under the RL problem framework. While the two areas ideally share the same problem of finding optimal policies (in line with the RL framework), their priorities are not always aligned, due to historical links or to domain interests and/or restrictions. For example, if SMARTs were used in practice more often, then, in addition to collecting high-quality experimental data, decisions could also be optimized online, benefiting trial participants as well (see, e.g., the proposals in Cheung et al., 2015; Wang et al., 2021), as done in JITAIs. Analogously, by using the rich resources on statistical inference made available by the DTR literature, the JITAI literature may extend its goals beyond within-trial optimization. Furthermore, we offer key instructive contributions that, in addition to a unified foundation for the diverse notations, terminologies, and formalizations of the RL problem in our domains of interest, include: 1) a comprehensive study of the state-of-the-art RL methodologies in the AI area, 2) our experience in using RL in real life, and 3) the current challenges that prevent and/or impact their practical use in healthcare. Our hope is that such a unified common ground, where theoretical (statistics and ML) and applied healthcare disciplines can easily cooperate, will help unlock the potential of the opportunities RL offers in AIs and allow benefiting from them in a statistically justifiable way. Future work may expand the current project by addressing other healthcare domains outside the AI area that use RL for either offline or online learning problems. A non-exhaustive list of examples is mentioned in the recent works of Levine et al. (2020), Clifton and Laber (2020), Kosorok and Moodie (2015), and Yu et al. (2021), and includes, among others, the problem of automated medical diagnosis (see, e.g., Ling et al., 2017) and the design of adaptive clinical trials (see, e.g., Villar et al., 2015). We aim to pursue some of these directions as separate work in the near future, starting from the growing area of adaptive clinical trial designs, which is capturing increasing attention from regulatory bodies (FDA, 2019). In such settings, by utilizing and processing accumulating data in an online fashion, RL and MAB methods could contribute to making clinical trials more flexible, efficient, informative, and ethical (Pallmann et al., 2018). Based on our findings, we strongly believe that RL offers a powerful solution in these areas, and we hope that our contributions may incentivize a higher synergy and cooperation between the statistical and machine learning communities in supporting applied clinical and behavioral domains to carry out real-world studies that may improve the quality of intervention delivery.
We also recognize that this cooperation is very timely due to the spread of mHealth applications, which need to come with trustworthy and reproducible results in order to advance scientific progress and knowledge.

To estimate the optimal treatment regime, we model the regrets by defining an approximation space for the $t$-th advantage $\mu$-function, e.g., $\mathcal{M}_t \doteq \{\mu_t(h_t, a_t; \psi_t) : \psi_t \in \Psi_t\}$, with $\Psi_t$ a subset of the Euclidean space. As with Q-learning and contrast-based A-learning, we use ADP and permit the estimator to have different parameters for each time $t$, but in this case the estimation strategy is IMOR, an iterative search algorithm proposed by Murphy (2003). More specifically, Murphy (2003) proposed to simultaneously estimate the regret model parameter $\psi$ and an additional parameter $c$, used for improving the stability of the algorithm, by searching for $(\hat\psi, \hat c)$ that satisfy the estimating equation (29), which involves the policy-averaged regrets $\sum_a \mu_t(H_t, a; \hat\psi)\,\pi_t(a \mid H_t; \hat\alpha)$, for all $\psi$ and $c$, with $\mathbb{P}_N$ denoting the empirical mean over a sample of $N$ patients. The technique proposed for finding solutions to (29) is an iterative search algorithm run until convergence, known as iterative minimization for optimal regimes (IMOR). It has been shown that IMOR is a special case of G-estimation under the null hypothesis of no treatment effect, with modeling by a constant (Moodie et al., 2007). We point readers interested in this technique and its relationship with G-estimation to the original work of Murphy (2003) and Moodie et al. (2007).

Assuming a single-stage treatment regime with two treatment options ($A \in \{a, a'\}$), let $H = X_0$ denote the patient's history, $d(H) \doteq d(H; \psi)$ a treatment regime indexed by $\psi$, $\mu(A, H; \hat\beta)$ an estimated model for the mean outcome $E[Y \mid H, A]$, and $\pi(A \mid H; \hat\gamma)$ an estimated propensity score. Then the augmented inverse probability of treatment weighting (AIPTW) estimator of the value of $d$ is defined by
$$\hat{V}_{AIPTW}(d) = \mathbb{P}_N\left[\frac{C_d}{\pi_d(H; \hat\gamma)}\,Y - \frac{C_d - \pi_d(H; \hat\gamma)}{\pi_d(H; \hat\gamma)}\,\mu\big(d(H), H; \hat\beta\big)\right],$$
where $C_d \doteq \mathbb{1}\{A = d(H)\}$ and $\pi_d(H; \hat\gamma) \doteq \pi(d(H) \mid H; \hat\gamma)$. It only requires either the propensity model or the mean outcome model to be correctly specified, but not both; hence it is a doubly robust method. In addition to being more robust to model mis-specification, AIPTW estimators tend to be more efficient than their non-augmented counterparts (Robins, 2004).
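As a concrete (and deliberately simplified) illustration of the AIPTW value estimator above, the sketch below computes it for a fixed single-stage regime using a logistic working propensity model and a linear outcome model; the simulated data, the model choices, and the regime itself are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 2000

# Simulated single-stage data: binary treatment, one covariate (illustrative).
h = rng.normal(size=n)
p_a = 1 / (1 + np.exp(-0.4 * h))            # true propensity, unknown to the analyst
a = rng.binomial(1, p_a)
y = 1 + h + a * (0.5 - h) + rng.normal(size=n)

def aiptw_value(d, h, a, y):
    """Doubly robust (AIPTW) estimate of the mean outcome under regime d."""
    H = h.reshape(-1, 1)
    # Working models: logistic propensity and linear outcome model with interaction.
    prop = LogisticRegression().fit(H, a)
    out = LinearRegression().fit(np.column_stack([h, a, a * h]), y)

    d_h = d(h)                                             # regime-recommended action
    c_d = (a == d_h).astype(float)                         # indicator of following d
    pi_d = np.where(d_h == 1,
                    prop.predict_proba(H)[:, 1],
                    prop.predict_proba(H)[:, 0])           # P(A = d(H) | H)
    mu_d = out.predict(np.column_stack([h, d_h, d_h * h])) # predicted outcome under d
    return np.mean(c_d / pi_d * y - (c_d - pi_d) / pi_d * mu_d)

print(aiptw_value(lambda h: (h < 0.5).astype(int), h, a, y))
```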
MSMs, originally proposed for estimating the effect of static treatment regimes (Robins, 2000), provide a powerful alternative to SNMMs for describing the causal effect of a treatment (hence "structural"), and pertain to population-average effects ("marginal" over baseline and time-varying covariates and/or intermediate outcomes). Differently from the conditional approach of SNMMs, which models the causal effect of a final blip as a function of the entire time-varying history (thus conditioning on it), the marginal approach of MSMs assumes models for the expectation of a potential outcome under a specified (possibly unobserved) DTR $d$, marginalizing over the covariate history, $V_d = E_d[Y] = E[Y_d]$, or alternatively as a function of the baseline covariates $X_0$ only, i.e., $V_d(X_0) = E_d[Y \mid X_0] = E[Y_d \mid X_0]$. Most often, $V_d$ is specified as a linear combination of the components of $d$, e.g., $E[Y_d] = f(d; \theta) = \alpha + \theta d$, with $d = (d_0, \ldots, d_T) = (a_0, \ldots, a_T)$ the full treatment history, $Y_d$ the potential outcome that the subject would have observed under $d$, and $\theta$ a set of parameters. More recently, however, more flexible, spline-based models have been considered (Xiao et al., 2014). Of the different methods that have been proposed to estimate MSMs, or their parameters $\theta$, including maximum likelihood (Daniel et al., 2013) and targeted maximum likelihood estimation (Rosenblum and Van Der Laan, 2010), IPTW (Robins, 2000; Neugebauer et al., 2012) is the most commonly used. As the name indicates, IPTW estimation attempts to control for confounding by assigning each participant a weight. The basic form of this weight for subject $i$ at time $t$ is
$$w^{\pi}_{t,i} = \frac{1}{\prod_{\tau=0}^{t} \pi_\tau(A_{\tau,i} \mid H_{\tau,i})},$$
where $\pi_\tau(a \mid h) = P(A_{\tau,i} = a \mid H_{\tau,i} = h)$, so that the denominator is the probability that the subject received the particular treatment history they were observed to receive up to time $t$, given the prior observed treatment and covariate histories. Applying the terminal weights $w_{T,i}$ to each subject in the sample results in a pseudo-population in which treatment is no longer affected by past covariates, breaking the confounding; crucially, the causal effect remains unchanged. The parameters of the MSM then coincide with those of the re-weighted observational marginal model, which may be estimated using standard methods on the re-weighted data. The resulting estimates are consistent under correct specification of the MSM and non-zero denominators. Overall, MSM estimation is typically performed in two stages: in the first stage, the treatment weights are calculated; in the second stage, the outcome model is fit. An extensive literature has considered extensions of the standard OWL estimator; we report these in TABLE B.6. However, the only two works that considered the estimation of DTRs, and thus extended the OWL estimator to a multiple-stage setting, are Zhao et al. (2015) and Liu et al. (2018). Zhao et al. (2015) proposed the Backward Outcome Weighted Learning (BOWL) and Simultaneous Outcome Weighted Learning (SOWL) procedures. In the first approach, the stage-$t$ estimator, which we denote by $\hat{f}^*_{B,t}$, is obtained recursively; here $(\hat{d}^*_{t+1}, \ldots, \hat{d}^*_T)$ are obtained prior to stage $t$, and the stage-$T$ estimator does not account for treatments followed afterwards, i.e., $\prod_{\tau=T+1}^{T} \mathbb{1}[A_\tau = \hat{d}^*_\tau(H_\tau)] \doteq 1$. In the second approach, a simultaneous estimation is performed, in which $\psi(x_1, x_2) \doteq \min(x_1 - 1, x_2 - 1, 0) + 1$ is used as a concave surrogate for the product of two indicator functions. Even though in numerical examples both BOWL and SOWL have demonstrated superior performance relative to existing direct methods, significant information loss is registered as $t$ decreases. To overcome this problem, an augmented version integrating OWL and Q-functions is proposed in Liu et al. (2018). Denoting a pseudo-outcome by $\tilde{Y}_t \doteq Y_t + \hat{Q}_{t+1} - \hat{s}_t(H_t)$, their estimator is obtained with $\hat{s}_t(H_t)$ estimated via a least squares regression that minimizes $\mathbb{P}_N[Y_t + \hat{Q}_{t+1} - s_t(H_t)]^2$, and with a surrogate function $\phi$ analogous to the one previously defined.
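To illustrate the outcome-weighted classification idea that underlies OWL and its backward (BOWL) extension, the following is a minimal single-stage sketch in which the weighted 0-1 loss is replaced by a weighted SVM hinge loss; the simulated data, the shift that makes outcomes positive, and the use of scikit-learn's SVC are illustrative choices, not the estimators of Zhao et al. (2015) or Liu et al. (2018).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 1000

# Single-stage simulated data: randomized binary treatment coded in {-1, +1}.
x = rng.normal(size=(n, 2))
a = rng.choice([-1, 1], size=n)
pi = np.full(n, 0.5)                          # known randomization probabilities
y = 1 + x[:, 0] + a * (x[:, 0] - x[:, 1]) + rng.normal(size=n)

# OWL reduces value maximization to weighted classification of A on X,
# with weights Y / pi; outcomes are shifted to be positive (a common device).
w = (y - y.min() + 0.1) / pi

clf = SVC(kernel="linear", C=1.0)
clf.fit(x, a, sample_weight=w)

# Estimated rule: the predicted "class" is the recommended treatment.
x_new = np.array([[1.0, -0.5], [-1.0, 1.0]])
print(clf.predict(x_new))
```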
Thompson sampling for contextual bandits with linear payoffs mhealth app using machine learning to increase physical activity in diabetes and depression: clinical trial protocol for the diamante study Drug scheduling of cancer chemotherapy based on natural actor-critic approach April: Active preference learning-based reinforcement learning Introduction to smart designs for the development of adaptive interventions: with application to weight loss research Optimal dynamic regimes: presenting a case for predictive inference Modeling medical records of diabetes using markov decision processes Behavioral modeling in weight loss interventions Deep-treat: Learning optimal personalized treatments from observational data using neural networks A comparison of direct and model-based reinforcement learning Using confidence bounds for exploitation-exploration trade-offs Finite-time analysis of the multiarmed bandit problem The nonstochastic multiarmed bandit problem Engaging users in the design of an mhealth, text message-based intervention to increase physical activity at a safety-net health care system Decision theory: An introduction to dynamic programming and sequential decisions Notifications to improve engagement with an alcohol reduction app: Protocol for a micro-randomized trial Dynamic Programming Feasibility, acceptability, and preliminary efficacy of a smartphone intervention for schizophrenia. Schizophrenia bulletin Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability) Reinforcement learning and optimal control A-learning for approximate planning Assessing time-varying causal effect moderation in mobile health Survey on applications of multi-armed and contextual bandits Statistical modeling: The two cultures (with comments and a rejoinder by the author) Regret analysis of stochastic and nonstochastic multi-armed bandit problems Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning Incorporating patient preferences into estimation of optimal individualized treatment rules Stochastic bandits with delaydependent payoffs A study of non-regularity in dynamic treatment regimes and some design considerations for multicomponent interventions Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical methods in medical research Dynamic treatment regimes. 
Annual review of statistics and its application Bias correction and confidence intervals for fitted q-iteration An empirical evaluation of thompson sampling Estimating individualized treatment rules for ordinal treatments Preference-based policy iteration: Leveraging preference learning for reinforcement learning Sequential multiple assignment randomized trial (smart) with adaptive randomization for quality improvement in depression treatment program Contextual bandits with linear payoff functions Q-learning: Theory and applications A new initiative on precision medicine Design, and evaluation of adaptive preventive interventions Comparison of a phased experimental approach and a single randomized clinical trial for developing multicomponent behavioral interventions A conceptual framework for adaptive preventive interventions Activity sensing in the wild: a field trial of ubifit garden Dealing with limited overlap in estimation of average treatment effects Methods for dealing with time-dependent confounding An actorcritic based controller for glucose regulation in type 1 diabetes. Computer methods and programs in biomedicine Efficient design and inference for multistage randomized trials of individualized treatment policies Efficient inference without trading-off regret in bandits: An allocation probability test for thompson sampling Machine learning in medicine Online multi-armed bandits with adaptive inference Balanced linear contextual bandits Habituation: effects of regular and stochastic stimulation Thompson sampling with the online bootstrap Bootstrap thompson sampling and sequential decision problems in the behavioral sciences Bayesian inference and the parametric bootstrap A hierarchical model of approach and avoidance achievement motivation Habituation as a determinant of human food intake. Psychological review Clinical data based optimal sti strategies for hiv: a reinforcement learning approach Q-learning residual analysis: application to the effectiveness of sequences of antipsychotic medications for patients with schizophrenia. 
Statistics in medicine Constructing dynamic treatment regimes over indefinite time horizons Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems A smoothed q-learning algorithm for estimating optimal dynamic treatment regimes Adaptive designs for clinical trials of drugs and biologics: Guidance for industry Adaptive learning algorithms to optimize mobile applications for behavioral health: guidelines for design decisions Daily Motivational Text Messages to Promote Physical Activity in University Students: Results From a Microrandomized Trial Parametric bandits: The generalized linear case Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss Robust outcome weighted learning for optimal individualized treatment rules Preference-based reinforcement learning: a formal framework and a policy iteration algorithm The development of drink less: an alcohol reduction smartphone app for excessive drinkers Q-learning with censored data Adaptive q-learning Return of the jitai: applying a just-in-time adaptive intervention framework to the development of m-health solutions for addictive behaviors Deep learning Guidelines for reinforcement learning in healthcare Action centered contextual bandits A survey of actor-critic reinforcement learning: Standard and natural policy gradients Direct and indirect reinforcement learning A smartphone application to support recovery from alcoholism: a randomized clinical trial Confidence intervals for policy evaluation in adaptive experiments Last observation carried forward versus mixed models in the analysis of psychiatric clinical trials A systematic review of just-in-time adaptive interventions (jitais) to promote physical activity Reinforcement learning based control of tumor growth with chemotherapy The elements of statistical learning: data mining, inference, and prediction Ecological momentary interventions: incorporating mobile technology into psychosocial and health behaviour treatments DynTxRegime: Methods for Estimating Optimal Dynamic Treatment Regimes Using reinforcement learning to personalize dosing strategies in a simulated cancer trial with high dimensional data Mhealth: Emerging mobile health systems Convergence of stochastic iterative dynamic programming algorithms Multi-objective optimization of radiotherapy: distributed q-learning and agent-based simulation Simulation-based optimization of radiotherapy: Agentbased modeling and reinforcement learning Mimic-iii, a freely accessible critical care database Deep reinforcement learning in medicine Communication interventions for minimally verbal children with autism: A sequential multiple assignment randomized trial Contextual multi-armed bandit algorithm for semiparametric reward model Using wearable sensors and real time inference to understand human recall of routine activities Microrandomized trials: An experimental design for developing just-in-time adaptive interventions Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of heartsteps A markovian model for hospital admission scheduling Precision medicine. 
Annual review of statistics and its application Adaptive treatment strategies in practice: planning trials and analyzing data for personalized medicine Contextual gaussian process bandit optimization Semiparametric contextual bandits Mobile health technology evaluation: the mhealth evidence workshop Interactive model building for q-learning Set-valued dynamic treatment regimes for competing outcomes Dynamic treatment regimes: Technical challenges and applications Statistical inference in dynamic treatment regimes Tree-based methods for individualized treatment regimes Adaptive treatment allocation and the multi-armed bandit problem Random-effects models for longitudinal data Bandit algorithms A design for testing clinical strategies: biased adaptive within-subject randomization Dynamic treatment regimes: practical design considerations An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention A" smart" design for building individualized treatment sequences An actor-critic contextual bandit algorithm for personalized mobile health interventions Offline reinforcement learning: Tutorial, review, and perspectives on open problems A contextualbandit approach to personalized news article recommendation Provably optimal algorithms for generalized linear contextual bandits Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity Microrandomized trials in mhealth Sample size calculations for micro-randomized trials in mhealth Diagnostic inferencing via improving clinical concept extraction with deep reinforcement learning: A preliminary study iqlearn: Interactive q-learning in r Interactive q-learning for quantiles Deep reinforcement learning for dynamic treatment regimes on medical registry data Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens Linear fitted-q iteration with multiple reward functions Multi-objective markov decision processes for data-driven decision support Estimating dynamic treatment regimes in mobile health using v-learning Super-learning of an optimal dynamic treatment rule Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials Mediation analysis Optimizing drug therapy with reinforcement learning: The case of anemia management Ridge regression in practice Spline regression models A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients Human-level control through deep reinforcement learning Foundations of machine learning The optimal dynamic treatment rule superlearner: Considerations, performance, and application Estimating optimal dynamic regimes: Correcting bias under the null Demystifying optimal dynamic treatment regimes Reinforcement learning for closed-loop propofol anesthesia: a study in human volunteers Optimal dynamic treatment regimes An experimental design for the development of adaptive treatment strategies A generalization error for q-learning Customizing treatment to the patient: Adaptive treatment strategies. Drug and alcohol dependence A batch, off-policy, actor-critic algorithm for optimizing the average reward Developing adaptive treatment strategies in substance abuse research Marginal mean models for dynamic regimes A bayesian machine learning approach for optimizing dynamic treatment regimes An introduction to adaptive interventions and smart designs in education. 
ncser 2020-001 Building health behavior models to guide the development of just-in-time adaptive interventions: A pragmatic framework Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support Delivering "just-in-time" smoking cessation support via mobile phones: current knowledge and future directions Dynamic marginal structural modeling to evaluate the comparative effectiveness of more or less aggressive treatment intensification strategies in adults with type 2 diabetes Approximate bayesian inference with the weighted likelihood bootstrap On the application of probability theory to agricultural experiments. essay on principles. section 9 Algorithms for inverse reinforcement learning Reinforcement-learning optimal control for type-1 diabetes Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part i: main content Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment Adaptive designs in clinical trials: why use them, and how to run and report them Qualitative behavior of a family of delay-differential models of the glucoseinsulin system Combining kernel and model based learning for hiv therapy selection Poptherapy: Coping with stress through pop-culture A text message-based intervention for weight loss: randomized controlled trial Causality: Models, Reasoning and Inference Effects of methyphenidate and expectancy on children with adhd: Behavior, academic performance, and attributions in a summer treatment program and regular classroom settings Smart: study protocol for a sequential multiple assignment randomized controlled trial to optimize weight loss management Recovering bandits Constructing evidence-based treatment strategies using methods from computer science Markov decision processes: discrete stochastic dynamic programming Performance guarantees for individualized treatment rules Linear mixed models with endogenous covariates: modeling sequential treatment effects with application to a mobile health study. Statistical science: a Mybehavior: automatic personalized health feedback from user behaviors and preferences using smartphones Deep reinforcement learning for sepsis treatment Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach Machine learning in medicine Hierarchical linear models: Applications and data analysis methods Dietary variety, energy regulation, and obesity Some aspects of the sequential design of experiments A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect Estimation of the time-dependent accelerated failure time model in the presence of confounding factors The analysis of randomized and non-randomized aids treatment trials using a new approach to causal inference in longitudinal studies. Health service research methodology: a focus on AIDS Correcting for non-compliance in randomized trials using structural nested mean models Latent variable modeling and applications to causality. 
Causal inference from complex longitudinal data Marginal structural models versus structural nested models as tools for causal inference Optimal structural nested models for optimal sequential decisions Estimation of regression coefficients when some regressors are not always observed Randomization: The forgotten component of the randomized clinical trial Targeted maximum likelihood estimation of the parameter of a marginal structural model Estimating causal effects of treatments in randomized and nonrandomized studies Bayesian inference for causal effects: The role of randomization. The Annals of statistics Randomization analysis of experimental data: The fisher randomization test comment The bayesian bootstrap. The annals of statistics A tutorial on thompson sampling Qand a-learning methods for estimating optimal dynamic treatment regimes Informing sequential clinical decisionmaking through reinforcement learning: an empirical study. Machine learning A multiple imputation strategy for sequential multiple assignment randomized trials Mastering the game of go without human knowledge Critical analysis of big data challenges and analytical methods Penalized q-learning for dynamic treatment regimens Restricted sub-tree learning to estimate an optimal dynamic treatment regime using observational data Causation, prediction, and search Gaussian process optimization in the bandit setting: No regret and experimental design Information-theoretic regret bounds for gaussian process optimization in the bandit setting Web-based smoking-cessation programs: results of a randomized trial Statistical reinforcement learning: modern machine learning approaches Stochastic tree search for estimating optimal dynamic treatment regimes Reinforcement learning: An introduction An overview of clinical decision support systems: benefits, risks, and strategies for success Synthesis lectures on artificial intelligence and machine learning Adaptive contrast weighted learning for multi-stage multi-treatment decision-making Tree-based reinforcement learning for estimating optimal dynamic treatment regimes From ads to interventions: Contextual bandits in mobile health Evaluating multiple treatment courses in clinical trials Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring On the likelihood that one unknown probability exceeds another in view of the evidence of two samples Control systems engineering for understanding and optimizing smoking cessation interventions Intelligentpooling: Practical thompson sampling for mhealth. Machine learning Deep reinforcement learning for automated radiation adaptation in lung cancer Dynamic Treatment Regimes: Statistical Methods for Precision Medicine Asynchronous stochastic approximation and q-learning (sequential) importance sampling bandits Finite-time analysis of kernelised contextual bandits Toward a persuasive mobile application to reduce sedentary behavior. Personal and ubiquitous computing Super learner. Statistical applications in genetics and molecular biology Reinforcement learning and markov decision processes Structural nested models and g-estimation: the partially realized promise Machine learning in critical care: state-of-the-art and a sepsis case study Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. 
Statistical science: a review journal of the Institute of Mathematical Statistics Reinforcement learning in models of adaptive medical treatment strategies Informing the dosing of interventions in randomized trials Improving chronic illness care: translating evidence into action Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data DTRreg: DTR Estimation and Inference via G-Estimation, Dynamic WOLS, Q-Learning, and Dynamic Weighted Survival Modeling (DWSurv) Doubly-robust dynamic treatment regimen estimation via weighted least squares Adaptive randomization in a two-stage sequential multiple assignment randomized trial Evaluation of viable dynamic treatment regimes in a sequentially randomized trial of advanced prostate cancer Adversarial cooperative imitation learning for dynamic treatment regimes Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation Learning from delayed rewards. King's College The pleasures of uncertainty: prolonging positive moods in ways people do not anticipate Flexible marginal structural models for estimating the cumulative effect of a time-dependent treatment on the hazard: reassessing the cardiovascular risks of didanosine treatment in the swiss hiv cohort study qLearn: Estimation and inference for Q-learning Bayesian nonparametric estimation for dynamic treatment regimes with sequential transition times Latent-state models for precision medicine Agentbased simulation for blood glucose control in diabetic patients Reinforcement learning with actionderived rewards for chemotherapy and clinical trial dosing regimen selection Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system Reinforcement learning in healthcare: A survey Deep inverse reinforcement learning for sepsis treatment Bayesian inference for dynamic treatment regimes: Mobility, equity, and efficiency in student tracking Estimating optimal treatment regimes from a classification perspective A robust method for estimating optimal treatment regimes Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions C-learning: A new classification framework to estimate optimal dynamic treatment regimes Multicategory outcome weighted margin-based learning for estimating individualized treatment rules Near-optimal reinforcement learning in dynamic treatment regimes Statistical inference with m-estimators on adaptively collected data Inference for batched bandits Reinforcement learning design for cancer clinical trials Estimating individualized treatment rules using outcome weighted learning Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer New statistical learning methods for estimating optimal dynamic treatment regimes Neural contextual bandits with upper confidence bound-based exploration Personalizing mobile fitness apps using reinforcement learning Optimal dynamic treatment regime estimation using information extraction from unstructured clinical text Estimating optimal infinite horizon dynamic treatment regimes via pt-learning Residual weighted learning for estimating individualized treatment rules Robust actor-critic contextual bandit for mobile health (mhealth) interventions Proper inference for value function in high-dimensional q-learning for dynamic treatment regimes Big data: Challenges and opportunities Table C, we characterize existing studies for developing DTRs 
in cancer diseases. An increased number of real-world studies can be found in chronic diseases other than cancer, with diabetes and mental health appearing to be more fertile areas. However, with some exceptions, these studies used real data only for evaluating the proposed RL method, thus only as an illustrative example. In between pure simulations and real data, there is also an intermediate line of research for DTR estimation that used real data, e.g., to build a simulator environment.

TABLE B.6: Statistical methods that extended the standard OWL estimator for developing optimal treatment regimes.
- BOWL + SOWL: extend OWL from a single stage to general T stages, with T < ∞. The authors proposed two methods: one performs an iterative backward OWL (BOWL) estimation, the other a simultaneous OWL (SOWL) estimation.
- AOL: extends OWL to negative outcomes and considers multiple stages. The authors proposed an augmented version of the OWL weights (AOL) integrating OWL and Q-functions; the robust augmentation, making use of predicted pseudo-outcomes from regression models for the Q-functions, reduces the variability of the weights and improves estimation accuracy.
- RWL: the authors proposed a general framework, called Residual Weighted Learning, for improving the finite-sample performance, in which they employ a smoothed ramp loss and derive outcome residuals with a regression model.
- GOWL: the authors generalize OWL by using a modified loss function and a reformulation of the objective function in the standard OWL.
- Margin-based learning: the authors use sequential binary methods, proposing a margin-based learning approach built upon the large-margin unified machine (LUM) loss.
- ROWL: the authors propose a robust OWL, based on an angle-based classification structure designed for multicategory classification problems, and a new family of robust loss functions to build more stable DTRs.

ACKNOWLEDGMENTS
The authors would like to thank Eric Laber for the constructive feedback received. Nina Deliu acknowledges support from the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). Joseph J. Williams was supported by the Office of Naval Research (N00014-21-1-2576) and the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019-06968). Bibhas Chakraborty would like to acknowledge support from the start-up funding and the Khoo Bridge Funding award (Duke-NUS-KBrFA/2021/0040) from the Duke-NUS Medical School, Singapore.

APPENDIX A: GOOGLE SCHOLAR SEARCH (FIG. 1)
The volume of DTRs and JITAIs literature was identified on Google Scholar with the following keywords, respectively:
• "dynamic treatment regime" OR "dynamic treatment regimes" OR "dynamic treatment regimen" OR "dynamic treatment regimens";
• "just in time adaptive intervention" OR "just in time adaptive interventions".
Returned items contain both published articles and grey literature (e.g., preprints). Citations and patents were excluded from the literature search. A minimal screening was performed to evaluate the consistency of the identified items in relation to the searched terms and the respective online publication date. Items that were incorrectly returned for a certain date were removed from that date group.

RL algorithms that do not need an underlying model for the environment are known as temporal-difference (TD) learning methods, and they constitute the core of modern RL, with Q-learning (Watkins, 1989) representing one of the most popular (off-policy) TD approaches.
The fundamental component of TD learning is its incremental implementation, which requires less memory for the estimates and less computation. The general idea is to update an estimate based in part on other previously learned estimates, in the following way:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\big[\text{Target} - \text{OldEstimate}\big].$$
Based on this form, TD methods can be naturally implemented in an online, fully incremental fashion, without waiting until the end of an episode. When the agent learns about a policy $d$ from experience sampled from the same policy $\pi$, i.e., $d = \pi$, learning is on-policy; when only samples from another policy $d \neq \pi$ are available, one can still use those samples to learn about the policy $\pi$, and we call this off-policy learning. One of the early breakthroughs in RL was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). The general idea is that, for each time step $t \in [0, T]$, a new estimate is obtained based in part on an old, previously learned estimate:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t\big[Y_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\big].$$
The constant $\alpha_t$ determines to what extent the newly acquired information will override the old information, that is, how fast learning takes place: a factor of 0 will make the learner not learn anything, while a factor of 1 will make the learner fully update based on the most recent information. The discount factor $\gamma$ balances a learner's immediate and future rewards, and in a finite-horizon problem it is generally set to one. The original version of this approach is known as tabular Q-learning, and it is based on storing the Q-value for each possible state and action in a lookup table and choosing the one with the highest value. Under appropriate and rigorous assumptions, $Q_t$ has been shown to converge to the optimal Q-function $Q^*_t$ with probability 1 (Watkins, 1989; Jaakkola et al., 1994; Tsitsiklis, 1994). However, this simple approach is practical only in a small number of problems, because it can require many thousands of training iterations to converge even in modest-sized problems. In addition, it represents value functions in arrays, or tables, indexed by each state and action; large state spaces therefore lead not only to memory issues for storing large tables, but also to time problems in filling them accurately. Several Q-learning function approximators have been proposed in the literature, including linear regression, decision trees, and neural networks. As Q-functions are conditional expectations, the first natural approach to modeling them is through regression models. Letting $\theta_t \doteq (\beta_t, \psi_t)$, Chakraborty and Moodie (2013) proposed stage-specific optimal Q-functions to be parametrized as
$$Q_t(H_t, A_t; \theta_t) = \beta_t^\top H_{t0} + (\psi_t^\top H_{t1})\,A_t, \qquad (27)$$
where $H_{t0}$ and $H_{t1}$ are two (possibly different) vector summaries of the history $H_t$, with $H_{t0}$ denoting the "main effect of history" and $H_{t1}$ denoting the "treatment effect of history". The collections of variables in $H_{t0}$ are often termed predictive, while those in $H_{t1}$ are said to be prescriptive or tailoring variables. Parameters $\hat\theta_t \doteq (\hat\beta_t, \hat\psi_t)$ are obtained by solving suitable estimating equations, such as ordinary least squares (OLS) or weighted least squares (WLS). Given a sample of trajectories, WLS (whose choice might be dictated by heteroscedastic errors) estimates $\hat\theta_t$ by solving weighted least squares estimating equations in which $\Sigma_t$ is a working variance model; taking $\Sigma_t$ to be a constant yields the OLS estimator.
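As a concrete illustration of the tabular Q-learning update described earlier in this section, the sketch below runs Q-learning on a toy chain environment with an epsilon-greedy behaviour policy; the environment, the exploration scheme, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.95, 0.1

# Toy chain MDP: action 1 moves right (reward 1 at the last state), action 0 moves left.
def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))            # lookup table of Q-values

for episode in range(500):
    s = 0
    for t in range(20):
        # Epsilon-greedy behaviour policy (Q-learning is off-policy, so this is allowed).
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Tabular Q-learning update: move the old estimate toward the bootstrapped target.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # argmax along each row gives the learned greedy policy
```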
As noticed first by Robins (2004) for G-estimation, and then by Chakraborty et al. (2010) for Q-learning, the treatment-effect parameters at any stage prior to the last can be non-regular under certain longitudinal distributions of the data. Q-learning, for instance, involves modeling non-smooth, non-monotone functions of the data, which complicates both estimation and inference. Under the specific modeling assumption of Eq. (27), due to the argmax operator involved in Q-learning, $\hat\psi_t$ is a non-regular estimator, and inferential problems arise when $\psi_t^\top H_{t1}$ is close to zero, leading to non-differentiability at that point. In DTRs, this may occur, for instance, when two or more treatments produce (nearly) the same mean optimal outcome. To solve this issue, Chakraborty et al. (2010), adapting previous work in the context of G-estimation (Moodie and Richardson, 2010), proposed two alternative ways of shrinking or thresholding values of $\hat\psi_t^\top H_{t1}$ near zero. In a similar spirit, Song et al. (2015) and Goldberg et al. (2013) proposed minimizing a penalized version of the objective in the first step of Q-learning, where the penalty is a function of the treatment-effect terms in (27), e.g., of the form $K(x/\alpha)$, with $\alpha > 0$ a smoothing parameter and $K(\cdot)$ a kernel function that admits a probability density function. Another proposal for conducting inference on the estimated Q-function parameters is a general method for bootstrapping under non-regularity, namely the m-out-of-n bootstrap. Subsequently, Laber et al. (2014a) derived a new interactive Q-learning method, in which the maximization step is delayed by adding an additional step between (14) and (15); this enables all modeling to be performed before the non-smooth, non-monotone transformation.

Contrast-based A-learning. We define the optimal contrast, or C-function, $C^*_t(H_t, A_t)$ at time $t$ as the expected difference in potential outcomes when using a reference regime $d^{ref} = \{d^*_\tau\}_{\tau = t+1, \ldots, T}$; it is essentially the optimal blip-to-reference given in (16) with $g$ the identity function. For simplicity, we consider here only the case of two treatment options coded as 0 and 1, i.e., $\mathcal{A}_t = \{0, 1\}$ for all $t \in [0, T]$, and we let the standard or placebo "zero treatment" be the reference treatment, i.e., $d^{ref}_t = 0$, leading to an equivalence between (16) and (17). To determine an optimal DTR, we begin by defining an approximation space for the contrast functions, e.g., $\mathcal{C}_t \doteq \{C_t(h_t, a_t; \psi_t) : \psi_t \in \Psi_t\}$, with $\Psi_t$ a subset of the Euclidean space. Then, in a backward fashion, starting from $t = T$, and denoting the propensity of receiving treatment $A_T = 1$ in the observed data by $\pi_T(A_T \mid h_T) = P(A_T = 1 \mid H_T = h_T)$, we obtain a consistent and asymptotically normal estimator for $\psi_T$ by G-estimation (Robins, 2004), i.e., by solving estimating equations of the form (28), for arbitrary functions $\lambda_T(H_T, A_{Ti})$ of the same dimension as $\psi_T$ and arbitrary functions $\theta_T(H_T, A_{Ti})$. To implement estimation of $\psi_T$ via (28), one may adopt parametric models for all the unknown functions, including $\pi_T(A_{Ti} \mid H_{Ti})$ if the randomization probabilities are not known, i.e., in observational studies. Under certain conditions, Schulte et al. (2014) derive an optimal choice for $\lambda_T(H_{Ti}, A_{Ti}; \psi_T)$. Once we get the estimates $\hat\psi_T$, the contrast-based A-learning algorithm iteratively proceeds by estimating $\hat\psi_{T-1}, \hat\psi_{T-2}, \ldots, \hat\psi_0$. Finally, in this two-treatment setting, the optimal DTR is given by the regime that selects, at each stage, treatment 1 whenever the estimated C-function is positive, i.e., $d^{opt}_t(h_t) = \mathbb{1}\{C_t(h_t, 1; \hat\psi_t) > 0\}$. Notice that, as the additional models specified in (28) are only adjuncts to estimating $\psi_T$, as long as at least one of these models is correctly specified, (28) will provide a consistent estimator for $\psi_T$ (this property is called double robustness). In contrast, Q-learning requires correct specification of all Q-functions. An intermediate approach between G-estimation and Q-learning, which affords double robustness to model misspecification and requires fewer computational skills compared to the former, was later introduced by Wallace and Moodie (2015) as dynamic weighted ordinary least squares (dWOLS).
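The following is a minimal single-stage sketch of the dWOLS idea: a weighted least squares regression of the outcome on main effects and treatment-by-tailoring interactions, with weights |A − π̂(H)|, which is one valid dWOLS weight choice; the simulated data and working models are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(8)
n = 2000

# Observational single-stage data: treatment depends on the covariate (confounding).
h = rng.normal(size=n)
p_a = 1 / (1 + np.exp(-0.8 * h))
a = rng.binomial(1, p_a)
y = 1 + 2 * h + a * (1.0 - 1.5 * h) + rng.normal(size=n)   # blip: psi0 + psi1 * h

# Step 1: estimate the propensity score with a logistic working model.
H = h.reshape(-1, 1)
pi_hat = LogisticRegression().fit(H, a).predict_proba(H)[:, 1]

# Step 2: weighted OLS of Y on main effects and treatment-by-covariate interactions,
# with weights |A - pi_hat(H)|.
w = np.abs(a - pi_hat)
X = np.column_stack([h, a, a * h])
fit = LinearRegression().fit(X, y, sample_weight=w)

psi0, psi1 = fit.coef_[1], fit.coef_[2]        # estimated blip (treatment-effect) parameters
rule = lambda h_new: (psi0 + psi1 * h_new > 0).astype(int)
print(psi0, psi1, rule(np.array([-1.0, 0.0, 1.0])))
```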
Regret-based A-learning. Rather than modeling a contrast defined as the expected difference in outcome when using a reference regime $d^{ref}_t$ instead of $a_t$ at time $t$, Murphy (2003) and Blatt et al. (2004) proposed to model a regret function similar to the one introduced in (18). Denoting it by $\mu^*_t$, it is defined as
$$\mu^*_t(H_t, A_t) \doteq \max_{a_t} Q^*_t(H_t, a_t) - Q^*_t(H_t, A_t).$$
Here the "advantage", or regret, is the gain (or loss) in performance obtained by following action $A_t$ at time $t$ and thereafter the optimal regime $d^*_{t+1}$, as compared to following the optimal policy $d^*_t$ from time $t$ onwards.

Other studies rely on the freely accessible Medical Information Mart for Intensive Care (MIMIC-III) database (Johnson et al., 2016), and mainly use a DL framework for approximating the Q-learning functions. A key motivation for using DL is its higher flexibility and adaptability to high-dimensional action and state spaces compared to standard RL methods, its superior capability in modeling the real-life complexity of heterogeneous disease progression and treatment choices, and its automatic feature extraction directly from the input data. As in the previous case, data are generally used for illustrative purposes.
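The studies just mentioned use deep architectures on large observational databases; as a much smaller-scale illustration of approximating Q-functions from offline transition data, the sketch below runs a fitted-Q-iteration-style loop with a small multilayer perceptron from scikit-learn; the simulated transitions, the network size, and the number of iterations are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
n, gamma, n_actions = 2000, 0.9, 2

# Offline batch of one-step transitions (s, a, r, s'), simulated for illustration.
s = rng.normal(size=(n, 3))
a = rng.integers(n_actions, size=n)
r = s[:, 0] * (2 * a - 1) + rng.normal(scale=0.1, size=n)
s_next = 0.8 * s + rng.normal(scale=0.5, size=(n, 3))

def q_features(states, actions):
    """Concatenate state features with a one-hot action encoding."""
    onehot = np.eye(n_actions)[actions]
    return np.hstack([states, onehot])

# Fitted Q-iteration: repeatedly regress bootstrapped targets on (state, action) features.
q = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
targets = r.copy()
for it in range(5):
    q.fit(q_features(s, a), targets)
    # Bootstrapped target: r + gamma * max over a' of Q(s', a').
    q_next = np.column_stack([q.predict(q_features(s_next, np.full(n, act)))
                              for act in range(n_actions)])
    targets = r + gamma * q_next.max(axis=1)

greedy = np.argmax(np.column_stack([q.predict(q_features(s, np.full(n, act)))
                                    for act in range(n_actions)]), axis=1)
print(greedy[:10])   # estimated greedy actions for the first few observed states
```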