key: cord-0514513-bn9zl4pj authors: Zhang, Chao; Vanschoren, Joaquin; Wissen, Arlette van; Lakens, Daniel; Ruyter, Boris de; IJsselsteijn, Wijnand A. title: Theory-based Habit Modeling for Enhancing Behavior Prediction date: 2021-01-05 journal: nan DOI: nan sha: e20752744beccfd9c1afd03f4967a2e182602502 doc_id: 514513 cord_uid: bn9zl4pj

Psychological theories of habit posit that when a strong habit is formed through behavioral repetition, it can trigger behavior automatically in the same environment. Given the reciprocal relationship between habit and behavior, changing lifestyle behaviors (e.g., toothbrushing) is largely a task of breaking old habits and creating new and healthy ones. Thus, representing users' habit strengths can be very useful for behavior change support systems (BCSS), for example, to predict behavior or to decide when an intervention reaches its intended effect. However, habit strength is not directly observable and existing self-report measures are taxing for users. In this paper, building on recent computational models of habit formation, we propose a method that enables intelligent systems to compute habit strength based on observable behavior. The hypothesized advantage of using computed habit strength for behavior prediction was tested using data from two intervention studies, in which we trained participants to brush their teeth twice a day for three weeks and monitored their behaviors using accelerometers. Through hierarchical cross-validation, we found that for the task of predicting future brushing behavior, computed habit strength clearly outperformed self-reported habit strength (in both studies) and was also superior to models based on past behavior frequency (in the larger second study). Our findings provide initial support for our theory-based approach to modeling user habits and encourage the use of habit computation to deliver personalized and adaptive interventions.
Behavior change support systems (BCSS) are digital systems that support users in changing their behaviors in desirable ways by using various intervention techniques [1, 2], including education, persuasion, reminders, contingent rewards, and self-monitoring [3]. In many application domains where behaviors are repeated frequently, such as promoting healthy lifestyles, one of the challenges for successful change is the task of breaking bad old habits and forming healthy new habits [4, 5, 6]. Habitual behaviors are characterized as automatic responses triggered by cues in the environment (e.g., eating crisps when watching TV) or by goals activated in one's working memory (e.g., using a bike when commuting to work) [7, 8]. The lack of deliberation about behavioral consequences makes habitual behaviors persist even when a contradicting goal or intention is formed [9]. For example, in a press conference on curbing the spread of COVID-19, right after Mark Rutte, the Prime Minister of the Netherlands, told people not to shake hands, he welcomed a medical expert on stage and shook his hand. On the bright side, when a good habit is formed, it helps behavioral maintenance and prevents relapses. Modeling users' habits can potentially increase the effectiveness of BCSS. Although the term "habit" is intuitively understood by most people, it is important to clarify what we mean by "habit" in this paper. In the field of ubiquitous computing, modeling habits usually refers to the modeling of users' actual behaviors, i.e., detecting and recognizing recurrent behavioral patterns and routines [10, 11, 12], sometimes contingent on specific user contexts [13]. In contrast, based on psychological theories [7, 8, 14, 15, 16], we define habits as the cognitive associations between user behaviors and the triggering user contexts, thus separating habits from habitual behaviors themselves.
Habits are strengthened through context-dependent behavior repetition and in turn increase the probability that the behavior is performed in the same context. Our modeling approach harnesses this reciprocal causal relationship for behavior prediction and personalized intervention. Modeling the habit strength of a particular user behavior can benefit BCSS in at least two ways. First, assuming a causal effect of habit on behavior, knowing the habit strength can help a system predict a user's behavior more accurately. Accurate behavior prediction is the basis for personalizing interventions, for example, sending a reminder when the system predicts that the user is unlikely to perform the desirable behavior on their own. Second, it is widely acknowledged that reminders in many so-called "habit-formation" apps induce behavior repetition but hinder the formation of real habits, which are supposed to be connected to environmental cues [17, 18, 19]. Thus, representing habit strength as a cognitive state enables a system to distinguish genuine context-driven habitual behaviors from repeated behaviors that are simply prompted by digital systems. It also allows a system to decide when to withdraw proactive interventions on a specific behavior, knowing from the model that the user's behavior will likely be maintained by the strong habit alone. In psychology, habit strength is often measured by the Self-report Habit Index (SRHI) [20] or its behavioral automaticity sub-scale [21]. Although these measures could be implemented in a BCSS as daily questionnaires, doing so would place a heavy burden on users and might interfere with primary intervention techniques. Recently developed computational models provide a new approach to quantifying habit strength based on observable behavior and contexts [22, 23, 24, 25], but these models have not been extensively validated in real-world behavior change interventions.
In this paper, we evaluated one habit-learning model using two field intervention studies on dental hygiene behavior, testing whether computing habit strength contributes to more accurate behavior prediction compared with theory-free predictive models. Enhanced prediction performance would also empirically validate the model as a representation of users' habit strengths. In the remainder of the paper, we start with the theoretical background of our work, followed by the overall modeling and evaluation approach. Next, the data-collection method and results of the two field studies are presented. The paper concludes with a general discussion, including implications for designing more personalized BCSS. Habits are best understood in the context of human goal-directed behavior. To say a behavior is habitual implies that the behavior is performed merely because of its past repetition and instrumental value (e.g., shaking hands as a social norm), even though it may be at odds with one's current goals (e.g., to prevent the spread of a virus). The most powerful demonstrations of habitual behavior come from animal and human instrumental learning experiments [9], in which extensively repeated behavioral responses become insensitive to sudden changes in the immediate decision-making environment. Although a functional dissociation between habitual and goal-directed behavior is well-established [26], there remains a strong dispute between a value-based and a value-free view regarding the specific cognitive mechanisms underlying habit learning and its control over overt behavior. The value-based account conceptualizes habit learning as a form of model-free reinforcement learning [27, 28]. Through repeated choices, an organism learns the goal-satisfying values of different behavioral options and uses these "cached" action-values to make new decisions, without resorting to a model of its environment.
This contrasts with goal-directed learning, or model-based reinforcement learning, which informs decisions based on an up-to-date model of the environment. Theories in the value-based account also hypothesize that after extensive training, a cognitive system based on habit learning controls behavior thanks to its simplicity and efficiency. However, the value-based account conflicts with the more traditional value-free view of habit in psychology, which has its roots in Edward Thorndike's classic distinction between the "law of exercise" and the "law of effect" [29]. For the value-free account, a habit relates only to the repetition of a behavior (exercise), not to the outcomes of executing the behavior (effect). In modern terms, as a by-product of goal-directed learning, a habit is a learned cognitive association between a behavior and its triggering context or goal, strengthened by repeated behavior executions [8, 15, 16]. When the same context is encountered or the same goal is activated, this association immediately brings a representation of the behavior into one's working memory [25] or enhances the baseline preference signal of the behavior in decision-making [30, 31]. The present research follows the value-free account of habit learning and makes use of its computational models. Following the value-free view, four computational models have been proposed to account for the relationship between behavior repetition and habit strength [22, 23, 24, 25]. As these models were developed by researchers from very different fields, they have not previously been reviewed in the same context. While a detailed comparison between the models is beyond the scope of this paper, it suffices to say that they all follow the value-free account and were inspired by the Hebbian learning principle in neuroscience [32].
In a network of cognitive nodes representing behaviors and contextual cues, the link between a behavior and a cue is strengthened when the two nodes are activated at the same time (i.e., the behavior is performed in that particular context). Figure 1 shows the mathematical equations of these models and a simulation of how habit strength changes over time in a prototypical scenario with plausible parameter values for each model. Despite their differences, all models produce a similar pattern for the dynamics of habits: habit strength increases over time when the behavior is performed consistently, but the rate of growth decreases so that habit strength approaches a plateau. When the behavior is not performed, habit strength decays proportionally. These basic patterns are consistent with the empirical data of habit formation in a field study where participants reported their habit strengths using the SRHI [33]. The automaticity of habits in daily environments has been partly attributed to their facilitative role in memory processes [24, 25]. Unlike laboratory studies on habits (e.g., [9]), where behavioral options are presented to the participants, in daily life people have to recall behavioral options before they can choose among them [34]. When a habit is strong, the learned association between a behavior and a context ensures that when the same context is encountered, the behavior is activated in one's working memory as a choice option [8]. According to [24], habitual behaviors are hard to control because habitual options are recalled and evaluated first. When habitual options are sufficiently satisfying, people may act on them immediately without trying to recall more options. When a habit is still weak, the newly learned behavior can often be "forgotten" in the moment of generating behavioral options. Thus, to ensure early behavioral repetitions, memory aids such as digital reminders are often needed.
In addition to modeling habit formation, researchers have also proposed a computational model of how the memory accessibility of behavioral options changes over time [25]. Like any other memory process, the accessibility of a behavioral option decays gradually over time but can be restored upon receiving reminders or when the behavior is executed. Other unobservable factors, such as the mental rehearsal of an option [35], also influence accessibility, but their effects are absorbed into a decay parameter in [25]. We adopted Tobias's equation, which is defined formally in the next section. Based on the theories and computational models reviewed, we focus on two cognitive quantities that can be computed by a digital system. The quantity of primary interest is the habit strength of a target behavior for a user in a behavior change process. In principle, any of the four computational models reviewed above [22, 23, 24, 25] can be used for this computation, but we chose the equation in [22] because it is the only model that matches the empirical observation that habit growth is faster than habit decay [33]. The equation, governed by a habit decay parameter (HDP) and a habit gain parameter (HGP), implies that given a user's initial habit strength (HS 0 ), the habit strength at any subsequent time point (HS t ) can be computed as long as the past occurrences of the behavior (Beh) and its cues (Cue) are known. In an empirical study or a behavior change application, users can be asked to self-report their habit strengths at the beginning, and the self-reported values (scaled to [0, 1]) can be used as initial values. Both actual behavior and environmental cues can potentially be monitored by sensors in a BCSS.
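The exact update equation is given in [22]; as a minimal sketch of the dynamics described above (gain toward a plateau of 1 when the behavior occurs in its context, proportional decay otherwise), with illustrative rather than fitted parameter values, one might write:

```python
def update_habit_strength(hs, beh, cue, hgp=0.1, hdp=0.05):
    """One-day update of habit strength (HS in [0, 1]) - an illustrative
    sketch, not the exact functional form used in [22].

    hs  -- habit strength on the previous day (HS_{t-1})
    beh -- 1 if the behavior was performed, else 0
    cue -- 1 if the triggering context was encountered, else 0
    hgp -- habit gain parameter (HGP), assumed value
    hdp -- habit decay parameter (HDP), assumed value
    """
    if beh and cue:
        # Context-dependent repetition: grow toward the plateau at 1,
        # so the growth rate slows as HS approaches the ceiling.
        return hs + hgp * (1 - hs)
    # Behavior omitted: habit strength decays proportionally.
    return hs * (1 - hdp)

# Habit formation over three weeks of perfect compliance, starting from
# a (scaled) self-reported initial value HS_0.
hs = 0.1
for day in range(21):
    hs = update_habit_strength(hs, beh=1, cue=1)
```

With hgp > hdp, as in this sketch, growth is faster than decay, matching the empirical asymmetry reported in [33].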
In the current research, we make the simplifying assumption that users always perform the target behavior in the same context (i.e., participants in our studies always brushed their teeth in their own bathrooms and at similar times), so the variable Cue t is always 1. In addition to habit strength, the memory accessibility of a behavioral option can be computed using the equation in [25]. Accessibility (Acc) decays over time as a natural memory process, but can be enhanced by behavior executions (Beh) and external reminders (Rem). The equation is controlled by three free parameters: an accessibility decay parameter (ADP), an accessibility gain parameter for behavior execution (AGP beh ), and an accessibility gain parameter for reminders (AGP rem ). When a user is persuaded by a BCSS to learn a new behavior, the initial memory accessibility (Acc 0 ) of the target behavior can be assumed to be 1 (the maximum). Subsequent memory accessibility can be easily updated by monitoring actual behavior and the reminders sent by the digital system itself. For simplicity, any procedure used in our empirical studies (e.g., face-to-face meetings, e-mail communication, etc.) that reminded participants of the target behavior was assumed to restore memory accessibility by the same amount, controlled by the single parameter AGP rem . The primary goal of the current research is to evaluate the utility of computing habit strength and memory accessibility in a behavior prediction use case. In a behavior change intervention, predicting future behavior based on information already collected is often an important and challenging task. For example, when a user is prompted by a BCSS to brush teeth every morning, it is a meaningful task to predict whether the user will brush their teeth the next morning (also known as a 1-step forecast) based on everything the system knows about the user at that point.
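Tobias's accessibility equation is defined in [25]; the following is a hedged sketch of dynamics consistent with the description above (natural decay, restoration by behavior execution and reminders), with illustrative parameter values:

```python
def update_accessibility(acc, beh, rem, adp=0.2, agp_beh=0.3, agp_rem=0.3):
    """One-day update of memory accessibility (Acc in [0, 1]) - an
    illustrative sketch, not Tobias's exact equation.

    acc     -- accessibility on the previous day
    beh     -- 1 if the behavior was executed, else 0
    rem     -- 1 if a reminder was received, else 0
    adp     -- accessibility decay parameter (ADP), assumed value
    agp_beh -- gain from behavior execution (AGP beh), assumed value
    agp_rem -- gain from a reminder (AGP rem), assumed value
    """
    acc = acc * (1 - adp)             # natural memory decay
    if beh:
        acc += agp_beh * (1 - acc)    # restored by performing the behavior
    if rem:
        acc += agp_rem * (1 - acc)    # restored by an external reminder
    return acc

# Accessibility of a new behavior that is never performed: it stays high
# while daily reminders arrive in week 1, then decays once they stop.
acc = 1.0   # Acc_0 assumed maximal for a newly introduced target behavior
trace = []
for day in range(14):
    acc = update_accessibility(acc, beh=0, rem=1 if day < 7 else 0)
    trace.append(acc)
```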
A conventional approach to behavior prediction in psychology relies on self-reported behavioral determinants measured by periodic surveys (survey model, see Figure 2a), such as attitude, intention, and self-reported habit strength [36]. Another method is simply to use past behavior to predict future behavior, for example, by counting how many times the user brushed teeth in the morning prior to the to-be-predicted date (past-behavior model, see Figure 2b). Instead of these two approaches, the system can also compute habit strength and memory accessibility based on historical data (past behavior, reminders, etc.) and use the computed theoretical quantities to predict future behavior (theory-based model, see Figure 2c). Computing the theoretical quantities is useful if the theory-based model predicts future behavior more accurately than the past-behavior model and at least as accurately as the survey model, given that it bypasses the need to burden users with questions. Note that we focus on comparing the relative performance of the models rather than optimizing absolute performance. To fulfill the research goal, we conducted two intervention studies on dental health behavior in which participants were trained to brush their teeth twice a day for about three weeks. Participants' brushing behavior was monitored by sensors, and their attitude towards toothbrushing and self-reported habit strength were measured once a week. We chose to study toothbrushing behavior because of its relative simplicity, context stability (e.g., usually in the bathroom at home), and high occurrence frequency. Study 1 Forty healthy university students or young workers were recruited through a local participant database and personal networks. The main inclusion criterion was that they used to brush their teeth only once a day (or at most rarely brushed twice), which was checked through personal communication with the participants.
The sample consisted of 26 males and 14 females, and the average age was 24.48 (SD = 3.13, median = 24). Eight participants were randomly selected and awarded 25 euros. The study was reviewed and approved by an ethical review board at Eindhoven University of Technology. Study 2 Study 2 was conducted in collaboration with Philips Research. Seventy-nine adults were recruited through a recruitment agency contracted by Philips. A more lenient main criterion was used: participants either used to brush only once a day or usually brushed for less than two minutes per session. Other criteria included that they were between 18 and 60 years old, understood Dutch, and were manual toothbrush users. The eventual sample consisted of 41 females and 37 males (1 chose "other"), with ages between 20 and 63 years (mean = 39.63, median = 38, SD = 10.97). Most participants were healthy, except that one suffered from cystic fibrosis and one from narcolepsy. Participants were paid 80 euros by the recruitment agency. The study was reviewed and approved by the Internal Committee on Biomedical Experiments (ICBE) at Philips Research. Study 1 Participants were enrolled in a 4-week intervention program during which they were persuaded to change their oral health routine from brushing teeth once a day to brushing twice a day. The main outcome variable was whether they complied with the new target brushing behavior (i.e., brushing also in the morning or in the evening) on each day during the study period. At the beginning, a face-to-face meeting was held between the experimenter and each participant. During this meeting, participants were introduced to the study and the intervention, signed a consent form, and were given a sensor to be attached to their own toothbrush. After participants returned home, their toothbrushing behaviors were monitored by the sensors for 3 weeks, and at the end of the third week they returned the sensor to the experimenter.
Reminders for the target brushing behaviors were sent daily in the first week using a self-programmed mobile app, every other day in the second week, and were discontinued in the third and fourth weeks. At the end of each week, a short survey was sent using the same app to ask questions about attitude and habit strength. Study 2 Participants were enrolled in a multi-phase intervention program during which they were persuaded to develop an optimal oral health routine of two brushing sessions lasting at least 2 minutes each (or at least 4 minutes of brushing daily). The main outcome variable was whether or not they brushed their teeth twice a day. At the beginning, participants came to the lab in groups of 10-15 for an introduction session, in which the general study information and procedure were explained, but not the specific intervention. Also in this meeting, participants were offered new manual toothbrushes with sensors attached, and were asked to sign a consent form and to complete the first survey. After a baseline period of about 5-10 days, they were invited back to the lab individually for the intervention session. They were shown presentations about oral healthcare and were exposed to the intervention target of brushing twice a day for at least 4 minutes in total. During the lab session, physiological data from the participants were recorded for purposes unrelated to this paper (see [37]). The second and third surveys, with mostly identical questions, were completed by the participants before and after the lab session. After the lab session, participants returned home and were monitored for a follow-up period, leading to a total study duration of approximately 3 weeks. Two additional surveys were sent by e-mail in the middle and at the end of the follow-up period. Toothbrushing behavior Participants' toothbrushing behavior was measured by Axivity AX3 sensors attached to the lower end of their toothbrush grips (see Figure 3).
The Axivity AX3 sensor is a 3-axis accelerometer developed by Newcastle University specifically for scientific research on human movement [38]. Constrained by the memory space of the device, the sampling frequency was set at 50 Hz to ensure that three weeks of data could be stored. The sensitivity range for accelerations was set at ±8g. The sensor was waterproof, and a fully charged sensor could work for 3 weeks without recharging. Participants in both studies also self-reported on how many days of the previous week they brushed their teeth in the morning/evening (Study 1) or brushed their teeth twice a day for at least 2 minutes each time (Study 2). Habit strength Habit strength was measured using the 4-item Self-Report Behavioral Automaticity Index (SRBAI) with 7-point response scales [21]. It assessed behavioral automaticity by prompting participants to rate their agreement with descriptions of performing a target behavior (e.g., "Behavior X is something..."), including "I do automatically", "I do without having to consciously remember", "I do without thinking", and "I start doing before I realize I am doing it". The target behavior in Study 1 was "brushing teeth in the morning" or "brushing teeth in the evening", depending on which behavior was not performed by each participant before the study. In Study 2, because of the more lenient inclusion criterion, the behavior was phrased more generally as "brushing teeth twice a day and in total at least 4 minutes". Internal reliabilities of the SRBAI were very high in both Study 1 (Cronbach's α = 0.95) and Study 2 (Cronbach's α = 0.94). These items were translated into Dutch in Study 2. Attitude Attitude was measured using 7-point semantic differential scales typically used in studies that follow the Theory of Planned Behavior [39].
Four items were used in Study 1 (bad - good, useless - useful, harmful - beneficial, unpleasant - pleasant), while in Study 2 three more items were added (foolish - wise, unhealthy - healthy, difficult - easy). We also made a common distinction between instrumental attitude and affective attitude [25], because inter-item correlations and factor analysis clearly suggested that there were two separate factors. Instrumental attitude focuses on how a behavior satisfies instrumental goals, such as health benefits in the context of dental behaviors, while affective attitude taps more into the emotional aspects of the experience relating to the behavior (e.g., comfort of brushing, effort spent on brushing). The affective attitude score was based on a single item in Study 1 (unpleasant - pleasant) and the average score of two items in Study 2 (unpleasant - pleasant, difficult - easy). Internal reliabilities (Cronbach's α) for instrumental attitude were 0.94 and 0.93 in the two studies, while affective attitude also had a satisfactory internal reliability of 0.71 in Study 2. The attitude items were translated into Dutch in Study 2. Pre-processing was performed to transform the raw 3-axis accelerometer data into behavioral data at the day level (i.e., brushing twice or not on a specific day). The same procedure was used in both studies, comprising the following steps: converting the 3-axis signals to signal vector magnitudes (SVM), extracting brushing episodes, and classifying episodes to determine the main outcome variables. Converting the 3-axis signal to SVM The first step was to compute the SVM from the raw three-axis accelerometer data. The SVM provides a summarized movement magnitude measure by combining the acceleration information from the x, y, and z axes, down-sampling the 50 Hz raw data to magnitudes measured at 1 Hz (averaging over n = 50 samples).
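The SVM step can be sketched in plain Python as follows. This is an illustration of the per-window magnitude averaging described above, not the exact Open Movement implementation; some SVM variants additionally subtract 1 g to remove gravity, which is omitted here.

```python
import math

def svm_series(samples, n=50):
    """Convert raw 3-axis accelerometer samples to 1 Hz signal vector
    magnitudes (SVM).

    samples -- list of (x, y, z) accelerations in g, sampled at 50 Hz
    n       -- samples per output window (50 samples -> 1 value per second)
    """
    out = []
    for start in range(0, len(samples) - n + 1, n):
        window = samples[start:start + n]
        # Magnitude of each sample, averaged over the 1-second window.
        mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
        out.append(sum(mags) / n)
    return out

# 2 seconds of a stationary sensor measuring only gravity (~1 g on z).
still = [(0.0, 0.0, 1.0)] * 100
magnitudes = svm_series(still)  # -> [1.0, 1.0]
```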
Figure 4a shows one participant's data after SVM transformation, where each data point (dot) represents the average movement magnitude in a 1-second time window. This processing was done using the built-in SVM algorithm of Open Movement v1.0.030, the default software for the Axivity AX3 sensor. Extracting brushing episodes As Figure 4a shows, brushing episodes (the spikes) could even be identified visually when the data were clean, but not when there was noise caused by other movements. Given this problem, a threshold-based algorithm was first used to scan the data sequentially and efficiently extract all potential brushing episodes (see the classified data points in Figure 4b), and then a manual check was performed to exclude "invalid" episodes. The details of this step can be found in [31]. Classifying episodes to create the main outcome variables The remaining episodes were then classified into 6 categories based on their starting times: morning (5:00 - 12:00), morning-afternoon (12:00 - 15:00), afternoon (15:00 - 19:00), afternoon-evening (19:00 - 21:00), evening (21:00 - 24:00), and overnight (0:00 - 5:00). The final episode-level data could contain more than one episode per time category on each date. At the day level, two variables - morning brushing and evening brushing - were created, and their values (0 or 1) were determined by searching the relevant categories on the same date to see whether any episode existed. For morning brushing, the category morning was searched first, and if no episode was found, the category morning-afternoon was searched. For evening brushing, the categories evening and overnight were searched first, and if no episode was found, the category afternoon-evening was searched. When known or unknown events caused noise in the data in a certain period, the values of the two brushing variables were coded as missing data.
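The classification step can be sketched as follows (hypothetical helper names). Note that for the 0/1 indicators, the "search the primary category first, then fall back" rule reduces to checking whether any of the relevant categories contains an episode:

```python
def classify_time(hour):
    """Map an episode's starting hour (0-23) to one of the six categories."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 15:
        return "morning-afternoon"
    if 15 <= hour < 19:
        return "afternoon"
    if 19 <= hour < 21:
        return "afternoon-evening"
    if 21 <= hour < 24:
        return "evening"
    return "overnight"  # 0:00 - 5:00

def day_outcomes(episode_hours):
    """Derive the day-level (morning, evening) 0/1 indicators from the
    start hours of a day's brushing episodes."""
    cats = {classify_time(h) for h in episode_hours}
    # Morning brushing: 'morning' first, fall back to 'morning-afternoon'.
    morning = 1 if cats & {"morning", "morning-afternoon"} else 0
    # Evening brushing: 'evening'/'overnight' first, then 'afternoon-evening'.
    evening = 1 if cats & {"evening", "overnight", "afternoon-evening"} else 0
    return morning, evening
```

For example, episodes starting at 7:00 and 22:00 yield (1, 1), while a single 13:00 episode yields (1, 0) via the morning-afternoon fallback.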
Eventually, at the day level, dichotomous indicators (0 or 1) for the target brushing behavior and for brushing twice were used as the outcome variables in Study 1 and Study 2 respectively. The target for prediction was the brushing behavior on the next day, with the occurrence of brushing as the negative cases and the absence of brushing as the positive cases. They were coded in this way because, for real applications, a potentially more important goal is to detect the positive cases, i.e., the days on which the brushing behavior is likely to be omitted. The theory-based computational approach would be considered valuable if it led to models that performed better than models based simply on past behavior or on weekly self-reported variables. Specifically, models with 4 different feature sets were compared: • Survey model: The primary features in the survey model were the variables measured by the weekly surveys, including instrumental attitude, affective attitude, and self-reported behavioral automaticity. In addition, the occurrence of lab sessions (including the introduction meeting in Study 1) and the occurrence of reminders (including notifications and e-mails for surveys) were also included as features. • Past-behavior model: The primary feature in this model was the past behavior rate up to the day of the last observation. For example, if the brushing behavior on the 11th day was to be predicted, the brushing rate over the previous 10 days (e.g., 0.8) would be the value of this variable. For the first day, the past behavior rate was set to 0 in Study 1, as participants self-reported that they rarely brushed in the morning or in the evening. In Study 2, the self-reported behavior rates in the previous week were used as the initial values. Again, the occurrence of lab sessions and the occurrence of reminders were also included as features. • Theory-based model: The primary features were the computed habit strength and memory accessibility, updated daily from historical behavior and reminder data, together with the occurrence of lab sessions and reminders. • Combined model: This model used the features of the theory-based and past-behavior models together.
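The past-behavior feature is a simple running rate, which can be sketched as follows (hypothetical function name; initial_rate stands in for the study-specific initialization described above):

```python
def past_behavior_rates(brushed, initial_rate=0.0):
    """Running past-behavior rate used as the feature for predicting each
    day's behavior.

    brushed      -- list of 0/1 day-level outcomes, in chronological order
    initial_rate -- rate assumed before any observation (0 in Study 1;
                    self-reported previous-week rates in Study 2)
    """
    rates, total = [], 0
    for day, b in enumerate(brushed):
        # Feature for predicting day `day` uses only days before it.
        rates.append(initial_rate if day == 0 else total / day)
        total += b
    return rates

# For predicting day 4, the feature is 2 brushings over the first 3 days.
features = past_behavior_rates([1, 1, 0, 1])  # -> [0.0, 1.0, 1.0, 2/3]
```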
For each model type, three common statistical learning algorithms were used, namely logistic regression, support vector machines, and random forests. In total, this resulted in 12 models (4 model types × 3 algorithms) to be trained and tested. Two different approaches were used to compare model performance. First, a two-level hierarchical k-fold cross-validation procedure was used on each of the two data sets separately (see Figure 5). For each data set, all observations were divided into k non-overlapping groups (with the restriction that one participant's data were always in only one group), so that 1 group was reserved for model testing and the remaining k-1 groups were used for training in each round (the outer loop). Because tuning was needed both for the free parameters in the equations of HS and Acc and for the hyperparameters of the support vector machine and random forest, the training set in each round was further divided, with 1 group reserved as the test set for parameter tuning and the remaining k-2 groups as the training set for parameter tuning (the inner loop). For each free parameter in the theory-based equations, a 1000-step random search was used, and in each step a random value was drawn from a uniform distribution between 0 and 1. For the hyperparameters, grid search was used to sweep the parameter space defined in Table 1. These parameter values were optimized to obtain the best overall prediction performance in the inner cross-validation loop, indicated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Due to the sample size difference between the two studies, 9 folds were used for Study 1 (4 participants in each group) and 5 folds were used for Study 2 (15 participants in each group), in order to have sufficient data for training. Figure 5: An illustration of the nested cross-validation procedure used (it shows the 5-fold scenario for Study 2, but the same idea applies to Study 1).
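The two-level, participant-grouped splitting can be illustrated in plain Python (a minimal sketch with hypothetical function names, not the authors' actual implementation, which used the mlr package):

```python
import random

def group_folds(participants, k, seed=0):
    """Split participant IDs into k non-overlapping groups, so that one
    participant's observations never span a train/test boundary."""
    ids = sorted(set(participants))
    random.Random(seed).shuffle(ids)
    return [set(ids[i::k]) for i in range(k)]

def nested_splits(participants, k):
    """Yield (train, tune, test) participant sets for the two-level loop:
    the outer loop holds out one group for testing (k-1 remain for
    training); the inner loop holds out one further group for tuning the
    free parameters and hyperparameters (k-2 remain for fitting)."""
    folds = group_folds(participants, k)
    for test in folds:
        rest = [f for f in folds if f is not test]
        for tune in rest:
            train = set().union(*(f for f in rest if f is not tune))
            yield train, tune, test

# Example: 15 participants, 5 folds -> 5 outer x 4 inner = 20 splits.
splits = list(nested_splits(range(15), k=5))
```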
Since we had two similar data sets, in a second approach we evaluated the ability of each model type to predict new data. This approach was used to evaluate the generalizability of the models, in particular the generalizability of the parameters used to compute the theory-based features (e.g., HGP, ADP). Specifically, one of the two data sets was used to train the models, and the resulting models were used to predict the observations in the other data set. When parameter tuning was required, k-fold cross-validation was used on the whole training data set, with the same search methods indicated above. Again, 9-fold or 5-fold cross-validation was used when Study 1 or Study 2 served as the training data set, respectively. For model comparison, we primarily focused on AUC. Compared with other performance metrics, AUC takes both positive and negative cases into account and is generally considered the best choice for both balanced and unbalanced data sets [40]. AUC was also chosen because we were more interested in the predicted probabilities of brushing than in the classifications under a particular threshold. For comprehensiveness, we also report other performance measures computed using the optimal threshold for each model, namely the Matthews correlation coefficient (MCC), overall accuracy, F-score, true positive rate, false positive rate, precision, and negative predictive value. All analyses were performed in the R statistical programming environment (version 3.3.3), with the help of the mlr (machine learning in R, version 2.1.3) package [41]. Study 1 Study 1 included 711 non-missing observations for the prediction task, with 376 positive cases (non-brushing) and 335 negative cases (brushing). Thus, the prediction accuracy would be 53% if a no-skill model always predicted positive cases. Figure 6 shows the testing ROC curves of the different models, and Table 2 compares additional testing performance measures of the models (aggregated over cross-validation iterations).
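AUC can be computed directly from predicted probabilities via its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A self-contained sketch (not the mlr implementation):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank identity.

    labels -- 0/1 class labels (1 = positive, here a non-brushing day)
    scores -- predicted probabilities of the positive class
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-over-negative "wins", with ties worth 0.5 each.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a fully inverted ranking 0.0, and constant scores 0.5, matching the no-skill baseline discussed above.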
All models performed substantially better than the no-skill model, with average accuracy ranging between 64% and 71%. Various performance measures indicated that the theory-based models were better than the survey models, but slightly worse than the past-behavior models. It was also clear that combining the features of the theory-based and past-behavior models did not improve performance any further. In terms of learning algorithms, their results were largely the same, although random forest showed a larger decline in performance from the training to the testing set, suggesting some overfitting during training.
Study 2
Study 2 included 1508 non-missing observations for the prediction task, with 557 positive cases (non-brushing) and 951 negative cases (brushing). Thus, the data were less balanced, and the prediction accuracy would be 63% if a no-skill model always predicts negative cases. Figure 7 shows the testing ROC curves of different models, and Table 3 compares additional testing performance measures of the models in Study 2. Since the data were more unbalanced (more negative cases due to a higher brushing rate) compared with Study 1, all models were able to predict more accurately, with average accuracy between 64% and 78%. In contrast with Study 1, the theory-based models performed much better than the survey models, and also slightly better than the past-behavior models. The models with combined features were arguably the best, although their improvements over the theory-based models were very small. Again, differences between the three algorithms were very small, with logistic regression the best overall. Because all three algorithms gave similar results, we report only the logistic regression models for the cross-study predictions. The models' abilities to predict unseen data from a different study are summarized in Figure 8 and Table 4.
For all models except the theory-based model, prediction performance when predicting data from a different study was close to the corresponding performance when predicting unseen subsets of the same study. However, the performance of the theory-based model dropped noticeably when predicting data from a different study (e.g., AUC decreased from 0.737 to 0.718 for Study 1, and from 0.815 to 0.753 for Study 2), suggesting that the parameters in the theory-based equations (e.g., HDP and HGP) tuned on one data set were not necessarily optimal for a different study. Lastly, out of theoretical interest, we examined the optimal values for the free parameters in the theory-based equations of habit strength and accessibility. For the parameters governing the dynamics of habit strength, optimal ranges of parameter values could be found, and the results were similar regardless of the data set used (see Figure 9). To achieve the best performance based on AUC, the optimal value for the habit decay parameter (HDP) was in the range of 0.15 to 0.2, while the optimal value for the habit gain parameter (HGP) was in the range of 0.1 to 0.2. In contrast, for the parameters that determine the dynamics of accessibility, there were no clear relationships between their values and model prediction performance (see Figure 10). Examining the individual features in the theory-based models showed that habit strength contributed most of their predictive power, whereas accessibility contributed much less.
Figure 8: Cross-study prediction results (left panel: predicting Study 2's data using models trained on Study 1's data; right panel: predicting Study 1's data using models trained on Study 2's data).
Recently developed theory-based computational models allow BCSS to model users' habit learning in behavior change processes.
In this paper, we reviewed computational models of habit learning and evaluated the utility of one of these models in a behavior prediction use case, based on data collected in two field intervention studies on toothbrushing behavior. Through a nested cross-validation procedure, theory-based models with computed habit strength and memory accessibility were compared with two baseline models in terms of how well they could predict brushing behavior on the next day. In both studies, the theory-based models performed better than the survey models that used self-reported behavioral determinants as features. In the second, larger study, the theory-based models also performed slightly better than the models based simply on past behavior rates. However, as the cross-study prediction results suggest, the small advantage of the theory-based approach comes at a cost: for predicting behavior in a different context and/or with different users, the free parameters in the theory-based equations need to be tuned again. One cannot assume that the same rates of habit formation and decay generalize to all application situations. We initially expected that the computed cognitive variables would increase the predictive power of models based on other commonly used features, such as behavioral determinants (e.g., attitude) and contextual factors (e.g., emotional states, environmental cues). We were therefore surprised that combining the computed habit strength with either instrumental or affective attitude self-reported in the surveys did not perform better than the theory-based models alone (not reported in the results sections). Of course, the measurements of attitude were at the weekly level, so it remains unknown whether more immediate contextual factors (e.g., sleepiness of the person in the evening, behavior of the partner, etc.) would further increase the prediction accuracy of brushing behavior.
Without this information, the current prediction accuracy of around 65-77% might be the limit. Although the equation of habit strength was motivated by theories (e.g., [22, 23]), the computed variable also represents a specific summary of past behavior. This similarity between the theory-based models and the past-behavior models was also reflected in the fact that they seemed to provide similar information, since adding their features together did not improve performance much further. Compared with the past-behavior models, which weight each past behavior equally, the equation of habit strength weights behaviors at different time points in the past in a more sophisticated way. Given the habit decay parameter, the contributions of behaviors far in the past to the current habit strength are discounted exponentially, by the decay parameter raised to the power of n (HDP^n), where n denotes the number of time steps into the past. Behaviors in the later stages of habit formation also tend to have increasingly smaller immediate contributions to the current habit strength, because the habit gain parameter is modulated by the term 1-HS_t. For the purpose of behavior prediction alone, it would be interesting to examine the mathematical properties of the equation more closely and to explore whether other ways of weighting past behaviors could result in better prediction performance. Beyond behavior prediction, the parameter estimation procedure used in our studies also has implications for the theoretical understanding of habit formation. The optimal values tuned for the habit gain parameter are very close to the value of 0.19 obtained through statistical modeling of the temporal dynamics of self-reported habit strength or behavioral automaticity [33]. However, inconsistent with previous studies that suggested a much smaller habit decay parameter [25, 33], its optimal value here was in the same range as the habit gain parameter.
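Both weighting effects can be made concrete with a short numerical sketch. The update rule below (gain proportional to 1 - HS after a performed behavior, multiplicative decay otherwise) is an assumed simplified form for illustration, not the paper's exact equation:

```python
def habit_strength_series(behaviors, hgp, hdp):
    """Trace of habit strength over time under an assumed update rule:
    gain HGP * (1 - HS) after a performed behavior, multiplicative
    decay of HS otherwise (illustrative simplification)."""
    hs, trace = 0.0, []
    for performed in behaviors:
        hs = hs + hgp * (1.0 - hs) if performed else hs * (1.0 - hdp)
        trace.append(hs)
    return trace

# One behavior followed by pure decay: its remaining contribution
# shrinks geometrically with the number of time steps since it occurred.
decay_trace = habit_strength_series([1] + [0] * 9, hgp=0.15, hdp=0.18)

# Repeated performance: each additional behavior adds less, because the
# gain is modulated by (1 - HS_t) as habit strength approaches its ceiling.
growth_trace = habit_strength_series([1] * 10, hgp=0.15, hdp=0.18)
```

A plain behavior-frequency feature, by contrast, assigns every past time step the same weight, which is exactly the contrast drawn in the paragraph above.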
In general, these results speak to the theoretical meaningfulness of the computational model of habit strength used for prediction. In contrast, the parameters in the equation of accessibility did not seem to have optimal values, which casts doubt on the validity of modeling memory accessibility in its current form. While our current findings are limited to the context of two intervention trials, our theory-based approach can be easily implemented in real BCSS. As long as behaviors and other contextual variables are observed by sensors and parameter values are estimated from existing data, a digital system can update its representation of the user's habit strength after every relevant behavioral context, without asking the user to report it repeatedly. Real-time behavior prediction using computed habit strength provides the basis for delivering personalized and adaptive interventions. Instead of classifying brushing and non-brushing, the system can simply estimate the probability of brushing (non-brushing) and then use different thresholds for delivering different types of interventions. For example, if brushing probabilities stay very low for several days (e.g., 10%), the system may decide to repeat an extensive education session about the importance of an optimal oral health routine. Instead, if a user is predicted to brush the next morning with a probability of 0.6, a gentle reminder may be sent. Such adaptive interventions are important because, even though the costs of delivering digital interventions are low, overly frequent or inappropriate actions may disrupt or even irritate users [42]. Besides behavior prediction, a system may use the computed habit strength more directly. For example, tracking a user's habit strength for a newly trained behavior may give the system a better idea of the progress of behavior change.
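Such a threshold policy is straightforward to sketch. The function below is a hypothetical illustration of the idea; the threshold values, window length, and action names are our own assumptions, not values prescribed by the paper:

```python
def choose_intervention(brushing_probs, low=0.10, high=0.80):
    """Map predicted brushing probabilities for recent days to an
    intervention. Thresholds and action names are illustrative
    assumptions for this sketch."""
    recent = brushing_probs[-3:]  # look at the last few days
    if all(p <= low for p in recent):
        return "education_session"  # persistently low: re-educate
    if recent[-1] < high:
        return "gentle_reminder"    # at-risk next day: nudge
    return "no_action"              # brushing is likely; avoid irritating the user
```

In this sketch, a user predicted to brush with probability 0.6 receives a gentle reminder, matching the example above, while consistently high probabilities trigger no action at all, in line with the concern about disrupting users [42].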
Even when the target behavior is already performed consistently, a habit strength weaker than a certain threshold (e.g., 0.8) would suggest that the current intervention should be continued to reduce the risk of relapse. Future research should explore these different ways of implementing our theory-based approach in real-world applications and extend our work to other behavioral domains beyond oral health.
References
[1] A foundation for the study of behavior change support systems. Personal and Ubiquitous Computing
[2] Smartphones for large-scale behavior change interventions
[3] A taxonomy of behavior change techniques used in interventions
[4] Habit formation and behavior change
[5] Opportunities and challenges of behavior change support systems for enhancing habit formation: a qualitative study
[6] Digital behaviour change interventions to break and form habits
[7] The goal-dependent automaticity of drinking habits
[8] A new look at habits and the habit-goal interface
[9] Actions and habits: the development of behavioural autonomy
[10] Monitoring eating habits using a piezoelectric sensor-based necklace
[11] Towards online and personalized daily activity recognition, habit modeling, and anomaly detection for the solitary elderly through unobtrusive sensing
[12] Towards detection of bad habits by fusing smartphone and smartwatch sensors
[13] Modeling and understanding human routine behavior
[14] Psychology of habit
[15] Studying human habits in societal context: examining support for a basic stimulus-response mechanism
[16] Psychology of habit. Annual Review of Psychology
[17] Don't kick the habit: the role of dependency in habit formation apps
[18] Don't forget your pill! Designing effective medication reminder apps that support users' daily routines
[19] Beyond self-tracking and reminders: designing smartphone apps that support habit formation
[20] Reflections on past behavior: a self-report index of habit strength
[21] Towards parsimony in habit measurement: testing the convergent and predictive validity of an automaticity subscale of the self-report habit index
[22] A computational model of habit learning to enable ambient support for lifestyle change
[23] Habits without values. Psychological Review
[24] A bounded rationality model of short and long-term dynamics of activity-travel behavior
[25] Changing behavior by memory aids: a social psychological model of prospective memory and habit development tested with dynamic field data
[26] The role of the basal ganglia in habit formation
[27] Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control
[28] Speed/accuracy trade-off between the habitual and the goal-directed processes
[29] The fundamentals of learning
[30] Multialternative decision field theory: a dynamic connectionist model of decision making
[31] Towards a psychological computing approach to digital lifestyle interventions
[32] The organization of behavior: a neuropsychological theory
[33] How are habits formed: modelling habit formation in the real world
[34] Why option generation matters for the design of autonomous e-coaching systems
[35] Retrieval processes in prospective memory: theoretical approaches and some new empirical findings. In Prospective Memory: Theory and Applications
[36] A review and analysis of the use of 'habit' in understanding, predicting and influencing health-related behaviour
[37] Persuasion-induced physiology partly predicts persuasion effectiveness
[38] Large scale population assessment of physical activity using wrist worn accelerometers: the UK Biobank study
[39] Habit, information acquisition, and the process of making travel mode choices
[40] Empirical comparison of area under ROC curve (AUC) and Matthews correlation coefficient (MCC) for evaluating machine learning algorithms on imbalanced datasets for binary classification
[41] mlr: Machine Learning in R
[42] My phone and me: understanding people's receptivity to mobile notifications