key: cord-0114734-3xdqfgar
authors: Liu, Hanyang; Lou, Sunny S.; Warner, Benjamin C.; Harford, Derek R.; Kannampallil, Thomas; Lu, Chenyang
title: HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electronic Health Records
date: 2022-05-24
journal: nan
DOI: nan
sha: ca210cdf52f7d41b33867cd70baec7b3c8ccb80d
doc_id: 114734
cord_uid: 3xdqfgar

Burnout is a significant public health concern affecting nearly half of the healthcare workforce. This paper presents the first end-to-end deep learning framework for predicting physician burnout based on clinician activity logs, digital traces of their work activities, available in any electronic health record (EHR) system. In contrast to prior approaches that exclusively relied on surveys for burnout measurement, our framework directly learns deep workload representations from large-scale clinician activity logs to predict burnout. We propose the Hierarchical burnout Prediction based on Activity Logs (HiPAL), featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs and a hierarchical predictive model, which mirrors the natural hierarchical structure of clinician activity logs and captures physicians' evolving workload patterns at both short-term and long-term levels. To utilize the large amount of unlabeled activity logs, we propose a semi-supervised framework that learns to transfer knowledge extracted from unlabeled clinician activities to the HiPAL-based prediction model. The experiment on over 15 million clinician activity logs collected from the EHR at a large academic medical center demonstrates the advantages of our proposed framework in predictive performance of physician burnout and training efficiency over state-of-the-art approaches.

Burnout is a state of mental exhaustion caused by one's professional life [16]. It contributes to poor physical and emotional health, and may lead to alcohol abuse and suicidal ideation [13]. Physician burnout is widespread in healthcare settings and affects nearly 50% of physicians and health workers. It is associated with negative consequences for physician health, their retention, and the patients under their care [24]. The recent COVID-19 pandemic has further highlighted the negative impact of physician burnout [27]. In essence, burnout is a considerable public health concern, and effective tools for monitoring and predicting clinician burnout are desperately needed [24].

One of the key contributors to burnout is clinical workload. With advances in clinical informatics, there are new approaches to track a clinician's activities on an electronic health record (EHR). The availability of clinician activity logs, the digital footprint of physicians' EHR-based activities, has enabled studies for tracking EHR-based workload measures [5, 31], offering new opportunities for assessing the association between workload and burnout. More recently, such activity logs have been used to predict burnout using off-the-shelf machine learning models [12, 22]. However, as these models are unable to directly process unstructured data, they rely exclusively on feature engineering, using hand-crafted summary statistics of activity logs as features. Developing such models hence requires considerable domain knowledge in medicine and cognitive psychology to obtain clinically meaningful measures, and these statistical features are often less effective in capturing the complicated dynamics and temporality of activities.

An ideal burnout prediction framework should be end-to-end and able to efficiently learn deep representations of workload dynamics directly from raw activity logs. This enables the potential for real-time phenotyping of burnout that is of high clinical value in facilitating early intervention and mitigation for the affected clinician. There are two major challenges in building such a framework. The first challenge is to extract useful knowledge from unstructured raw log files, which track over 1,900 types of EHR-based actions (activities associated with reports, note review, laboratory tests, creating and managing orders, and documenting patient care activities). A predictive framework must be able to encode both these activities and associated timestamps and capture the underlying dynamics and temporality that build up the high-level workload representations. In other words, an effective data encoding mechanism tailored for activity logs is a key for any deep sequence model.

The second challenge for training a deep predictive model is the combination of large-scale activity logs (i.e., long data sequences) and a limited number of eligible surveys (i.e., limited labels). In order to measure workload and predict burnout on a per-month basis, the sequence model must be able to efficiently process sequences of highly variable length, from a few hundred to over 90,000 events. On the other hand, due to the high cost and uncertainty of survey collection, only half of the activity logs recorded are labeled with burnout outcomes. These constraints require the sequence model to have a long-term memory (i.e., a wide receptive field [1]) while keeping model complexity (i.e., number of parameters) relatively small to prevent overfitting. However, many popular sequence models based on recurrent neural networks (RNN) or 1D Transformer [6, 42] are not suitable for raw activity logs of this scale due to high time or memory cost.

Apart from addressing the above-mentioned two challenges, it would be useful for a predictive model to further capture and utilize the hierarchical structure naturally embedded in clinician activity logs (see Figure 1). Physicians' work life is intrinsically hierarchical. They interact with the EHR system in sessions (clusters of activities) of various lengths that are embedded within a shift, and then a month. Intuitively, the temporal congregation of clinical activities may contain useful information associated with burnout; e.g., the same total workload spread evenly over a week likely has a different effect on wellness status than more intense shiftwork over two days. However, single-level sequence models are unable to recover the hierarchical structure or the multi-scale temporality of the data. Moreover, none of the recently proposed hierarchical sequence models such as [33, 38, 41] are designed for burnout prediction or similar problems, nor are they efficient enough to process sequences at this scale.

To address these challenges, we propose the Hierarchical burnout Prediction based on Activity Logs (HiPAL), a deep learning framework spanning representation learning to hierarchical sequence modeling as a comprehensive solution to the burnout prediction problem. To the best of our knowledge, this is the first end-to-end approach that directly uses raw EHR activity logs for burnout prediction. HiPAL features the following key components:

• A pre-trained time-dependent activity embedding mechanism tailored for EHR activity logs is designed to encode both the dynamics and temporality of activities. The encoded joint representations build up and shape the bottom-up workload measures at multiple levels as the basis for burnout prediction.

Figure 1: Illustration of data including EHR activity logs with shift-month hierarchical structure and monthly surveys.

(deep representations), whereas the high-level RNN-based encoder further cumulatively arranges all workload measures into deep monthly workload measures and captures the long-term temporality of daily workload development. The hierarchical architecture enables HiPAL to maintain long memory with multi-scale temporality without increasing the model complexity. The sharing of the low-level sequence encoder across shifts helps maintain a moderate number of parameters and enables high computation efficiency over large-scale activity logs.
• To utilize the large amount of unlabeled activity logs, we extend HiPAL to a semi-supervised framework (Semi-HiPAL). For this, we pre-train the low-level encoder and transfer knowledge to the predictive model with an unsupervised sequence autoencoder that learns to reconstruct the action-time sequences.
• The experiment on a real-world dataset of over 15 million activity logs from 88 physicians over a 6-month period shows improved performance and high computational efficiency of HiPAL and Semi-HiPAL in predicting burnout.

The data used in this work were collected from 88 intern and resident physicians in Internal Medicine, Pediatrics and Anesthesiology at the Washington University School of Medicine, BJC HealthCare and St Louis Children's Hospital. During the data collection phase from September 2020 through April 2021, all participants consented (IRB# 202004260) to provide access to their EHR-based activity logs and to complete surveys every month. Please see Appendix Section A for more details of the dataset.

2.1.1 EHR Activity Logs. EHR activity logs are traces of a clinician's EHR interactions. These log files record all activities performed on an EHR system including the time, patient, activity, and user responsible for each data access event and are a comprehensive record of a clinician's work activities. Activity logs are commonly created for security and compliance purposes in EHR. In our dataset, there were 1,961 distinct types of clinical activities such as review of patient data, note writing, order placement and review of clinical inbox messages. Each activity log action (i.e., clinical activity) of the user (i.e., physician participant) was recorded by the EHR system with a timestamp specifying the start time of each action. All activity log actions are recorded passively in system background without interfering with the user's normal clinical workflow. In total, over 15 million activity logs across 6 to 8 months were collected in the dataset (on average over 20,000 logs per month per participant).

Figure 2: CDFs of hand-crafted clinical workload measures of EHR activity logs (e.g., patient-related EHR time) on a monthly basis, grouped by burnout score ranges from low to high.

Intern and resident physicians included in this study rotated between different clinical assignments (e.g., Internal Medicine, Pediatrics, and Anesthesiology) every 4 weeks. Surveys were designed to evaluate each participant's recent wellness status and were sent to each participant at 4-week intervals, timed to coincide with the end of each rotation. Each participant was asked to complete 6 surveys. The monthly surveys were used to evaluate the participant's burnout status using the Stanford Professional Fulfillment Index (PFI) [34] based on workload exhaustion and depersonalization, with scores ranging from 0 to 4. We follow the previous work [22] and define burnout as a PFI score greater than or equal to 1.33. Only about half of the activity logs (391 of 754 months) were labeled with eligible surveys.
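The labeling rule above can be illustrated with a short sketch. The function name, the use of `None` for months without an eligible survey, and the example scores are assumptions made for illustration; only the 0 to 4 score range, the 1.33 cutoff, and the 391-of-754 count come from the text.

```python
from typing import List, Optional

PFI_BURNOUT_THRESHOLD = 1.33  # cutoff stated in the text, following [22]


def burnout_label(pfi_score: Optional[float]) -> Optional[int]:
    """Map a monthly PFI burnout score to a binary label.

    Returns 1 (burned out), 0 (not burned out), or None when the month
    has no eligible survey (hypothetical convention for unlabeled months).
    """
    if pfi_score is None:
        return None
    if not 0.0 <= pfi_score <= 4.0:
        raise ValueError("PFI burnout scores range from 0 to 4")
    return int(pfi_score >= PFI_BURNOUT_THRESHOLD)


# Hypothetical scores for four survey months of one participant.
scores: List[Optional[float]] = [0.5, 1.33, 2.7, None]
labels = [burnout_label(s) for s in scores]

# Share of months with an eligible survey, using the counts from the text.
labeled_fraction = 391 / 754
```

With these inputs the second month sits exactly on the cutoff and is therefore labeled as burnout, and roughly 52% of the recorded months carry a label.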

A physician's shiftwork may occur during the day or night, with the start and end times varying over rotations and individuals. The work within a shift is in general continuous, with relatively short intervals between activities. Typically a physician has one shift per day. The work of a monthly rotation is naturally segmented into separate temporal clusters by shifts (see Figure 1). We follow [11] to automatically segment the activity logs based on the lengths of the intervals between recorded timestamps.
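As a rough illustration of interval-based segmentation, the sketch below starts a new shift whenever the gap between consecutive timestamps exceeds a threshold. The 4-hour threshold, the `(timestamp, action)` record layout, and the function name are assumptions for this example; the actual rule used in [11] is not specified here.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in hours, action name).
Event = Tuple[float, str]


def segment_shifts(events: List[Event], max_gap_hours: float = 4.0) -> List[List[Event]]:
    """Split events into shifts wherever the time gap exceeds max_gap_hours.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    shifts: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Continue the current shift if the gap to its last event is short enough.
        if shifts and event[0] - shifts[-1][-1][0] <= max_gap_hours:
            shifts[-1].append(event)
        else:
            shifts.append([event])
    return shifts


# Hypothetical log: three events in the morning, two events the next day.
log = [(8.0, "open_chart"), (8.5, "review_note"), (12.0, "place_order"),
       (30.0, "open_chart"), (31.0, "write_note")]
shift_sizes = [len(s) for s in segment_shifts(log)]
```

Here the 18-hour gap between the third and fourth events separates the log into two shifts of three and two events.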

As part of preliminary data analysis, we assessed the correlation between several hand-crafted workload measures and burnout scores. We selected several basic summary statistics (e.g., time spent on EHR) for assessment. The cumulative distribution function (CDF) of each measure is displayed in Figure 2. Groups of physician-months with different burnout severity ranges are colored progressively (darker colors mean more severe burnout). In general, there was an association between physician workload and risk of burnout. For example, participants with higher burnout scores tended to spend more time interacting with the EHR and had more patients. Meanwhile, some workload measures (e.g., patient-related EHR time) vary progressively with burnout severity but show a complicated associative pattern. Hence, there exists a certain level of predictive information in the EHR activity logs that can be extracted via designed workload measures and then used for burnout prediction as in our prior work [22]. Nevertheless, the design and selection of clinically meaningful workload measures require considerable domain knowledge. These carefully designed measures (usually summary statistics as in Figure 2) show limited effectiveness in capturing the complicated predictive patterns [22]. An end-to-end deep learning framework is therefore needed to learn complex workload representations from raw activity logs.

Formally, in longitudinal EHR activity log data, the activities of the $n$-th month for the $m$-th clinician can be represented as $\mathcal{P}_{m,n} = \{\mathcal{V}^{(k)}\}_{k=1}^{K_{m,n}}$, where $K_{m,n}$ is the number of shifts in the $n$-th month for clinician $m$. The whole activity log dataset can be written as

$$\mathcal{D} = \{\mathcal{P}_{m,n} : n = 1, \dots, N_m;\ m = 1, \dots, M\},$$

where $N_m$ is the total number of survey months of data collected from clinician $m$ and $M$ is the total number of clinicians. In what follows, the indices of clinicians and survey months are omitted for simplicity. The work at the $k$-th shift for a clinician can be represented as a sequence of actions and their time stamps

$$\mathcal{V}^{(k)} = [(e_1^{(k)}, t_1^{(k)}), \dots, (e_{T_k}^{(k)}, t_{T_k}^{(k)})],$$

where $e_i^{(k)} \in \{0, 1\}^{|\mathcal{A}|}$ is the one-hot vector denoting the $i$-th action at time $t_i^{(k)}$, $T_k$ is the number of actions in the $k$-th shift, $\mathcal{A}$ is the set of clinician actions, and $|\mathcal{A}|$ is the number of unique actions among all clinicians. The goal is to use the activity logs $\mathcal{P}_{m,n}$ to predict the binary label $y_{m,n} \in \{0, 1\}$ that denotes the wellness status of clinician $m$ in the $n$-th month, by learning a predictive model $f: f(\mathcal{P}_{m,n}) \to y_{m,n}$.

Burnout Prediction. Early studies mainly focused on risk factors associated with burnout using self-reported surveys and have highlighted the link between perceived workload and burnout [2, 29, 30]. Machine learning models such as k-means and linear regression were used to identify the level of burnout [3, 4, 18]. Because they rely purely on self-reported measures, these approaches are subject to inaccurate workload measurement and cannot provide unobtrusive burnout prediction [14]. Recently, the quantification of clinician workload based on EHR use has enabled studies to track EHR-based workload measures [5, 31] and to apply off-the-shelf machine learning to predict burnout using carefully designed summary statistics of clinical activities as features [12, 22].

Sequence Models. Their directional nature has made Recurrent Neural Networks (RNN) and their popular variants LSTM and GRU [8] the default choices for modeling sequential data such as natural language, time series, and event sequences. But RNN-based models are slow and difficult to train on large-scale sequential data due to their step-by-step recurrent operation and potential issues with long-term gradient backpropagation [39]. Recently, 1D Transformer variants [6, 42] applied multi-head self-attention [35] to time series and natural language to model long dependencies, but their time and space complexity remains significant and exceeds that of RNNs. In contrast, models based on convolutional neural networks (CNN) are more efficient in modeling long sequences because they can be parallelized. Fully Convolutional Networks (FCN) and Residual Convolutional Networks (ResNet) [36] have demonstrated superior performance in time series classification on over 40 benchmarks. Meanwhile, 1D CNNs with dilated causal convolutions have shown promise in efficiently modeling long sequential data such as audio [25]. This idea has been further extended and developed as a class of long-memory sequence models, Temporal Convolutional Networks (TCN), to model large-scale sequential data such as videos and discrete events [1, 9, 17]. In ResTCN [1], multiple layers of dilated causal convolutions are stacked together to form a block, and blocks are combined with residual connections in order to build deeper networks efficiently.

Hierarchical Sequence Models. Hierarchical models have shown promise in capturing the hierarchical structure of data [19, 38, 41] or obtaining multi-scale temporal patterns [28, 39] from long sequences such as documents [38], videos [41], online user records [28, 32, 39], and sensor data [10, 26]. However, even with a multi-level architecture, RNN-based hierarchical models [19, 38] still struggle to process long sequences efficiently. Recently, [39] proposed to build a hierarchical model on top of TCN for efficiently modeling multi-scale user interest for recommendation systems, in which the TCN is used as the decoder for sequence generation conditioned on a high-level RNN for long-term dependencies. Despite the similar motivation, [39] is unsuitable for classification due to its sequence-to-sequence architecture designed for recommendation systems. In general, none of these approaches were designed for burnout prediction or similar problems, nor tailored for EHR activity logs, whose data modality and hierarchical structure are significantly distinct from all the above applications.

Figure 3 shows the overview of our HiPAL framework, featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs, a hierarchical predictive model that models clinician workload at multiple temporal levels, and the semi-supervised framework that transfers knowledge from unlabeled data. Our HiPAL framework can be built upon any convolution-based sequence model as the base model.

Different from other sequential data such as natural language and time series, activity logs consist of sequences of clinical actions and associated timestamps. The clinician workflow and workload information associated with risk of burnout are preserved in the dynamics and temporality within these ordered action-time pairs. We design a specific action-time joint embedding method to extract these patterns for prediction.

Context-aware Pre-training. The representation $b_i$ of action $e_i$ should be able to encode the contextual information, i.e., similar embeddings $b_i$ should represent actions with similar context. We propose to pre-train the embedding matrix $W_a$ in an unsupervised fashion inspired by the Word2Vec [23] word embedding, which has been widely used in many natural language processing tasks. Here we adopt skip-gram [23] for embedding pre-training. The embedding of the $i$-th action $e_i$ is linearly projected to predict the neighboring actions $\{e_{i-L}, \dots, e_{i-1}, e_{i+1}, \dots, e_{i+L}\}$. We set $L = 5$ in this paper.
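To make the skip-gram setup concrete, the following sketch enumerates the (center action, context action) training pairs that a window of 5 actions on each side would produce. The function name and the example action names are hypothetical, and the training of the embedding matrix itself is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(actions: List[str], window: int = 5) -> List[Tuple[str, str]]:
    """Enumerate (center, context) pairs within `window` positions on each side."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(actions):
        lo = max(0, i - window)                  # clip the window at the sequence start
        hi = min(len(actions), i + window + 1)   # and at the sequence end
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, actions[j]))
    return pairs


# Hypothetical action sequence from one shift.
shift_actions = ["open_chart", "review_note", "place_order"]
pairs_w1 = skipgram_pairs(shift_actions, window=1)
pairs_w5 = skipgram_pairs(shift_actions, window=5)
```

With a window of 1 the middle action is paired with both neighbors while the end actions get one pair each; with a window of 5 every action in this short sequence is paired with every other action.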

where $W_t$ and $d_t$ are the weight and bias variables for the time interval embedding. The time intervals, which follow a long-tail distribution, are log-transformed to obtain embeddings with more stationary dynamics. Note that for the first action, b ( )

4.1.3 Encoding Time Periodicity. Additionally, inspired by [15], in order to capture the periodicity patterns in the time sequence within a work shift, we transform the scalar notion of time $t_i^{(k)}$ into a vector $c_i^{(k)}$:

$$c_i^{(k)}[j] = \begin{cases} \omega_j t_i^{(k)} + \varphi_j, & j = 1 \\ \sin(\omega_j t_i^{(k)} + \varphi_j), & j > 1 \end{cases} \quad (2)$$

where $c_i^{(k)}[j]$ is the $j$-th entry of the vector $c_i^{(k)}$, and $\omega_j$ and $\varphi_j$ are trainable frequency and bias variables. The first entry in $c_i^{(k)}$ captures the aperiodic pattern while the rest of the entries capture the periodic pattern by a sinusoidal function.

4.1.4 Action-time Joint Embedding. We obtain the time-dependent activity embedding by concatenating the vectors of the action embedding $b_i^{(k)}$, the time interval embedding $a_i^{(k)}$, and the time periodicity embedding $c_i^{(k)}$:

$$g_i^{(k)} = \mathrm{Concat}(b_i^{(k)}, a_i^{(k)}, c_i^{(k)}) \quad (3)$$

Alternatively, the aggregation method Concat(·) can be replaced by addition. We find concatenation works better for this dataset. The joint embedding $g_i^{(k)}$ can then be used as the input to a sequence model for burnout prediction. With the physician's activity logs encoded into the representation in Eq. (3), in theory we can apply any single-level sequence model on top of the activity embedding layers for burnout prediction.
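A minimal sketch of the joint embedding is given below, under several assumptions that go beyond the text: the time interval is transformed with `log(1 + Δt)` and passed through an elementwise linear map, the first action of a shift is given an interval of 0, and all parameter values are fixed by hand rather than trained. The function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def interval_embedding(timestamps: Sequence[float], w_t: Vector, d_t: Vector) -> List[Vector]:
    """Embed log-transformed gaps between consecutive actions.

    Assumes log(1 + dt) and an interval of 0 for the first action (both assumptions).
    """
    out: List[Vector] = []
    prev = timestamps[0]
    for t in timestamps:
        x = math.log1p(t - prev)
        out.append([w * x + d for w, d in zip(w_t, d_t)])
        prev = t
    return out


def periodicity_embedding(t: float, omega: Vector, phi: Vector) -> Vector:
    """First entry is linear in t (aperiodic); the remaining entries are sinusoidal."""
    linear = omega[0] * t + phi[0]
    periodic = [math.sin(w * t + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


def joint_embedding(action_vecs: List[Vector], timestamps: Sequence[float],
                    w_t: Vector, d_t: Vector, omega: Vector, phi: Vector) -> List[Vector]:
    """Concatenate action, time interval, and time periodicity embeddings per action."""
    intervals = interval_embedding(timestamps, w_t, d_t)
    return [list(b) + a + periodicity_embedding(t, omega, phi)
            for b, a, t in zip(action_vecs, intervals, timestamps)]


# Two actions with 2-dim action embeddings, 2-dim interval and 3-dim periodicity embeddings.
g = joint_embedding([[0.1, 0.2], [0.3, 0.4]], [0.0, 2.0],
                    w_t=[1.0, -1.0], d_t=[0.0, 0.5],
                    omega=[0.5, 1.0, 2.0], phi=[0.0, 0.0, 0.0])
```

Each resulting vector has 2 + 2 + 3 = 7 entries. Replacing the list concatenation with an elementwise sum would give the additive variant mentioned above, which requires all three embeddings to share one dimension.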

To capture the temporal clustering patterns of clinician activities and the natural shift-month hierarchical structure while improving computational efficiency, we propose HiPAL, a two-level model that naturally mirrors the hierarchical structure of sequential data. 

$$h^{(k)} = \Phi(g_1^{(k)}, \dots, g_{T_k}^{(k)}; \theta_l) \quad (4)$$

where $\theta_l$ denotes the parameters of the low-level encoder, and $h^{(k)}$ denotes the representation of daily workload.

$$v^{(k)} = \mathrm{LSTM}(h^{(1)}, \dots, h^{(k)}; \theta_h) \quad (5)$$

where $\theta_h$ denotes the parameters of the high-level LSTM model, and $v^{(k)}$ denotes the cumulative workload representation. We use the last cumulative representation $v^{(K)}$ as the workload measurement for the whole survey month, where $K$ denotes the number of work shifts in the month. Then we map the monthly workload representation $v^{(K)}$ to an estimate of the burnout label using a multi-layer perceptron (MLP) as the classifier

$$\hat{y} = \mathrm{MLP}(v^{(K)}; \theta_c) \quad (6)$$

where $\hat{y}$ is the monthly risk score. The network parameters, including the embedding parameters $W_a$, $W_t$, $d_t$, $\omega$ and $\varphi$, the low-level model parameters $\theta_l$, the high-level model parameters $\theta_h$, and the classifier parameters $\theta_c$, can all be jointly trained by minimizing the cross entropy loss between true and estimated labels. The cross entropy (CE) loss can be written as

$$\mathcal{L}_{CE} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \quad (7)$$

Please refer to the Supplementary Sections for more details about the design choice of sequence encoder Φ.
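The two-level data flow can be sketched with deliberately simple stand-ins. Mean pooling replaces the convolutional low-level encoder, a decayed `tanh` update replaces the LSTM, and a single logistic unit replaces the MLP classifier; these substitutions, the function names, and the toy inputs are all assumptions chosen to keep the example dependency-free. Only the structure (one shared encoder applied per shift, a recurrent accumulation over shifts, a classifier on the last state) follows the text.

```python
import math
from typing import List

Vector = List[float]


def encode_shift(embeddings: List[Vector]) -> Vector:
    """Stand-in for the shared low-level encoder: mean-pool one shift's embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def accumulate(daily: List[Vector], decay: float = 0.5) -> List[Vector]:
    """Stand-in for the high-level recurrent model: one cumulative state per shift."""
    state = [0.0] * len(daily[0])
    states: List[Vector] = []
    for h in daily:
        state = [math.tanh(decay * s + (1.0 - decay) * x) for s, x in zip(state, h)]
        states.append(state)
    return states


def monthly_risk(month: List[List[Vector]], weights: Vector, bias: float = 0.0) -> float:
    """Risk score in (0, 1) computed from the last cumulative state of the month."""
    daily = [encode_shift(shift) for shift in month]  # same encoder reused for every shift
    last_state = accumulate(daily)[-1]
    z = sum(w * v for w, v in zip(weights, last_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy month with three shifts of different lengths (2-dim joint embeddings).
month = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
risk = monthly_risk(month, weights=[1.0, -1.0])
```

Because the recurrent step only needs the previous state and the current shift's representation, inference can proceed shift by shift without keeping the whole month's events in memory, which mirrors the memory argument made for the hierarchical design.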

A potential challenge with the proposed hierarchical architecture is that the network must learn to pass input information across many low-level sequence steps and two-level sequence models in order to affect the output. Inspired by RNN target replication [20] , we propose Temporal Consistency (TC) regularization to apply an additional loss L for the low-level encoder of HiPAL. In TC, we use a linear scoring layer to measure the daily risk w.r.t. each shift:

where $W_r$ and $d_r$ are trainable parameters, $K$ is the number of shifts, $r^{(k)} \in [0, 1]$ denotes the daily risk of the $k$-th shift, and $\mathcal{L}_{TC}$ denotes the TC loss that penalizes the overall mismatch between the daily risks $[r^{(1)}, \dots, r^{(K)}]$ and the monthly burnout outcome $y$. Then the network can be trained by minimizing the composite loss $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{TC}$,

where $\lambda$ is a trade-off parameter balancing the effect of the accumulative monthly risk and the overall daily risks. We select the value of $\lambda$ from $\{10^{-4}, 10^{-3}, \dots, 10^{4}\}$, and find that $\lambda = 10^{-1}$ works the best for our activity log dataset.
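The composite objective can be sketched as follows. The exact form of the TC mismatch term is not recoverable from the text here, so this example assumes it is the mean binary cross entropy between each daily risk and the monthly label; that choice and the function names are assumptions. The default trade-off value of 0.1 and the search grid are the ones reported above.

```python
import math
from typing import List


def bce(y: int, p: float, eps: float = 1e-7) -> float:
    """Binary cross entropy for one label and one predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def composite_loss(y: int, monthly_risk: float, daily_risks: List[float],
                   trade_off: float = 0.1) -> float:
    """Monthly CE loss plus a weighted TC term (assumed: mean CE over daily risks)."""
    loss_ce = bce(y, monthly_risk)
    loss_tc = sum(bce(y, r) for r in daily_risks) / len(daily_risks)
    return loss_ce + trade_off * loss_tc


# Candidate trade-off values: 10^-4, 10^-3, ..., 10^4.
trade_off_grid = [10.0 ** k for k in range(-4, 5)]

# Hypothetical burned-out month with an uninformative model (all risks at 0.5).
loss = composite_loss(y=1, monthly_risk=0.5, daily_risks=[0.5, 0.5])
```

With every risk at 0.5, both terms equal log 2, so the total is 1.1 times log 2 under the default trade-off.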

The TC regularization has three major effects. First, it helps pass the supervision signal directly to the low-level encoder for learning better class-specific representations. Distinct from target replication [21], which replicates labels to the LSTM output at all steps, TC bypasses the LSTM-based high-level encoder and directly regularizes the low-level representation learning with higher flexibility. Second, it regularizes the network to maintain relative temporal consistency across shifts by minimizing the overall mismatch between daily risks and the burnout outcome. The intuition is that in most cases the workload of a physician within a monthly rotation may remain similar across different shift days. Third, it enables better model interpretability of HiPAL. During inference time, while the monthly risk $\hat{y}$ is used to estimate the risk of burnout, the daily risks $[r^{(1)}, r^{(2)}, \dots, r^{(K)}]$ in Eq. (8) can be used to reflect the dynamics of daily workload in that month.

Tail Drop. During model training, the activity logs of the whole month until the time of survey submission are mapped to the binary burnout labels. A potential problem for model training is that the time between the end of the clinical rotation and survey submission may vary from hours to a few days for different individuals in different months. Thus the latest activity logs may describe the work of a new monthly rotation with a potentially very different workflow and workload, which can confuse the predictive model. An ideal predictive model should not be overly dependent on the length of data or on when model inference is run. To seek more robust prediction, we propose Stochastic Tail Drop, which stochastically drops the sequence tail with a length randomly drawn from a fixed distribution at each iteration during training. Formally, we draw the length of dropped tails from a power distribution

where $d_{\max}$ is the maximum number of days allowed to drop and $\rho$ controls the distortion of the distribution. We set $d_{\max} = 5$ and $\rho = 2$ for all HiPAL variants.
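One possible reading of Stochastic Tail Drop is sketched below. The exact power distribution is not recoverable from the text here, so this example assumes the dropped length is obtained by raising a uniform random number to the power 2 and scaling it to the range 0 to 5 days, which favors short drops. It also assumes one shift per day and always keeps at least one shift. All names are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_tail_length(max_days: int = 5, power: float = 2.0, rng=random) -> int:
    """Draw a tail length in {0, ..., max_days}, skewed toward small values.

    Assumed form: floor((max_days + 1) * u ** power) with u uniform on [0, 1).
    """
    u = rng.random()
    return int((max_days + 1) * u ** power)


def tail_drop(shifts: Sequence[T], max_days: int = 5, power: float = 2.0, rng=random) -> List[T]:
    """Drop a randomly sized tail of shifts (one shift per day assumed)."""
    n_drop = sample_tail_length(max_days, power, rng)
    n_keep = max(1, len(shifts) - n_drop)  # never drop the whole month
    return list(shifts[:n_keep])


# Draw a fresh tail length at each training iteration (seeded here for repeatability).
rng = random.Random(0)
month_of_shifts = list(range(20))  # stand-in for 20 daily shift sequences
augmented = [tail_drop(month_of_shifts, rng=rng) for _ in range(100)]
```

Each call returns the month with between 0 and 5 trailing shifts removed, so the model sees slightly different sequence endings for the same label across iterations.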

A challenge in burnout studies is the difficulty of collecting surveys from physicians. As a result, only about half of the activity logs are associated with burnout labels in our study. In contrast, activity logs are regularly collected for all physicians. To exploit the unlabeled activity logs, we design a semi-supervised framework that learns from all recorded activity logs and allows generalizable knowledge transfer from the unlabeled data to the supervised predictive model. We adopt an unsupervised sequence autoencoder (Seq-AE) that learns to reconstruct the action-time sequence of each work shift, where the encoder shares the same model parameters with the low-level encoder Φ in Eq. (4). We adopt a sequence decoder Ψ that mirrors the encoder Φ accordingly (e.g., a TCN decoder for a TCN encoder),

where $W_o$ and $d_o$ are weight and bias parameters. The action projection matrix $W_o$ is initialized with the transpose of the pre-trained embedding matrix $W_a$ for quicker convergence. Then the encoder parameters $\theta_l$ and the decoder parameters $\theta_d$, $W_o$ and $d_o$ can be pre-trained by minimizing the cross entropy

The Seq-AE is pre-trained on all available activity logs (labeled or unlabeled), and its encoder is then transferred to HiPAL, reused as the low-level encoder, and fine-tuned with the predictive model on labeled data (see Appendix for design details of Seq-AE).

The primary focus of our analysis was to develop generalizable predictive models that could be translated for predicting burnout outcomes for new unseen participants. Towards this end, the training and testing data are split by participant, so that no participant has activity logs (from different months) in both the training and testing sets. Considering the relatively small sample size (i.e., number of valid surveys), we evaluate each method with repeated 5-fold cross-validation (CV) in order to estimate as closely as possible the true out-of-sample performance of each model on unseen individuals. For each fold, the whole dataset is split into an 80% training set (with 10% of the training data used for validation) and a 20% testing set. We repeat the cross-validation with different random splits for 6 rounds and report the mean and standard deviation of the CV results. We use accuracy, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC) as the metrics to evaluate the burnout prediction performance. All non-neural models were implemented using Scikit-learn 1.0.1 with Python, and all deep learning models were implemented using TensorFlow 2.6.0. All models were tested on Linux Ubuntu 20.04 with Nvidia RTX 3090 GPUs.
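The participant-level split can be sketched as below. This is an illustrative pure-Python version, not the code used in the experiments; the function name, the round-robin fold assignment, and the toy participant IDs are assumptions.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def participant_folds(participant_ids: Sequence[int], n_folds: int = 5,
                      seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) so that each participant falls in exactly one set."""
    unique = sorted(set(participant_ids))
    random.Random(seed).shuffle(unique)
    # Assign whole participants (not individual months) to folds in round-robin order.
    fold_of = {p: i % n_folds for i, p in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, p in enumerate(participant_ids) if fold_of[p] == fold]
        train = [i for i, p in enumerate(participant_ids) if fold_of[p] != fold]
        yield train, test


# Toy dataset: 10 participants with 3 labeled months each (30 samples).
ids = [p for p in range(10) for _ in range(3)]

# Repeated cross-validation: 6 rounds with different random splits.
rounds = [list(participant_folds(ids, n_folds=5, seed=s)) for s in range(6)]
```

Each round yields five folds whose test sets hold two participants (six months) here, and no participant's months ever appear on both sides of a split.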

We compare our proposed burnout prediction framework to the following baseline methods.

• GBM/SVM/RF: Gradient Boosting Machines implemented with XGBoost [7], Support Vector Machines, and Random Forests, used in [22] for burnout prediction. We follow [22] to extract a set of summary statistics of activity logs as features for prediction.

All the compared deep models were implemented with our proposed time-dependent activity embedding for activity logs. As variants of our proposed hierarchical framework, HiPAL-f, HiPAL-c, and HiPAL-r correspond to the HiPAL-based predictive model with the low-level encoder Φ instantiated by FCN, CausalNet, and ResTCN, respectively. The semi-supervised model variants are named similarly.

Table 1 summarizes the performance of all the models, including non-deep-learning models, single-level models, and hierarchical models. Among the single-level sequence models, ResTCN achieves the best performance due to its effective deep architecture with a residual mechanism and carefully designed convolutional blocks. The corresponding HiPAL extension improves the performance of each base model. In particular, for CausalNet, which performs the worst among the single-level baseline models due to its relatively primitive architecture, HiPAL-c achieves 10.9%/14.8% average improvement on AUROC/AUPRC. The steady improvement regardless of the base model shows that the hierarchical architecture tailored for the problem enables HiPAL to better capture the multi-level structure in activity logs and their complex temporality and dynamics.

5.3.4 Supervised vs. Semi-supervised. Compared to the supervised HiPAL models, all three Semi-HiPAL counterparts in general achieve better average performance (except that Semi-HiPAL-r has slightly worse AUPRC than HiPAL-r). This indicates that, based on the Seq-AE pre-training, our semi-supervised framework is able to effectively extract generalizable patterns from unlabeled activity logs and transfer knowledge to HiPAL. This sheds light on the potential for improved prediction efficacy in real-world clinical practice, where costly burnout labels are limited in number but a huge amount of unlabeled activity logs is available.

Table 2 summarizes the model complexity and training time. Our proposed HiPAL framework is able to train with high efficiency, spending just a few seconds for one epoch on over 6 million activity logs (training set). In contrast, it takes hours to finish one epoch for LSTM and GRU on our dataset, which makes them unsuitable for this problem. Hence we do not report their performance in Table 1. Despite their hierarchical structure, the RNN-based models H-RNN and HierGRU still run about 50× slower than HiPAL-c. All three HiPAL variants maintain training speed comparable to that of their base models. With comparably low training time cost, HiPAL can further reduce the memory consumption during model inference, since HiPAL only needs to store the activity logs of the latest shift instead of the whole month as for single-level models.

Figure 3, both the TC and Tail Drop regularization help improve the prediction performance. We further test the effect of Stochastic Tail Drop on model robustness to uncertainty of time and data lengths. Figure 5 shows the performance variance with various prediction time offsets (by days) that correspond to different lengths of input activity logs. Compared to the HiPAL variants, the AUROC of the best

Table 3 summarizes the ablation study based on HiPAL-c. Among the different embedding configurations, our current HiPAL design, which incorporates the pre-trained action embedding, time interval embedding and time periodicity embedding, performs the best.

Figure 4 shows the visualized daily risks (Eq. (8)) across all 6 months of two typical physician participants from the testing set, one being burned out in every month and the other staying unaffected.
The daily risk scores can reflect the dynamics of daily workload across shifts of a month. We can observe that the physicians with the two typical types of wellness status show visually distinct daily risk patterns (see Appendix for more examples). The daily risks of Physician A remain at a medium to high level across shifts in each month, which reflects a continually heavy workload overall. In contrast, Physician B seems to have varying levels of workload that change intermittently between low and high. This may contribute to the reduction of the accumulative monthly risk of burnout. The TC regularizer with daily risk measures thus allows HiPAL to provide interpretable burnout prediction. This mechanism can potentially be used to facilitate root-cause analysis of burnout and intervention.

High levels of burnout can lead to medical errors, substance abuse, and suicidal ideation [13], so preventing or mitigating burnout has meaningful public health benefits. Currently, because physician wellness is difficult to assess in real time, interventions for physician burnout mainly focus on organizational and system-level measures, such as improving workflow design and enhancing teamwork and communication [37]. An end-to-end burnout monitoring and prediction system like HiPAL opens new potential for real-time burnout phenotyping and personalized interventions. Individual interventions for physician burnout differ in financial and administrative cost. Different situations may therefore require different sensitivity (true positive rate) and specificity (1 − false positive rate) from the burnout predictive model, and the decision threshold on the model output must be tuned accordingly. As the best variant, Semi-HiPAL-f achieves an AUROC of 0.6479, which reflects the average sensitivity over all specificity levels. Table 4 shows the model performance under two practical scenarios. Scenario A represents low-cost interventions, such as EHR training and online cognitive therapy, where a predictive model with high sensitivity (one that detects most true burned-out cases) is usually preferred. With the sensitivity set to 0.8, our model attains a moderate specificity of 0.4054, which means that nearly 60% of unaffected physicians would be included in the personal interventions. This is acceptable because these interventions also benefit physicians with normal wellness status and can help prevent future burnout. Scenario B represents high-cost interventions, such as taking days off or vacations, where a more specific predictive model is preferred. With the specificity set to 0.8, our model still achieves a sensitivity of 0.4782, meaning that the prediction can still benefit nearly half of the burned-out physicians in this extreme.
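To make the two scenarios concrete, the following minimal sketch shows one way a decision threshold could be chosen to meet a target sensitivity (Scenario A) or a target specificity (Scenario B). It is an illustration only, not the procedure used to produce Table 4; the function names and the rule "predict burnout when the score is at least the threshold" are assumptions, and both classes must be present in the labels.

```python
def confusion_rates(scores, labels, threshold):
    """Return (sensitivity, specificity), predicting burnout when score >= threshold.

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(scores, labels, target):
    """Highest observed threshold whose sensitivity is at least `target`."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = confusion_rates(scores, labels, t)
        if sens >= target:
            return t, sens, spec
    raise ValueError("target sensitivity is not reachable")


def threshold_for_specificity(scores, labels, target):
    """Lowest threshold whose specificity is at least `target`.

    An infinite threshold (nobody flagged) is included as a fallback.
    """
    for t in sorted(set(scores)) + [float("inf")]:
        sens, spec = confusion_rates(scores, labels, t)
        if spec >= target:
            return t, sens, spec
    raise ValueError("target specificity is not reachable")
```

Raising the threshold trades sensitivity for specificity, so the low-cost scenario picks the highest threshold that still catches enough true cases, while the high-cost scenario picks the lowest threshold that keeps false alarms rare enough.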

This work is not without limitations. In clinical practice, not all physician work is EHR-related or tracked by the EHR system, so the workload reflected in the activity logs does not always align with the actual wellness outcome. This may bound the prediction performance of a learning model based exclusively on activity logs. Given the current extent of our study, only a limited number of burnout labels have been collected, which may have kept the power of deep learning from being fully exploited. In future work, the availability of a much larger amount of EHR activity logs may allow us to explore more advanced semi-supervised or transfer learning approaches for better burnout prediction in support of physician well-being.

In this paper, we presented HiPAL, the first end-to-end deep learning framework for predicting physician burnout based on clinician activity logs available in any electronic health record (EHR) system. The HiPAL framework includes a time-dependent activity embedding mechanism tailored to EHR-based activity logs to encode raw data from scratch. We proposed a hierarchical sequence learning framework that efficiently learns deep workload representations with multi-level temporality from large-scale clinical activities and provides interpretable burnout prediction. The semi-supervised extension enables HiPAL to utilize the large amount of unlabeled data and transfer generalizable knowledge to the predictive model. The experiment on over 15 million real-world clinician activity logs collected from a large academic medical center shows the advantages of our proposed framework over state-of-the-art approaches in both predictive performance for physician burnout and training efficiency.

In this section, we introduce the three single-level convolution-based models we adopted, CausalNet, ResTCN, and FCN, as the low-level sequence encoder (Eq. (4)) in HiPAL-c, HiPAL-r, and HiPAL-f, respectively. All these sequence models must work with our proposed activity embedding for burnout prediction based on activity logs.

Fully Convolutional Networks (FCN), a deep CNN architecture with Batch Normalization, have shown compelling quality and efficiency for image tasks such as semantic segmentation. Later work [36] applied FCN to time series classification and showed that it outperformed multiple strong baselines on 44 time series benchmarks. It also outperformed another widely used CNN-based model with residual connections, ResNet [36], on most of these datasets. Hence, in this paper, we select the FCN implemented by [36] as the representative conventional CNN-based sequence model for comparison and as one of the base model choices for our HiPAL framework. An FCN model consists of several basic convolutional blocks. A basic block is a convolutional layer followed by a Batch Normalization layer and a ReLU activation layer, as follows:

$$y = W * x + b, \qquad s = \mathrm{BN}(y), \qquad h = \mathrm{ReLU}(s),$$

where $*$ is the convolution operator. For HiPAL-f, we use 3 blocks for the FCN.
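As an illustration of the basic block described above (convolution, then Batch Normalization, then ReLU), here is a minimal single-channel sketch in plain Python. It is not the implementation of [36] or of HiPAL-f: the function names, the zero "same" padding, the single channel, and the omission of the learned scale and shift in Batch Normalization are all simplifying assumptions.

```python
import math


def conv1d_same(x, w, b=0.0):
    """1-D convolution with zero padding so the output keeps the input length.

    Like most deep learning libraries, this slides the kernel without flipping it.
    """
    k = len(w)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * (k - 1 - pad)
    return [sum(w[i] * padded[t + i] for i in range(k)) + b for t in range(len(x))]


def batch_norm(y, eps=1e-5):
    """Normalize to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    return [(v - mean) / math.sqrt(var + eps) for v in y]


def relu(s):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in s]


def fcn_block(x, w, b=0.0):
    """One basic block: convolution, Batch Normalization, ReLU."""
    return relu(batch_norm(conv1d_same(x, w, b)))


def fcn(x, kernels):
    """Stack one basic block per kernel (three kernels give three blocks)."""
    h = list(x)
    for w in kernels:
        h = fcn_block(h, w)
    return h
```

Because no pooling is used between blocks, the sequence length is unchanged from input to output, which is the property the Seq-AE configuration for FCN relies on.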

Temporal convolutional networks (TCN) are a family of efficient 1-D convolutional sequence models in which convolutions are computed across time [1, 17]. Unlike the RNN family of sequence models, TCN performs computation layer-wise, so every time step is updated concurrently rather than recurrently [17]. TCN differs from a typical 1-D CNN mainly in its convolution mechanism, the dilated causal convolution. Formally, for a 1-D sequence input $X = [x_1, \ldots, x_T] \in \mathbb{R}^{T \times m}$ and a convolution filter $f \in \mathbb{R}^{k \times m}$, the dilated causal convolution operation $F$ on element $s$ of the sequence is defined as

$$F(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i},$$

where $d$ is the dilation factor, $k$ is the filter size, and $s - d \cdot i$ accounts for the direction of the past. Dilated convolution, i.e., using a larger dilation factor $d$, enables an output at the top level to represent a wider range of inputs, effectively expanding the receptive field [40] of the convolution. Causal convolution, i.e., convolving at each step only with previous steps, ensures that no future information is leaked to the past [1]. This property gives TCN a directional structure similar to that of RNN models. The output sequence $X' \in \mathbb{R}^{T \times m'}$ of the dilated convolution layer, with one channel for each of its $m'$ filters, can then be written as

$$X' = [F(1), \ldots, F(T)].$$

Layer Normalization or Batch Normalization is usually applied after the convolutional layer for better performance [1, 17]. A TCN model is typically built from multiple causal convolutional layers whose wide receptive field accommodates long sequence inputs. There are two major variants of TCN, with their architectures shown in Figure 6.

• CausalNet [17]: An early practice, as in [17], that connects multiple causal convolutional layers with a downsampling layer (e.g., Average Pooling) between every two convolutional layers, as many CNN models do. With the help of the downsampling layers, a long sequence input can be progressively summarized into a lower-dimensional dense representation.
• ResTCN: A variant that adds residual connections between dilated causal convolutional layers to further obtain a deeper TCN. Instead of using downsampling layers, ResTCN increases the dilation exponentially (e.g., using {1, 2, 4, ...} as the dilation factors) to realize an exponentially large receptive field.

In CausalNet, the spatial scale (number of time steps) keeps shrinking at higher layers because of the Max Pooling layers, while in ResTCN the spatial scale remains unchanged. For our prediction task, we configure HiPAL differently when it is implemented with CausalNet and with ResTCN. For CausalNet, we use a Flattening layer as the final feature aggregation layer before the final Softmax layer. For ResTCN, only the TCN output at the final step is used as the representation for prediction. In our implementation, both the CausalNet-based and the ResTCN-based HiPAL models have 6 causal convolutional layers. For single-level CausalNet and ResTCN, we set the number of layers to 12 to increase the convolutional receptive field over much longer sequences.

Figure 7: Visualized daily risks (Softmax output of the low-level encoder) for two groups of typical physician participants over 6 months: a group that was consecutively burned out (left column) and a group that stayed unaffected (right column). Darker colors correspond to larger Softmax scores. Grey denotes empty shifts (all shifts are aligned to the right).
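The dilated causal convolution and the effect of exponentially increasing dilation can be illustrated with a short single-channel sketch. The function names are hypothetical, past positions before the start of the sequence are treated as zeros (an assumption), and the receptive-field formula assumes one convolution per layer, which may differ from the exact layer layout used in HiPAL.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i] for every position s.

    Only current and past positions are read, so no future value
    can influence the output at position s.
    """
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(len(f)):
            j = s - d * i
            if j >= 0:  # positions before the sequence start count as zero
                acc += f[i] * x[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal layers, one convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Six layers with dilations doubling at each layer and a filter of size 3.
dilations = [2 ** layer for layer in range(6)]
print(receptive_field(3, dilations))
```

With a fixed dilation of 1, six such layers would cover only 13 time steps, whereas doubling the dilation per layer covers 127, which is why ResTCN can span long sequences without downsampling.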

Because the three base sequence models, FCN, CausalNet, and ResTCN, have different architectures, we configure the Seq-AE in Eq. (11) differently for each base model. For FCN, since the spatial size does not change at all, the decoder in Eq. (11) directly reuses the FCN of the encoder in Eq. (4) with the order of the layers reversed, which gives a mirrored structure. For CausalNet as the encoder, where downsampling layers (e.g., MaxPool) reduce the spatial scale, the decoder replaces the downsampling layers with upsampling layers that increase the spatial scale for data reconstruction. For ResTCN, since the last output of ResTCN is usually used as the representation for any downstream task, the decoder first replicates the representation produced by the encoder in Eq. (4) at every time step and then feeds the replicated sequence to a decoder counterpart of ResTCN configured in the same way.
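The two shape-restoring operations mentioned for the decoders can be sketched as follows. This is only an illustration of the idea, with hypothetical function names; nearest-neighbor repetition is an assumed choice of upsampling, and the released code may use a different one.

```python
def max_pool(x, size=2):
    """Downsample by keeping the maximum of each non-overlapping window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def upsample_nearest(x, factor=2):
    """Upsample by repeating each step `factor` times (CausalNet-style decoder)."""
    return [v for v in x for _ in range(factor)]


def replicate(representation, num_steps):
    """Copy one representation vector to every time step (ResTCN-style decoder input)."""
    return [list(representation) for _ in range(num_steps)]


sequence = [1.0, 3.0, 5.0, 7.0]
pooled = max_pool(sequence)          # half as many time steps
restored = upsample_nearest(pooled)  # original number of time steps again
decoder_input = replicate([0.1, 0.2], len(sequence))
```

In both cases the decoder input regains the original number of time steps, so the reconstruction can be compared with the input sequence step by step.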

The code is available at https://github.com/HanyangLiu/HiPAL. Table 7 shows the hyperparameters used in the implementation.

We follow our prior work [22] in selecting the summary statistics of activity logs as features for GBM, SVM and RF in Table 1 . The features include:

• Workload measures: total EHR time, after-hours EHR time, patient load, inbox time, time spent on notes, chart review, and number of orders, per patient per day.
• Temporal statistics: mean, minimum, maximum, skewness, kurtosis, entropy, total energy, autocorrelation, and slope of time intervals.
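A few of the temporal statistics listed above can be computed from a list of action timestamps as in the sketch below. The exact definitions used for the baselines in [22] are not given here, so the formulas (population skewness, lag-1 autocorrelation, least-squares slope against the interval index) and all function names are assumptions made for illustration.

```python
import statistics


def time_intervals(timestamps):
    """Gaps between consecutive actions, after sorting the timestamps."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def skewness(values):
    """Population skewness (third standardized moment)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


def lag1_autocorrelation(values):
    """Correlation of the series with itself shifted by one step."""
    m = statistics.fmean(values)
    den = sum((v - m) ** 2 for v in values)
    if den == 0:
        return 0.0
    num = sum((values[i] - m) * (values[i + 1] - m) for i in range(len(values) - 1))
    return num / den


def slope(values):
    """Least-squares slope of the values against their index (needs >= 2 values)."""
    x_mean = (len(values) - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(len(values)))
    return num / den


def temporal_stats(timestamps):
    """Summary features of the time intervals between logged actions."""
    iv = time_intervals(timestamps)
    return {
        "mean": statistics.fmean(iv),
        "min": min(iv),
        "max": max(iv),
        "skewness": skewness(iv),
        "autocorrelation": lag1_autocorrelation(iv),
        "slope": slope(iv),
    }
```

Such hand-crafted summaries compress a whole sequence of actions into a handful of numbers, which is the kind of input the GBM, SVM, and RF baselines operate on.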

An empirical evaluation of generic convolutional and recurrent networks for sequence modeling

Surgeon distress as calibrated by hours worked and nights on call

Mixed machine learning and agent-based simulation for respite care evaluation

Subtypes in clinical burnout patients enrolled in an employee rehabilitation program: differences in burnout profiles, depression, and recovery/resources-stress balance

Measures of electronic health record use in outpatient settings across vendors

The long-document transformer

XGBoost: A scalable tree boosting system

Empirical evaluation of gated recurrent neural networks on sequence modeling

Self-attention temporal convolutional network for long-term daily living activity detection

Hierarchical recurrent neural network for skeleton based action recognition

Automatic detection of front-line clinician hospital shifts: a novel use of electronic health record timestamp data

Understanding physician work and well-being through social network modeling using electronic health record data: a cohort study

Discrimination, abuse, harassment, and burnout in surgical residency training

Conceptual considerations for using EHR-based activity logs to measure clinician burnout and its effects

Learning a vector representation of time

Physician burnout: The hidden health care crisis

Temporal convolutional networks for action segmentation and detection

An app developed for detecting nurse burnouts using the convolutional neural networks in Microsoft Excel: population-based questionnaire study

Hierarchical recurrent neural network for document modeling

A critical review of recurrent neural networks for sequence learning

Learning to diagnose with LSTM recurrent neural networks

Predicting physician burnout using clinical activity logs: model performance and lessons learned

Efficient estimation of word representations in vector space

Taking action against clinician burnout: a systems approach to professional well-being

A generative model for raw audio

Seqsleepnet: end-to-end hierarchical recurrent neural network for sequence-to-sequence automatic sleep staging

Prevalence and correlates of stress and burnout among US healthcare workers during the COVID-19 pandemic: A national cross-sectional survey study

Personalizing session-based recommendations with hierarchical recurrent neural networks

Burnout and career satisfaction among American surgeons

Burnout and satisfaction with work-life balance among US physicians relative to the general US population

Metrics for assessing physician activity using electronic health record log data

Hierarchical context enabled recurrent neural network for recommendation

Personalized multitask learning for predicting tomorrow's mood, stress, and health

A brief instrument to assess both burnout and professional fulfillment in physicians: reliability and validity, including correlation with self-reported medical errors, in a sample of resident and practicing physicians

Time series classification from scratch with deep neural networks: A strong baseline

Intervention for physician burnout: a systematic review

Hierarchical attention networks for document classification

Hierarchical temporal convolutional networks for dynamic recommender systems

Multi-scale context aggregation by dilated convolutions

Hierarchical recurrent neural network for video summarization

Informer: Beyond efficient transformer for long sequence time-series forecasting