Transformer-Based Behavioral Representation Learning Enables Transfer Learning for Mobile Sensing in Small Datasets

Mike A. Merrill; Tim Althoff — 2021-07-09

Abstract. While deep learning has revolutionized research and applications in NLP and computer vision, this has not yet been the case for behavioral modeling and behavioral health applications. This is because the domain's datasets are smaller, have heterogeneous datatypes, and typically exhibit a large degree of missingness. Therefore, off-the-shelf deep learning models require significant, often prohibitive, adaptation. Accordingly, many research applications still rely on manually coded features with boosted tree models, sometimes with task-specific features handcrafted by experts. Here, we address these challenges by providing a neural architecture framework for mobile sensing data that can learn generalizable feature representations from time series and demonstrates the feasibility of transfer learning on small data domains through finetuning. This architecture combines benefits from CNN and Transformer architectures to (1) improve prediction performance by up to 0.33 ROC AUC by learning directly from raw minute-level sensor data without the need for handcrafted features, and (2) use pretraining to outperform simpler neural models and boosted decision trees with data from as few as a dozen participants.

In recent years, deep learning has provided performance improvements across NLP and computer vision. However, most off-the-shelf methods require large datasets, which has slowed their adoption in behavioral modeling applications such as mobile sensing, where researchers report difficulties collecting labeled datasets of even dozens of participants. Without substantial funding for personnel and devices, data collection quickly becomes an arduous process of recruiting and paying participants, debugging devices, and monitoring data quality [27]. Therefore, researchers have historically often been limited to small datasets, hand-crafted features, and non-neural models [11, 27].

If mobile sensing applications could achieve improved performance and generalizability with smaller datasets, they could be deployed quickly in situations with limited training data. For example, in the crucial early days of an emerging disease outbreak, such as the COVID-19 pandemic, laboratory testing may not be widely available, and many positive cases may remain undetected. In this setting, a highly generalizable model could be trained on what few positive test results a researcher had, and then used to identify members of a population who may be infected and should be targeted for additional testing.

In this paper we enable this capability through transfer learning, which has recently driven advances in NLP and computer vision [4, 5, 9, 10, 16]. In each case, a researcher first obtains a model trained on a large, more readily available, task-independent dataset (such as the popular language model BERT, which was trained in part on Wikipedia) and finetunes it on a typically smaller task-specific dataset. The resulting finetuned model provides higher predictive power than either the pretrained model without finetuning or a model trained exclusively on the smaller dataset.
However, existing architectures and pretraining techniques cannot trivially accept high-dimensional, heterogeneous, multi-modal time series data, and they are not equipped to handle missing data; both properties are rare in computer vision and NLP but endemic in behavioral modeling applications [27]. Accordingly, our goal is not only to demonstrate that transformers provide significantly improved predictive power over other methods, but also to show how they can be coupled with pretraining to help researchers make stronger inferences with limited datasets.

We first describe the Homekit Flu Monitoring Study, which we use as a test bed for exploring transfer learning for behavioral data. We then show that our model provides significantly improved predictive power over other methods on three tasks related to detecting the flu. Next, we show that by pretraining our model on a large dataset and finetuning it on only a dozen participants' data we can outperform neural and non-neural baselines to deliver inferences on unlabeled participants. Finally, we discuss the implications of these findings in the context of behavioral monitoring.

Our model builds upon prior work in neural methods and transfer learning for behavioral sensing and modeling. It is the first to mine raw sensor signals for generalizable feature representations that enable transfer learning in small datasets. Behavioral data has been modeled and mined using deep learning techniques across a variety of domains, including human activity recognition (using CNNs) [28], personalized fitness recommendation (using stacked LSTMs [6]) [18], mood prediction (using RNNs, GRUs, or autoencoders) [2, 21, 22], stress prediction (using LSTMs and autoencoders) [12], and personality prediction [26]. Two studies experimented with multi-head attention and convolution as we do here, but neither applies this architecture to transfer learning [20, 23]. Transfer learning for wearable and sensor data has been explored in human activity recognition [15], stress and mood prediction [7, 12], and forecasting adverse surgical outcomes in an ICU [3]. However, none of these applications focus explicitly on model performance on small datasets, as we do here. In this respect, the most relevant work is Tang et al. [23], which tests its methods on populations with as few as 1.5k samples. However, in this work we not only train our model on a dataset with fewer than half as many samples, but also show that its predictions generalize to a population up to five hundred times as large (Section 4.2).

We first describe a dataset of FitBit recordings and flu test results, which we use as a test bed for transformer-based behavioral representation learning. We then detail our model, which is composed of a CNN encoder for capturing hierarchical and temporal features and a transformer for learning relationships between these features.

Data and Hand-Crafted Features. Our dataset consists of 118k user-days of FitBit data collected from 983 participants in the Homekit Flu Monitoring Study over the course of six months. Each minute, the devices recorded the participant's total steps, average heart rate, and a binary flag indicating whether the participant was sleeping. Participants also completed daily surveys which asked if they were experiencing flu symptoms, including coughing, chills, fever, and fatigue. When a participant indicated that they were experiencing a cough and one other symptom, they were asked to self-administer a saliva swab test kit, which was then mailed to a lab for further analysis.

Missing Data. Our dataset demonstrates modest missingness, with the median participant supplying data on 114 of 120 possible days. Researchers frequently report missingness as an obstacle to adopting deep learning techniques, so we model missingness by replacing missing values with zeros and including a binary flag for each sensor stream at each timestep to indicate whether the value is missing.
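Concretely, each model input is a multi-channel window of minute-level data in which every sensor stream is paired with a missingness indicator. The minimal sketch below illustrates this encoding; the function and argument names are ours, not the paper's.

```python
import numpy as np

MINUTES_PER_DAY = 1440
WINDOW_DAYS = 4  # the paper predicts from the four days preceding each day

def build_window(steps, heart_rate, is_asleep):
    """Stack raw minute-level streams with per-stream missingness flags.

    Each argument is a float array of length WINDOW_DAYS * MINUTES_PER_DAY
    (5,760 minutes) with np.nan wherever the device reported no value.
    Returns a (6, 5760) array: three sensor channels plus three binary
    "is missing" channels, with missing values zero-filled.
    """
    streams = np.stack([steps, heart_rate, is_asleep]).astype(np.float32)
    missing = np.isnan(streams).astype(np.float32)  # 1.0 where no reading
    streams = np.nan_to_num(streams, nan=0.0)       # replace missing with zeros
    return np.concatenate([streams, missing], axis=0)
```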
Our model is composed of a convolutional encoder, a stack of transformer blocks, and a final densely connected linear layer that is used for classification. Intuitively, the convolutional encoder learns local temporal features of the raw sensor data, while the transformer blocks learn relationships between these features.

Convolutional Encoder. The convolutional encoder learns a temporal, hierarchical feature representation of the raw sensor data. We experimented with several architectures and found that three layers with kernel sizes of 5, 3, and 1; stride sizes of 2, 2, and 2; and output channels of 256, 128, and 64 worked best, although in practice the model does not appear to be particularly sensitive to this module's hyperparameters. Note that unlike SAnD (which applies convolution across the features at each timestep), we treat each sensor stream as an input channel and compute multiple convolutions between timesteps [20]. As a result, the input to our transformer blocks is significantly compressed. We experimented with SAnD's convolutional layer, but failed to achieve above-random performance on any task. We suspect that this is because our model's inputs are longer (four days of data as opposed to one) and "thinner" (six features as opposed to 76), meaning that compression along the feature axis alone does little to reduce the overall dimensionality.

Transformer Blocks. Our model uses a stack of two transformer layers, each composed of four attention heads and a feed-forward layer. We take the output of the final layer to be the learned representation of the input time series.

Training. We train our model with the Adam optimizer [8] and focal loss [13], which has previously been shown to be highly effective in cases such as ours that exhibit extreme class imbalance (roughly 500:1 in the case of the Kit Trigger task). We selected data from a subset of 10% of users in our test set and used this data as an evaluation set for hyperparameter tuning.
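The following PyTorch sketch makes this architecture concrete, using the layer sizes reported above. The activation functions, feed-forward width, and sequence pooling are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class BehaviorTransformer(nn.Module):
    """Sketch of the CNN-encoder + transformer classifier described above."""

    def __init__(self, n_channels: int = 6):
        super().__init__()
        # Three conv layers: kernel sizes 5/3/1, strides 2/2/2, and output
        # channels 256/128/64, as reported in the paper. ReLUs are assumed.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=1, stride=2), nn.ReLU(),
        )
        # Two transformer layers with four attention heads each.
        layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(64, 1)  # final dense layer for classification

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 6, 5760), a four-day minute-level window
        h = self.encoder(x).permute(0, 2, 1)    # -> (batch, ~719, 64)
        h = self.transformer(h)                 # learned representation
        return self.classifier(h.mean(dim=1))   # mean pooling is our assumption
```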
To evaluate our model's performance we simulate a realistic setting in which a hypothetical researcher deploys a model's predictions on a population after an initial training period (Figure 1). We train on the first three months of data, test on the following three months, and make a prediction on each task for every participant on every day of the study. Additionally, our model only uses data from the four days prior to the day on which we make a prediction, so that no information from the future is used to make a prediction about the past. We also include no explicit information about a user's identity (e.g., participant ID or demographics), to encourage the model to learn generalizable motifs in activity data rather than facets of individual users' behavior. This evaluation setting follows best-practice recommendations and avoids falsely overstating the level of performance [17].

We evaluate our model on three behavioral modeling tasks:

• Flu Positivity: Will the participant produce a swab that tests positive for the flu today?
• Kit Trigger: Will the participant trigger their home test kit today (i.e., report a cough and at least one other symptom)?
• Flu Symptoms: Will the participant report any flu symptoms today?

In each case, we compare our model to the following baselines:

• CNN: How important are the transformer layers to our pretrained model's performance? To answer this question, we removed the transformer blocks from our model and passed the CNN's final output directly to a linear layer.
• XGBoost: How well does our model perform relative to a non-neural baseline? Boosted decision trees are frequently used in sensing studies because they are supported by common, easy-to-use libraries and often achieve strong performance out of the box [27]. Since boosted trees do not scale well to the thousands of observations in our raw time series data, we compute a set of commonly used features for each day in the window and concatenate these features to form the final input (sketched below). A list of all features is available in Table 1.
• XGBoost - Expert: What if our non-neural model had access to features designed by experts? This model is similar to the previous baseline, but we add six features from prior work which have been shown to be relevant to respiratory viral infections [19].

We do not include a "transformer only" baseline (i.e., our model without the CNN encoder) because multi-head attention scales quadratically in memory with the length of the input, making such an experiment computationally infeasible on a multi-day time series window with commodity GPUs: minute-level data in a four-day window produces a sequence of 5,760 timesteps, which exceeds common context sizes in transformer models for NLP [1].
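Below is a minimal sketch of this baseline pipeline. The specific aggregate features are illustrative placeholders; the paper's actual feature set is listed in its Table 1, which is not reproduced here.

```python
import numpy as np
import xgboost as xgb

def day_features(day: dict) -> list:
    """Illustrative day-level aggregates over minute-level arrays
    ("steps", "heart_rate", "is_asleep") with np.nan for missing minutes.
    These placeholders stand in for the paper's Table 1 features."""
    hr = day["heart_rate"]
    return [
        np.nansum(day["steps"]),        # total daily steps
        np.nanmean(hr),                 # mean heart rate
        np.nanstd(hr),                  # heart rate variability proxy
        np.nansum(day["is_asleep"]),    # minutes asleep
        float(np.mean(np.isnan(hr))),   # fraction of minutes missing
    ]

def window_features(days: list) -> np.ndarray:
    # Concatenate per-day features across the four-day window.
    return np.concatenate([day_features(d) for d in days])

# X: (n_examples, 4 * 5) stacked window features; y: binary task labels.
# model = xgb.XGBClassifier(n_estimators=100)  # hyperparameters not from the paper
# model.fit(X, y)
```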
Our model outperforms all baselines on all three tasks (Table 2a). Specifically, we perform 13% to 53% better than a CNN alone, and 1.4% to 27% better than the XGBoost models. Interestingly, our model's marginal performance gain on the Flu Positivity task is much higher than on the other tasks. One might expect that predicting flu positivity is inherently more difficult than predicting symptoms alone (as in the other two tasks), since the former requires the model to separate participants who are sick with some other respiratory viral infection from flu-positive participants. However, we theorize that participants who are flu positive are more likely to display the kind of behavioral changes (e.g., staying home from work, sleeping in later) that our model is designed to capture.

What if a researcher were interested in making inferences about their whole population, but only had access to labels from a small subset of it (as might be the case in an emerging disease scenario)? Here we showcase our model's ability to learn from small datasets with pretraining and finetuning.

Pretraining. The Homekit Flu Monitoring Study included a daily questionnaire which asked participants to indicate whether they were experiencing fatigue. Much as in Section 4.1, we take this response as a label and associate it with the four-day window of data preceding the response. We then pretrain our model on the first three months of data from the study. This process simulates a researcher using all available data up to the point where they begin testing for a new condition, like a novel disease.

Finetuning. Next, we randomly select twelve participants and use their data (constituting 600 user-days) from the second three-month period to finetune the model on the "Flu Symptoms" task (Section 4.1). This small dataset size mirrors many projects in ubiquitous computing, mobile sensing, and clinical studies, which use on the order of a dozen participants for their inferences [14, 24, 25]. Finally, we use this finetuned model to make predictions about the remaining participants in the second three months of data. This step simulates a researcher using limited information from a small subset of their population to make inferences about the remaining participants.

Results. With training data from just twelve users, our pretrained model outperforms all baselines on the Flu Symptoms task (Table 2b). Notably, our model with pretraining outperforms XGBoost on this task (0.59 ROC AUC vs. 0.54 ROC AUC), while the same model without pretraining barely performs above random chance (0.51 ROC AUC). This result shows that our model could be deployed in scenarios with limited information from which to draw inferences.
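A minimal sketch of this pretrain-then-finetune procedure follows, reusing the BehaviorTransformer sketch from earlier. The data loaders, epoch counts, and learning rates are placeholder assumptions, not values from the paper.

```python
import copy
import torch
from torchvision.ops import sigmoid_focal_loss

def train(model, loader, epochs: int, lr: float):
    # Adam and focal loss, as described in the Training section above.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:  # x: (batch, 6, 5760), y: binary labels
            opt.zero_grad()
            logits = model(x).squeeze(-1)
            loss = sigmoid_focal_loss(logits, y.float(), reduction="mean")
            loss.backward()
            opt.step()
    return model

# pretrain_loader: fatigue labels from the first three months (hypothetical name).
pretrained = train(BehaviorTransformer(), pretrain_loader, epochs=10, lr=1e-3)

# finetune_loader: the twelve selected participants' ~600 user-days of
# Flu Symptoms labels (hypothetical name). All weights are reused; the smaller
# learning rate is a common finetuning choice, not the paper's stated value.
finetuned = train(copy.deepcopy(pretrained), finetune_loader, epochs=10, lr=1e-4)
```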
In this paper we presented the first transformer-based architecture for behavioral representation learning that can learn directly from raw sensor data and, when coupled with pretraining-based transfer learning, can be effectively trained on as few as a dozen users' data. We believe that a model like ours could feasibly transform behavioral modeling by providing a common, generalizable set of learned feature representations for small-data applications across domains like mobile sensing, ubiquitous computing, and machine learning for health. Such a tool could democratize behavioral data science by offering significantly improved predictive performance to researchers who lack the resources or opportunity to scale data collection or model training beyond dozens of participants, and it could extend recent performance advances in NLP and computer vision to behavioral modeling.

References.
[1] Longformer: The Long-Document Transformer
[2] Deepmood: Modeling mobile phone typing dynamics for mood detection
[3] Forecasting adverse surgical events using self-supervised transfer learning for physiological signals
[4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[5] Deep residual learning for image recognition
[6] Long short-term memory
[7] Predicting Tomorrow's Mood, Health, and Stress Level using Personalized Multitask Learning and Domain Adaptation
[8] Adam: A Method for Stochastic Optimization
[9] ImageNet classification with deep convolutional neural networks
[10] ALBERT: A lite BERT for self-supervised learning of language representations
[11] A review of mobile sensing systems, applications, and opportunities
[12] Extraction and Interpretation of Deep Autoencoder-based Temporal Features from Wearables for Forecasting Personalized Mood, Health, and Stress
[13] Focal Loss for Dense Object Detection
[14] Digital Biomarkers of Symptom Burden Self-Reported by Perioperative Patients Undergoing Pancreatic Surgery: Prospective Longitudinal Study
[15] Transfer learning for activity recognition in mobile health
[16] Distributed representations of words and phrases and their compositionality
[17] Dear Watch, Should I Get a COVID-19 Test? Designing deployable machine learning for wearables
[18] Modeling heart rate and activity data for personalized fitness recommendation
[19] Assessment of Prolonged Physiological and Behavioral Changes Associated With COVID-19 Infection
[20] Attend and diagnose: Clinical time series analysis using attention models
[21] Sequence multi-task learning to forecast mental wellbeing from sparse self-reported data
[22] Deepmood: Forecasting depressed mood based on self-reported histories via recurrent neural networks
[23] SelfHAR: Improving Human Activity Recognition through Self-training with Unlabeled Data
[24] Digital biomarker of mental fatigue. npj Digital Medicine, 2021
[25] CrossCheck: toward passive sensing and detection of mental health changes in people with schizophrenia
[26] Representation learning on variable length and incomplete wearable-sensory time series
[27] Understanding Practices and Needs of Researchers in Human State Modeling by Passive Mobile Sensing
[28] DeepSense: A unified deep learning framework for time-series mobile sensing data processing