key: cord-0163562-183bai4d authors: Jauhar, Sujay Kumar; Chandrasekaran, Nirupama; Gamon, Michael; White, Ryen W. title: MS-LaTTE: A Dataset of Where and When To-do Tasks are Completed date: 2021-11-12 journal: nan DOI: nan sha: a517b4bec71136b8c5c0fa97630733afc9c6a2f5 doc_id: 163562 cord_uid: 183bai4d

Tasks are a fundamental unit of work in the daily lives of people, who are increasingly using digital means to keep track of, organize, triage and act on them. These digital tools -- such as task management applications -- provide a unique opportunity to study and understand tasks and their connection to the real world, and through intelligent assistance, help people be more productive. By logging signals such as text, timestamp information, and social connectivity graphs, an increasingly rich and detailed picture of how tasks are created and organized, what makes them important, and who acts on them, can be progressively developed. Yet the context around actual task completion remains fuzzy, due to the basic disconnect between actions taken in the real world and telemetry recorded in the digital world. Thus, in this paper we compile and release a novel, real-life, large-scale dataset called MS-LaTTE that captures two core aspects of the context surrounding task completion: location and time. We describe our annotation framework and conduct a number of analyses on the data that were collected, demonstrating that it captures intuitive contextual properties for common tasks. Finally, we test the dataset on the two problems of predicting spatial and temporal task co-occurrence, concluding that predictors for co-location and co-time are both learnable, with a BERT fine-tuned model outperforming several other baselines. The MS-LaTTE dataset provides an opportunity to tackle many new modeling challenges in contextual task understanding and we hope that its release will spur future research in task intelligence more broadly.

Tasks are the primary unit of personal and professional productivity. People capture, organize, track and complete tasks as a way to measure and make progress towards their goals. Task management strategies range from scribbled sticky notes on refrigerators to complex group collaboration platforms such as Microsoft Planner, and everything in between. Digital tools for task management support are prevalent in a range of applications including electronic mail (Bellotti et al., 2003), to-do applications (Bellotti et al., 2004) and digital assistants (Graus et al., 2016), and for a range of user scenarios such as contextual reminders (Kamar and Horvitz, 2011), task duration estimation (White and Hassan Awadallah, 2019), and complex task decomposition (Zhang et al., 2021). Many of these applications record and utilize a number of signals such as the text of a task, the time it was created, its importance, due date, and who it is assigned to, in order to build intelligent solutions for the user. However, they all notably assume tasks are homogeneous from the perspective of the completion context. This assumption is flawed because tasks are, in fact, deeply context dependent; completing or making progress on tasks depends on external factors, such as time in one's schedule, proximity to home or businesses, and resource availability. For example, a user is likely to want a reminder to buy eggs when they are close to a grocery store, as opposed to when they are at an airport. Two types of context that are especially salient are location and time.
These signals are readily accessible to systems and have been studied in previous work on contextual understanding (Graus et al., 2016; Bellotti et al., 2004; Benetka et al., 2019), recommendation (Zhuang et al., 2011; Yao et al., 2015; Zeng et al., 2016) and reminders (Kamar and Horvitz, 2011). However, a disconnect remains between logged contextual signals and actions taken in the real world, since users often record completion of tasks at a later time and a different location (Zhang et al., 2022). Moreover, most prior work leverages proprietary data and does not make it available to other researchers for further study. The lack of a sizeable publicly available task dataset, and especially one that is tagged with location and time meta-data, has limited research on task intelligence in general, and particularly on the important area of contextual task modeling. Thus, in this paper we are releasing a novel resource called the Microsoft Locations and Times of Task Execution (MS-LaTTE) dataset. This dataset of 10,101 tasks sourced from real-world data is the largest publicly available repository of to-do tasks of any kind, and an order of magnitude larger than previously collected datasets (Landes and Eugenio, 2018). Additionally, it is the only dataset that also contains contextual location and time labels for where and when the tasks are likely to be completed. The dataset is available here. In addition to collecting and releasing the data, we also explore MS-LaTTE in this paper, to assess its utility for contextual task modeling. Specifically, we analyze the annotations to see if they capture interesting properties or regularities that might be useful for downstream modeling. Additionally, we motivate and establish two new benchmark evaluation tasks derived from the MS-LaTTE dataset -- Co-location and Co-time prediction -- and evaluate a number of popular language modeling approaches against these benchmarks. We find that learning is possible on both benchmarks, and a fine-tuned BERT approach significantly outperforms other baselines. However, both Co-location and Co-time prediction are difficult problems, leaving room for more complex modeling efforts in future work to outperform the systems that we evaluate. In summary, we make the following contributions in this paper:
1. Highlight the opportunities and challenges in contextual task modeling.
2. Compile and publicly release a novel dataset with over 10k tasks (an order of magnitude larger than previously available datasets) labeled with location and time meta-data (§3).
3. Perform a detailed analysis of the annotations, demonstrating that they capture intuitively reasonable regularities (§4).
4. Conduct a modeling experiment to demonstrate the viability of the data for machine learning applications, setting up the novel benchmark tasks of Co-location and Co-time prediction in the process (§5).
5. Present future directions, including further applications and additional contexts (§6).
Previous investigations in a few distinct areas are relevant to the research described in this paper, and include work on task management, context-aware computing, and task data. Task management systems that assist people in successfully managing, prioritizing, structuring, organizing and completing tasks have been the subject of a large body of research. There are many applications and assistants on the market today, including Amazon Alexa, Google Assistant, todoist, Trello, Microsoft To-Do, to name only a few.
It has long been recognized that task management does not take place in isolation but rather in the rich context of daily life. Hence research on task management and assistants is situated within the broader research area of context-aware computing. Location and time are two important contextual factors for both personal task assistance (Benetka et al., 2019; Wang and Pérez-Quiñones, 2014) and work-related task management (Anhalt et al., 2001; Rhodes, 1997). Even before the advent of small computing devices such as cellphones that allow location-aware computing, actions as simple as setting an alarm or a time-based reminder made it abundantly clear that there is a strong need for temporal context in successful task management. Bellotti et al. (2004) point out that co-location of tasks minimizes the need for multiple excursions for a user and that certain kinds of tasks show effects of periodicity, in terms of certain days, weeks and times of year. Graus et al. (2016) study reminder data from the Cortana personal assistant. They establish that different task types cluster around different user-selected notification times. Communication task reminders, for example, tend to have a notification time that falls into typical work hours, chore notifications are more commonly set to early morning hours, and tasks that involve moving to a specific location tend to have reminders set for lunch time at work. Interestingly, in their dataset, knowing the time at which a reminder was created is a stronger predictor of notification time than the task title itself. Ludford Finnerty et al. (2006) highlight the fact that not only location but also movement patterns detected by a cellphone are important. By combining location with time information, context-aware task management systems can better assist users by suggesting the right tasks at the right time (Kessell and Chan, 2006; Rhodes, 1997). However, the role of context in task management is still only partially understood. Beyond task management, in the wider arena of context-aware computing and recommendation systems, time and location signals are also fundamental. Context-aware recommendation of points of interest (POI) is typically studied based on check-in data from social network applications such as Foursquare and uses location as well as time (Yao et al., 2015; Zhao et al., 2019). POI recommendation also has a strong social component, making it amenable to the use of collaborative filtering signals in addition to spatiotemporal context (Yuan et al., 2013). Other applications include generating online recommendations for advertising and news, based on temporal patterns (Zeng et al., 2016), and recommending entities based on historic user behavior and spatio-temporal sensor context (Zhuang et al., 2011). Other broad research areas where time and location (as well as social context and travel trajectory) provide important information are mobile search (Teevan et al., 2011; Amini et al., 2012), opportunistic routing (driving assistance) (Horvitz and Krumm, 2012), and information retrieval in general (Bennett et al., 2011; Radinsky et al., 2013). Despite the extensive research in task management and assistance, there is very little publicly available data. Typically, the data sources used in the literature are proprietary and subject to strict privacy requirements. To the best of our knowledge, Landes and Eugenio (2018) are the only researchers to have publicly released a dataset of to-do tasks.
It consists of approximately 600 tasks provided by users of Trello who volunteered their data, as well as from public Trello boards. The tasks are annotated for (a) an intelligent agent that would be appropriate to assist with the task (the set of 15 agents serves as a task taxonomy), and (b) the argument(s) of the action expressed in the task. In comparison, MS-LaTTE is the first public large-scale dataset of real-world to-do tasks, consisting of more than 10k instances. It is also the only dataset to contain annotated meta-data about the locations and times at which tasks are likely to be completed. In this section, we describe the annotation setup for collecting the MS-LaTTE dataset. Our goal is to source real-world tasks and capture the locations and times at which they are usually completed. We source tasks for our dataset from a sample of the logs of the now-defunct Wunderlist application. These logs are only obtained after a thorough legal- and trust-approved enterprise-grade pipeline processes them to anonymize and scrub all personally identifiable information. In addition, the pipeline performs k-anonymization so that tasks that were created by fewer than five users or fewer than 100 times in total are automatically discarded. The result is an aggregate view of the logs, devoid of any identifiers, private information or infrequent tasks that can be correlated back to a user. What remains is a collection of task titles (such as "buy milk", "mow the lawn" etc.) along with list titles to which they are commonly assigned by users (such as "groceries", "home chores", etc.). This aggregate view provides a rich collection of real-world to-dos from which we sample items for annotation. Unfortunately, grocery- and, to a lesser extent, packing-related tasks are over-represented in the aggregate data, accounting for over 70% of distinct items, by our estimates. Therefore, rather than sample uniformly at random, we use heuristics to under-sample grocery and packing tasks. This avoids a resulting annotated dataset where a large part is trivially assigned a single location (or time) label. Our heuristic uses a manually curated set of the most popular list titles from the data related to groceries and packing (e.g. "grocery", "safeway", "packing list"), and samples task titles from these lists at a lower rate than from other lists. Specifically, we use a 10-10-80 percentage sampling probability for grocery, packing, and other lists respectively. In this manner, we sample a total of 12,000 distinct task-list pairs that are subsequently annotated with location and time information. We used an internal crowd-sourcing platform to delegate work to non-expert workers in India contracted to perform annotations. These workers are paid a fixed hourly rate rather than an amount per HIT completed, which, in our experience, disincentivizes cheating and has led to better annotation quality. We conducted the annotation over the sampled collection of task-list pairs in two distinct stages: first for location, then for time. This ordering was motivated by a couple of initial pilot annotation rounds that demonstrated that annotators agreed on location to a much higher degree than on time; this makes sense, since tasks are much more likely to be completed at the same location by different users than at the same time (due to individual schedules and preferences). While the annotation stages are broadly similar, there are a few important differences that we highlight in what follows.
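As a rough sketch of the list-based under-sampling described above, the snippet below draws task-list pairs with a 10-10-80 grocery/packing/other mix. The keyword sets, function names, and sampling routine are illustrative assumptions, not the pipeline that was actually used.

```python
import random

# Illustrative stand-ins for the curated grocery/packing list-title keywords
# (the actual curated set is not released with the paper).
GROCERY_TITLES = {"grocery", "groceries", "safeway", "shopping"}
PACKING_TITLES = {"packing", "packing list", "travel"}

BUCKET_WEIGHTS = {"grocery": 0.10, "packing": 0.10, "other": 0.80}

def bucket_of(list_title: str) -> str:
    """Map a list title to a sampling bucket via simple keyword matching."""
    title = list_title.lower()
    if any(k in title for k in GROCERY_TITLES):
        return "grocery"
    if any(k in title for k in PACKING_TITLES):
        return "packing"
    return "other"

def sample_pairs(task_list_pairs, n=12000, seed=0):
    """Draw n (task, list) pairs, under-sampling grocery and packing lists."""
    rng = random.Random(seed)
    by_bucket = {"grocery": [], "packing": [], "other": []}
    for task, list_title in task_list_pairs:
        by_bucket[bucket_of(list_title)].append((task, list_title))
    sampled = []
    for bucket, weight in BUCKET_WEIGHTS.items():
        k = min(int(n * weight), len(by_bucket[bucket]))
        sampled.extend(rng.sample(by_bucket[bucket], k))
    rng.shuffle(sampled)
    return sampled
```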
Annotators are presented with an interface like the one in Figure 1, where the unit for a HIT is a single task-list pair. After providing their consent to have their annotations collected for research and machine learning, annotators are first asked whether they are familiar with or have performed the task in the task-list pair before. If they answer in the negative, the HIT is considered complete and they are allowed to move on to the next one. This is consistent across both stages of annotation. If, however, they answer in the affirmative, they are then asked to provide labels for location or time. Specifically, they are asked WHERE or WHEN they would normally do this task, and they are permitted to select more than one label. For location, annotators are provided with four broad categories from which to select: (a) Home (b) Work (c) A public location (d) Somewhere else (along with a free-form text box). Note that our guidelines broadly specify that "Work" can cover a number of locations, depending on a task creator's likely primary vocation, including "school" or "college" for students. Further, if (c) is selected, annotators are asked to specify at least one of several public location labels (e.g., "grocery store", "dentist", "library" etc.). Initially, a list of 36 such public locations was manually curated from a taxonomy of map location categories. Then, over a few initial pilot rounds of annotation, inputs from option (d) were used to refine this list into a final set of 69 public location labels. For annotator convenience, public locations are manually organized into seven broad categories such as "retail", "recreation", "finance" etc.; these are only shown in the annotation interface, and do not form part of the final dataset. It should be noted that annotators were instructed to respond by providing labels for physical locations of task completion. That is, even if a task may be completed online, the labels reflect the physical locations at which it is normally completed. This was done to reduce confusion and ambiguity, as well as to make the data more directly grounded in the real world and usable by future applications that can leverage geo-location information. Tasks that are completed online are nevertheless a very interesting area of research that we hope to explore in future work.

Figure 1: The interface used for collecting location and time for task completion. Question 2 differs between stages of annotation; in this example an annotator is asked to provide time labels.

Annotations were collected between May and August 2020, during the COVID-19 pandemic, when boundaries between Home, Work and other public locations were sometimes blurred. To account for this, annotators were instructed to answer questions using pre-COVID times as a contextual frame of reference. In the first stage, each of the 12,000 HITs was labeled by three annotators. Those HITs that were marked as unfamiliar or unknown to two or more annotators were discarded from the dataset. Furthermore, in instances where all three annotators did not have a single location label in common, the first author of this paper acted as a fourth annotator; 365 HITs were thus supplementally annotated. The few remaining instances where there was four-way disagreement on location labels were also subsequently discarded. In the end, a total of 1,899 task-list pairs from the original set of 12,000 were removed, leaving 10,101 tasks. The remaining tasks were annotated for time in the second stage of annotation.
Annotators were asked to select one or more time labels from a set of 10 time buckets (as shown in Figure 1), each specified by two distinct dimensions: the time of the day, and the day of the week. Times of day included: (a) Morning (b) Afternoon (c) Evening (d) Night, and (e) Anytime; while days of the week could be either: (i) Weekday, or (ii) Weekend. Annotators were instructed to interpret the different times of day according to their own frame of reference (for example, Morning might mean 5am-8am to an early riser, but 8am-11am to someone else), thereby allowing for label alignment across annotators and tasks at a conceptual level rather than according to strictly defined time buckets. Additionally, they were told to use Anytime when they were no more likely to complete a task at any one of the other times of day. As with location labeling, annotators were asked to consider their pre-pandemic schedules and propensities for doing tasks when judging the data. Since WHEN someone completes a task is altogether more subjective than WHERE, it is expected that there will be a great deal more variation in labels for the second stage of annotation. In fact, it would be more appropriate to call time annotation a survey rather than a labeling task. To account for and capture some of this variability we asked five annotators to label each of the 10,101 task-list pairs in the dataset, and none of the responses were discarded. Of course, even five annotations is insufficient to fully capture the breadth of preferences for when tasks are completed; however, the dataset we have collected allows for easy extension of time information with more responses in future work. In summary, the final MS-LaTTE dataset -- collected over two stages of annotation -- consists of 10,101 real-world task-list pairs, each of which includes a set of labels for location from three annotators, and for time from five annotators. To the best of our knowledge, it is the largest (by an order of magnitude) dataset of to-do like tasks sourced from real users, and the only one to contain explicit contextual information in the form of locations and times at which tasks are completed. In this section, we describe an analysis of the MS-LaTTE dataset. We begin by measuring agreement between annotators. We then investigate the distributions of labels across location and time independently, and what annotator consensus looks like for each. Finally, we examine the relationship between location and time labels by performing a cross-correlation exploration of the data. We use Krippendorff's Alpha (Krippendorff, 2011) to measure the degree of annotator agreement. Since annotators can provide multiple labels for instances in the dataset, we apply the MASI (Passonneau, 2006) distance metric, a measure of agreement over set-valued objects. This setup yields agreement values of 0.50 and 0.09 for Location and Time respectively, which in turn are considered moderate and poor degrees of inter-rater agreement (McHugh, 2012). As noted in §3.2, we hypothesized that labels on Time are subjective and therefore the low annotator agreement is expected. The fact that annotators can provide multiple labels for each instance also negatively impacts the inter-rater reliability metric. If singleton labels -- that is, those that were provided only by one annotator for a given instance -- are removed, Krippendorff's Alpha values become 0.87 and 0.26, which are excellent and fair for location and time respectively.
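This style of agreement computation can be reproduced, for instance, with NLTK's implementations of Krippendorff's Alpha and the MASI distance; the toy annotations below are made up for illustration and are not drawn from MS-LaTTE.

```python
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record is (annotator_id, item_id, label_set); labels are frozensets
# because an annotator may select more than one label per task.
records = [
    ("a1", "buy milk", frozenset({"grocery store"})),
    ("a2", "buy milk", frozenset({"grocery store", "convenience store"})),
    ("a3", "buy milk", frozenset({"grocery store"})),
    ("a1", "mow the lawn", frozenset({"home"})),
    ("a2", "mow the lawn", frozenset({"home"})),
    ("a3", "mow the lawn", frozenset({"home", "home & garden"})),
]

task = AnnotationTask(data=records, distance=masi_distance)
print("Krippendorff's alpha with MASI distance:", task.alpha())
```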
In other words, while human judges may have individual preferences for where or when they expect to complete tasks, they are more likely to agree on a core set of locations or times at which these tasks should be completed. This is important, not only to contextualize the agreement numbers, but because it means that we can leverage the existence of this core agreement set for predictive modeling. We now turn to the analysis of the dataset itself, beginning with a snapshot view of the label distributions over the location and time portions of the data, in the form of histograms. They are shown in Figures 2a and 2b respectively. Note that log counts (base 2) are used instead of raw counts to show differences between labels more clearly. We plot the histogram over labels on which there was majority agreement between annotators; this is to mitigate any impact that singleton labels (as noted in §4.1) may have on observable trends. Figure 2a demonstrates that home and work are by far the most popular locations for tasks in the dataset, and are an order of magnitude more frequent than any of the other labels. Grocery stores are the third most frequent label, despite the heuristics that we used to under-sample grocery related tasks (§3.1). The rest of the labels, corresponding to purchase or errand related locations (and including the remaining 60 not shown on the figure), follow a long-tailed distribution. These general trends align with our observations about the larger aggregate Wunderlist logs, which form the basis for the MS-LaTTE dataset; namely that grocery, home, and work-related tasks form the overwhelming majority of tasks that users like to track in to-do related apps. Meanwhile, Figure 2b also displays some interesting trends over the distribution of time labels. Weekday evenings appear to be the most active time for completing tasks, perhaps because users are likely to be at home, at work, or running errands (for example at a grocery store), where many tasks are frequently completed. The popularity of home or work related tasks is also potentially a contributing factor to why mornings and afternoons are additional common times for completing tasks. A more in-depth analysis of cross-correlations between location and time labels is presented in §4.4. The difference between weekdays and weekends also reveals some interesting properties. While people are generally less active completing tasks on weekends (sometimes starkly, such as in the evenings), they are actually slightly more active on weekend mornings -- perhaps due to home chores or other errands that are set aside specifically for non-working days. Additionally, tasks that are categorized as being done anytime (often very short, simple tasks that require little planning, as we show in §4.3) are more likely to be completed on weekends than on weekdays. This is perhaps due to weekends providing more leisure time for unplanned tasks. Our computation of annotator agreement by discarding singleton labels (§4.1) seemed to indicate that annotators tend to agree on core sets of labels for specific tasks, especially in the case of location. Thus, we now provide examples of high-agreement labels for both location and time, to show that the information provided by annotators captures reasonable location and temporal expectations for task completion. Tables 1 and 2 give some examples of tasks that were assigned the same label by a majority of annotators; not all labels are included in either table due to space constraints.
Table 1: Example tasks that a majority of annotators assigned to each location label (subset of labels shown).
home: rearrange closet; fix tv remote; put on license sticker
work: meeting tasks; sociology paper; finish udemy course
office supply: buy envelopes; buy sharpies; get packing tape
pharmacy: dr refill; zzzquil liquid; pick up relpax
electronics store: bring in headphone; dad speakers; more usb cables
clothing store: office attire; astronaut costume; scouts uniform
hardware store: get a tape measure; pressure washer part; make house key copy

Table 1 demonstrates that location labels are often incontrovertible and that annotators are able to correctly agree on the most likely location for tasks, despite the more than 70 possible labels to choose from. Meanwhile, Table 2 also shows that when the majority of annotators agree on a time bucket for a task, they select labels that are very reasonable. In other words, while individual people may choose to complete some of these example tasks at different times, due to personal preference or schedule, when several judges agree on a time bucket for a task, it appears to be a label that is easily interpretable -- for example, work tasks on weekday afternoons, or errands and hobbies in the evenings. Additionally, as previously noted, the anytime label is often associated with very short tasks that do not require prior planning. While Tables 1 and 2 provide agreement on single labels, a reasonable follow-up question is to ask whether multi-label annotations also capture useful signal. We attempt to answer this question by calculating the point-wise mutual information (PMI) between location and time labels independently, when these labels are applied to the same task by annotators. Intuitively, the PMI captures the degree to which labels are co-assigned, thereby providing a way to assess the interpretability of multi-label annotations.

Table 3: Examples of location and time labels that were commonly co-assigned by annotators, as measured by point-wise mutual information.

Table 3 provides examples of such co-assignments. Each column gives the top three labels (by PMI) that co-occur with the label in the column header. Note that items within a column are unrelated, and their only shared connection is co-occurrence with the column header. Only a subset of location and time labels are provided, for brevity. As can be seen from the table, co-assignment by annotators often makes intuitive sense. For example, laundry can be done either at home or at a laundromat, home & garden stores sell similar goods to hardware stores, grocery stores and pharmacies are often co-located in the same building, and short unplanned tasks can be done anytime on weekdays or weekends. Up to this point, our analyses have looked at the location and time parts of the dataset separately, but since each of the 10,101 tasks in the dataset is annotated with both types of labels we can also conduct an analysis that looks at them jointly. Specifically, we can investigate whether location labels assigned to tasks make intuitive sense with respect to time, and vice versa. Stack plots attempting to answer the question of cross-correlation between label sets are presented in Figures 3a and 3b. For each pair of location category and time bucket, we count the number of tasks that were assigned both labels (each by majority agreement). Then we marginalize over time and location to get the stacked bars in Figures 3a and 3b, respectively. Note that we only use the 10 most popular location labels in this analysis.
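The co-assignment analysis behind Table 3 can be approximated by a straightforward PMI computation over multi-label responses. The function below is a generic sketch that assumes each response is the set of labels one annotator gave to one task; it is not the authors' exact implementation.

```python
import math
from collections import Counter
from itertools import combinations

def coassignment_pmi(label_sets):
    """PMI between label pairs that are assigned together in the same response.

    label_sets: list of label sets, one per (task, annotator) response.
    Returns {(label_a, label_b): pmi} for every co-assigned pair.
    """
    n = len(label_sets)
    single = Counter()
    joint = Counter()
    for labels in label_sets:
        single.update(labels)
        joint.update(combinations(sorted(labels), 2))
    pmi = {}
    for (a, b), c_ab in joint.items():
        p_ab = c_ab / n
        p_a, p_b = single[a] / n, single[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

# Example: one annotator labeled "do laundry" with {home, laundromat}, etc.
responses = [{"home", "laundromat"}, {"home"}, {"grocery store", "pharmacy"}]
scores = coassignment_pmi(responses)
```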
Table 4: Examples of some location and time labels that were highly correlated, as measured by point-wise mutual information.

These plots reveal some interesting and intuitively sensible findings. For example, in Figure 3a, home and work are the only two locations where tasks are completed in every time bucket; this makes sense since people spend the most time at these two locations. Another interesting observation is the relative proportion of tasks done on weekends at different public locations. For example, far more grocery, home & garden or hardware tasks are completed on weekends than bank tasks, since the expectation is that the former are open for business while the latter are not (or may only be open with limited hours). Figure 3b also contains some noteworthy details. One example is that tasks assigned anytime labels are only done at home or at work; this makes a lot of sense, considering that they are typically short, unplanned ones (§4.3), and that tasks completed at any other public location typically require some form of forethought or planning. Another example is the relative proportion of work-related tasks completed on different days of the week (especially in the mornings and afternoons), which conforms to the expectation that most people work on weekdays, rather than on weekends. Figures 3a and 3b present a broad picture of cross-correlation between location and time labels, but it may be useful to also explore a more focused view of cross-correlation. Table 4 presents some examples of the most highly correlated pairs of labels, as measured by PMI. This value is computed from probabilities of independent and joint label occurrences, which are obtained from counts over the location and time labels that received majority agreement. Because PMI is sensitive to very infrequent events, we discarded location labels that have fewer than five tasks associated with them. As can be seen from the table, these highly correlated label pairs make intuitive sense: for example, the fact that restaurant and library are associated with the night time bucket on both weekdays and weekends (i.e., date nights or study sessions), that gym is associated with WE morning (i.e., early workout on weekends), or that dmv is associated with WD afternoon (i.e., less busy times when people are often at work). In summary, this section has presented an analysis of the MS-LaTTE dataset. Our main takeaway from this analysis is the fact that while the data does contain some expected variance -- in the form of individual latitude for times at which tasks are completed -- interesting and often intuitively reasonable properties of task completion are captured by annotations with majority agreement. This motivates the use of MS-LaTTE for the learning effort we tackle in §5, as well as for future work that we hope will leverage the dataset for modeling task intelligence. Given the annotations in the dataset, there are many interesting predictive problems that can be tackled, such as predicting the location or time bucket most likely to support some task activity, predicting when (resp. where) a task should happen given a user's set location (resp. time bucket) and description, or even scheduling a user's day using their time commitments and likely locations.
However, in this paper we seek only to describe and validate the MS-LaTTE dataset and therefore tackle the foundational modeling problems of predicting whether two tasks are likely to be completed together (by location or time); other modeling efforts are left to the community and to future work. This leads us to the two benchmark tasks of co-location prediction and co-time prediction. Note that while these tasks may be simpler than some of the more ambitious modeling challenges described above, a model that successfully tackles the former may indicate approaches that might succeed on the latter. Moreover, co-location prediction and co-time prediction are meaningful modeling efforts in and of themselves, leading to potential user-facing scenarios such as alerting users working on a given task to other tasks they can complete, based on their current location or time (or both). Formally, given two tasks T1 = (t1, l1) and T2 = (t2, l2), where t and l are task and list descriptions respectively, the problems of co-location and co-time are to find binary predictive functions f(T1, T2) → {0, 1}. We generate benchmark datasets for evaluating these two predictive tasks from the annotated MS-LaTTE dataset. We sample 25,000 task pairs from the cross product of the 10,101 unique tasks in MS-LaTTE, using 20,000, 1,000, and 4,000 respectively for training, validation, and test splits. Pairs are assigned a positive label if they contain at least one common label that was assigned by a majority of annotators; otherwise they are assigned a negative label. While pairs are unique, tasks themselves may repeat. To ensure fair evaluation we stratify the dataset so that tasks that appear in any split do not appear in any other split. It may be noted that the resulting benchmark datasets are imbalanced, containing roughly 71/29 and 38/62 positive/negative splits for location and time respectively. The train, validation, and test splits for the co-location and co-time benchmarks are released with the dataset. In this paper, we report accuracy and Macro F1 (due to the imbalance in the datasets) as evaluation metrics for the two evaluation tasks. We evaluate several popular language modeling approaches in this paper. They include: (a) Random -- a baseline that randomly assigns a positive or negative label to an instance, using the ratios in the training data as bases for sampling a label. (b) Lexical -- a model that featurizes the task and list strings of both inputs and uses uni-, bi- and tri-gram features in a logistic regression classifier trained and tuned on the train and validation splits respectively. (c) GloVe -- a model that uses the popular GloVe vectors (Pennington et al., 2014) as features for lexical items in the task and list strings. An average of the GloVe vectors is passed through a logistic classifier that is trained and tuned on the train and validation splits respectively. (d) BERT -- a model that is similar to the GloVe model above, but uses pre-trained BERT embeddings (Vaswani et al., 2017) instead. Because these representations capture full strings as opposed to individual tokens, the embeddings for T1 and T2 are concatenated rather than averaged.
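As a minimal sketch of the Lexical baseline described above, the snippet below uses scikit-learn and simply joins both tasks' task and list strings into one string before n-gram featurization; this joining scheme and the toy training pairs are assumptions for illustration, and the paper's exact featurization may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_to_text(t1, l1, t2, l2):
    """Flatten a task pair (task title, list title for each side) into one string."""
    return f"{t1} {l1} {t2} {l2}"

# Toy pairs: label 1 = the two tasks share a majority-agreed label, 0 = they do not.
X_train = [pair_to_text("buy milk", "groceries", "get eggs", "shopping"),
           pair_to_text("mow the lawn", "home chores", "file taxes", "finance")]
y_train = [1, 0]

lexical_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),   # uni-, bi- and tri-gram features
    LogisticRegression(max_iter=1000),
)
lexical_model.fit(X_train, y_train)
print(lexical_model.predict([pair_to_text("buy butter", "groceries",
                                          "get cheese", "shopping")]))
```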
Finally, we compare these models against a more sophisticated fine-tuned BERT model. Like the pre-trained BERT model, this one also featurizes both task inputs T1 and T2. But it uses a composition variant that has been successfully applied in past work to problems that involve dual string comparison (Mou et al., 2015), such as textual entailment:

c = [BERT(T1) ; BERT(T2) ; BERT(T1) • BERT(T2)]    (1)

where the semi-colon represents concatenation, and the • signifies the element-wise product. In our model (BERT TE-FT) the representation c is then passed through a non-linear layer, before a final linear layer with a sigmoidal output produces a binary prediction value. The model uses an intermediate layer of dimension 256 followed by a ReLU non-linearity; it also applies a dropout factor of 0.5 during training to avoid overfitting. We use binary cross-entropy as the loss function, and we train the model using an Adam optimizer with a fixed weight decay (Loshchilov and Hutter, 2017), and initial hyperparameters of lr = 1e-5, eps = 1e-8.
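A minimal PyTorch sketch of a pair classifier in the spirit of BERT TE-FT is given below. The use of the [CLS] vector as the task encoding, the bert-base-uncased checkpoint, joining task and list titles with [SEP], and BCEWithLogitsLoss in place of an explicit sigmoid are all assumptions made for illustration rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PairClassifier(nn.Module):
    """Encode two tasks with BERT, combine [e1 ; e2 ; e1 * e2], then classify."""

    def __init__(self, encoder_name="bert-base-uncased", hidden=256, dropout=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(3 * dim, hidden),  # intermediate dimension of 256
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),        # single binary logit
        )

    def encode(self, batch):
        return self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] token

    def forward(self, batch1, batch2):
        e1, e2 = self.encode(batch1), self.encode(batch2)
        c = torch.cat([e1, e2, e1 * e2], dim=-1)  # concatenation + element-wise product
        return self.head(c).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PairClassifier()
t1 = tokenizer(["buy milk [SEP] groceries"], return_tensors="pt", padding=True)
t2 = tokenizer(["get eggs [SEP] shopping list"], return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, eps=1e-8)
loss = nn.BCEWithLogitsLoss()(model(t1, t2), torch.tensor([1.0]))
loss.backward()
optimizer.step()
```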
Table 5: Results of several models on the binary prediction tasks of co-location and co-time detection.

The results of our evaluation are given in Table 5. They demonstrate that while all models are capable of outperforming the random baseline, they do so with varying degrees of success. The best model is the BERT TE-FT model, which outperforms the other approaches by a significant margin on both benchmark tasks. These improvements are statistically significant at a p-value of 0.01 based on a paired Student's t-test. The lexical model proves surprisingly capable, outperforming both GloVe and BERT models; this is possibly because the training data size is fairly large and contains decent coverage of lexical terms, negating the need for the fuzzy matching afforded by fixed embeddings. An interesting avenue for future work is to compare models on varying amounts of training data. Notably, both tasks are challenging (with co-time prediction being significantly more so), and there is likely room for improvement over the simple approaches described in this paper. While BERT TE-FT is clearly the best model on both benchmark tasks, we can also attempt to evaluate how much its distinctive features contribute by conducting an ablation study. Specifically, we ablate the model by: (a) freezing the parameters of the BERT model (effectively negating the effects of fine-tuning); or (b) using a simple concatenation of T1 and T2 instead of the more complex vector composition variant in Equation 1. The results of our ablation study are given in Table 6. They show that while both components influence the full model positively, fine-tuning is clearly a far more important positive factor. This seems to indicate that the language used in to-do tasks is different from general purpose text on which BERT is trained, and thus benefits from fine-tuning to the domain. In conclusion, we presented the two new benchmark tasks of co-location prediction and co-time prediction, derived from the MS-LaTTE dataset. We compared a number of popular language modeling approaches on these benchmarks and showed that they can indeed be modeled successfully. Notably, a model containing fine-tuned BERT seemed to perform best. However, both benchmarks are challenging and present future work with interesting possibilities to outperform the simple models in this paper with more sophisticated approaches. We have publicly released a new dataset of to-do tasks called MS-LaTTE. This dataset contains location and time labels from multiple annotators for every one of its 10,101 tasks, and is the first to contain such contextual information surrounding task completion. It is also the largest publicly available dataset of real-world to-do tasks of any kind, by an order of magnitude. In this paper, we have described the setup used to collect and annotate the dataset, conducted a detailed analysis of its labels and properties, and performed experimental evaluations on two novel benchmark tasks -- co-location prediction and co-time prediction -- derived from the dataset. We found that the data captures several intuitive regularities, and that these regularities can be modeled by popular language modeling techniques -- including, most successfully, by a BERT fine-tuned approach. We anticipate that the release of MS-LaTTE will spur the research community to work on contextual task modeling, and more generally on task intelligence. Despite the utility we hope the community will derive from this dataset, it does have some limitations. We rely on third-party judges' interpretations of tasks that they did not create themselves, and for which they have very little information besides the raw textual representation. Moreover, annotators are all from a single locale (India) and thus may miss broader cultural or country-specific subtleties that are outside their field of experience. These issues may have led to labels that are erroneous or noisy due to misconstrued intent. To resolve these and other issues, there are several interesting and challenging research directions that we hope to pursue in future work. They include: (a) alternative mechanisms to gather context information for tasks directly from individuals, such as in-situ data collection and experience sampling methods; (b) leveraging contexts beyond time and place, such as people, activity, busyness and resources required (digital as well as physical); (c) using MS-LaTTE to model novel contextual prediction and recommendation scenarios, such as predicting when or where a task is most likely to be completed, ranking tasks according to their likelihood of being completed given location, time or both, or recommending other tasks based on what a user is currently doing, where and/or when; (d) personalization of models to individuals or cohorts, based on factors such as occupation, preferences and habits; and (e) integration of models in real-world task management systems and digital assistants, with user studies and online experimentation to determine downstream impact on users' productivity.
Trajectory-aware mobile search
Toward context-aware computing: experiences and lessons
Taking email to task: the design and evaluation of a task management centered email tool
What a to-do: studies of task management towards the design of a personal task list manager
Understanding context for tasks and activities
Inferring and using location metadata to personalize web search
Analyzing and predicting task reminders
Some help on the way: opportunistic routing under uncertainty
Jogger: models for context-sensitive reminding
Castaway: a context-aware task management system
Computing Krippendorff's alpha-reliability
A supervised approach to the interpretation of imperative to-do lists
Because I carry my cell phone anyway: functional location-based reminder applications
Interrater reliability: the kappa statistic
Natural language inference by tree-based convolution and heuristic matching
Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation
GloVe: global vectors for word representation
Behavioral dynamics on the web: learning, modeling, and prediction
The wearable remembrance agent: a system for augmented memory
Understanding the importance of location, time, and people in mobile local search behavior
Attention is all you need
Exploring the role of prospective memory in location-based reminders
Task duration estimation
Context-aware point-of-interest recommendation using tensor factorization with social regularization
Time-aware point-of-interest recommendation
Online context-aware recommendation with time varying multi-armed bandit
Learning to decompose and organize complex tasks
Grounded task prioritization with context-aware sequential ranking
Where to go next: a spatio-temporal gated network for next POI recommendation
When recommendation meets mobile: contextual and personalized recommendation on the go