title: Identifying Moments of Change from Longitudinal User Text
authors: Tsakalidis, Adam; Nanni, Federico; Hills, Anthony; Chim, Jenny; Song, Jiayu; Liakata, Maria
date: 2022-05-11

Identifying changes in individuals' behaviour and mood, as observed via content shared on online platforms, is increasingly gaining importance. Most research to date on this topic focuses on either: (a) identifying individuals at risk or with a certain mental health condition given a batch of posts, or (b) providing equivalent labels at the post level. A disadvantage of such work is the lack of a strong temporal component and the inability to make longitudinal assessments following an individual's trajectory and allowing timely interventions. Here we define a new task, that of identifying moments of change in individuals on the basis of their shared content online. The changes we consider are sudden shifts in mood (switches) or gradual mood progression (escalations). We have created detailed guidelines for capturing moments of change and a corpus of 500 manually annotated user timelines (18.7K posts). We have developed a variety of baseline models drawing inspiration from related tasks and show that the best performance is obtained through context-aware sequential modelling. We also introduce new metrics for capturing rare events in temporal windows.

Linguistic and other content from social media has been used in a number of different studies to obtain biomarkers for mental health. This is gaining importance given the global increase in mental health disorders, the limited access to support services and the prioritisation of mental health as an area by the World Health Organization (2019). Studies using linguistic data for mental health focus on recognising specific conditions related to mental health (e.g., depression, bipolar disorder) (Husseini Orabi et al., 2018), or identifying self-harm ideation in user posts (Yates et al., 2017; Zirikly et al., 2019). However, none of these works, even when incorporating a notion of time (Lynn et al., 2018; Losada et al., 2020), identify how an individual's mental health changes over time. Yet being able to make assessments on a longitudinal level from linguistic and other digital content is important for clinical outcomes, especially in mental health (Velupillai et al., 2018).

Figure 1: Example of an Escalation (with a darker "peak") and a Switch within a user's timeline.

The ability to detect changes in individuals' mental health over time is also important in enabling platform moderators to prioritise interventions for vulnerable individuals (Wadden et al., 2021). Users who currently engage with platforms and apps for mental health support (Neary and Schueller, 2018) would also benefit from being able to monitor their well-being in a longitudinal manner. Motivated by the lack of longitudinal approaches, we introduce the task of identifying 'Moments of Change' (MoC) from individuals' shared online content. We focus in particular on two types of changes: Switches - mood shifts from positive to negative, or vice versa - and Escalations - gradual mood progression (see Fig. 1, detailed in §3).
Specifically we make the following contributions:
• We present the novel task of identifying moments of change in an individual's mood by analysing linguistic content shared online over time, along with a longitudinal dataset of 500 user timelines (18.7K posts, English language) from 500 users of an online platform.
• We propose a number of baseline models for automatically capturing Switches/Escalations, inspired by sentence- and sequence-level state-of-the-art NLP approaches in related tasks.
• We introduce a range of temporally sensitive evaluation metrics for longitudinal NLP tasks adapted from the fields of change point detection (van den Burg and Williams, 2020) and image segmentation (Arbelaez et al., 2010).
• We provide a thorough qualitative linguistic analysis of model performance.

Social Media and Mental Health
Online user-generated content provides a rich resource for computational modelling of wellbeing at both population and individual levels. Research has examined mental health conditions by analysing data from platforms such as Twitter and Reddit (De Choudhury et al., 2013; Coppersmith et al., 2014; Cohan et al., 2018) as well as peer-support networks such as TalkLife (Pruksachatkun et al., 2019). Most such work relies on proxy signals for annotations (e.g., self-disclosure of diagnoses, posts on support networks) and is characterised by a lack of standardisation in terms of annotation and reporting practices (Chancellor and De Choudhury, 2020). Much of it also operates on individual posts or on a user's posts taken as a batch (Benton et al., 2017; Kshirsagar et al., 2017; Yates et al., 2017; Husseini Orabi et al., 2018; Jiang et al., 2020; Shing et al., 2020). Researchers are increasingly adopting sequential modelling to capture the temporal dynamics of language use and mental health. For example, Cao et al. (2019) encode microblog posts using suicide-oriented embeddings fed to an LSTM network to assess suicidality risk at the post level. Sawhney et al. (2020b, 2021) improve further on predicting suicidality at the post level by jointly considering an emotion-oriented post representation and the user's emotional state as reflected through their posting history with temporally aware models. The recent shared tasks in eRisk also consider sequences of user posts in order to classify a user as a "positive" (with respect to self-harm or pathological gambling) or "control" case (Losada et al., 2020; Parapar et al., 2021). While such work still operates at the post or user level, it highlights the importance of temporally aware modelling.

Related Temporal NLP Tasks
Semantic change detection (SCD) aims to identify words whose meaning has changed over time. Given a set of word representations in two time periods, the dominant approach is to learn the optimal transformation using Orthogonal Procrustes (Schönemann, 1966) and measure the level of semantic change of each word via the cosine distance of the resulting vectors (Hamilton et al., 2016). A drawback of this is the lack of connection between consecutive windows. Tsakalidis and Liakata (2020) addressed this through sequential modelling, by encoding word embeddings in consecutive time windows and taking the cosine distance between future predicted and actual word vectors. Both approaches are considered as baselines for our task. First story detection (FSD) aims to detect new events reported in streams of textual data. Having emerged in the Information Retrieval community (Allan et al., 1998), FSD has been applied to streams of social media posts (Petrović et al., 2010).
FSD methods assume that a drastic change in the textual content of a document compared to previous documents signals the appearance of a new story. A baseline from FSD is considered in §4.2.

We describe the creation of a dataset of individuals' timelines annotated with Moments of Change. A user's timeline $P^{(u)}_{s:e}$ is a subset of their history: a series of posts $[p_0, \dots, p_n]$ shared by user $u$ between dates $s$ and $e$. A "Moment of Change" (MoC) is a particular point or period (range of time points) within $[s, e]$ where the behaviour or mental health status of an individual changes. While MoC can have different definitions in various settings, in this paper we are particularly interested in capturing MoC pertaining to an individual's mood. Other types of MoC can include life events, the onset of symptoms or turning points (e.g., moments of improvement, difficult moments or moments of intervention within therapy sessions). We address two types of Moments of Change: Switches (sudden mood shifts from positive to negative, or vice versa) and Escalations (gradual mood progression from neutral or positive to more positive, or from neutral or negative to more negative). Capturing both sudden and gradual changes in individuals' mood over time is recognised as important for monitoring mental health conditions (Lutz et al., 2013; Shalom and Aderka, 2020) and is one of the dimensions to measure in psychotherapy (Barkham et al., 2021).

Individuals' timelines are extracted from TalkLife, a peer-to-peer network for mental health support. TalkLife incorporates all the common features of social networks - post sharing, reacting, commenting, etc. Importantly, it provides a rich resource for computational analysis of mental health (Pruksachatkun et al., 2019; Saha and Sharma, 2020) given that content posted by its users focuses on their daily lives and well-being. A complete collection between Aug'11-Aug'20 (12.3M posts, 1.1M users) was anonymised and provided to our research team in a secure environment upon signing a License Agreement. In this environment, 500 user timelines were extracted (§3.2) and an additional anonymisation step was performed to ensure that usernames were properly hashed when present in the text. The 500 timelines were subsequently annotated using our bespoke annotation tool (§3.3) to derive the resulting longitudinal dataset (§3.4).

Existing work extracts user timelines either based on a pre-determined set of timestamps (e.g., considering the most recent posts by a user) (Sawhney et al., 2020b) or by selecting a window of posts around mentions of specific phrases (e.g., around self-harm) (Mishra et al., 2019). The latter introduces potential bias into subsequent linguistic analysis (Olteanu et al., 2019), while the former could result in selecting timelines from a particular time period, hence potentially introducing temporally dependent linguistic or topical bias (e.g., a focus on the COVID-19 pandemic). Here we instead extract timelines around points in time where a user's posting behaviour has changed. Our hypothesis is that such changes in a user's posting frequency could be indicative of changes in their lives and/or mental health. Such an association between changes in posting behaviour on mental health fora and changes in mental health has been assumed in prior literature (De Choudhury et al., 2016).

Identifying changes in posting frequency
We create a time series of each user's daily posting frequency based on their entire history. We then employ a change-point detection model to predict the intensity of the user's daily posting frequency. Bayesian Online Change-point Detection (Adams and MacKay, 2007) with a Poisson-Gamma underlying predictive model (Zachos, 2018) was chosen, due to its highly competitive performance (van den Burg and Williams, 2020) and the fact that timelines extracted using this method had the highest density of MoC compared to a number of different timeline extraction methods (anomaly detection and keyword-based) on the same dataset.
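To make this step concrete, the sketch below illustrates Bayesian Online Change-point Detection with a Poisson-Gamma predictive model over a user's daily posting counts. It is a minimal sketch rather than the implementation used for the dataset: the hazard rate, the Gamma prior and the rule for flagging candidate change-points are assumptions chosen for readability.

```python
import numpy as np
from scipy.special import gammaln

def neg_binom_logpmf(x, alpha, beta):
    """Posterior predictive of a Poisson rate with a Gamma(alpha, beta) prior:
    a Negative Binomial over the next count x."""
    return (gammaln(x + alpha) - gammaln(alpha) - gammaln(x + 1)
            + alpha * np.log(beta / (beta + 1.0))
            + x * np.log(1.0 / (beta + 1.0)))

def bocpd_poisson(counts, hazard=1 / 100.0, alpha0=1.0, beta0=1.0):
    """Bayesian Online Change-point Detection (Adams & MacKay, 2007) for count data.
    Returns, for each day t, the probability that a change-point occurs at t."""
    T = len(counts)
    log_R = np.full((T + 1, T + 1), -np.inf)   # log run-length distribution
    log_R[0, 0] = 0.0
    alphas, betas = np.array([alpha0]), np.array([beta0])
    cp_prob = np.zeros(T)

    for t, x in enumerate(counts):
        pred = neg_binom_logpmf(x, alphas, betas)             # predictive per run length
        growth = log_R[t, : t + 1] + pred + np.log(1 - hazard)
        change = np.logaddexp.reduce(log_R[t, : t + 1] + pred + np.log(hazard))
        log_R[t + 1, 1 : t + 2] = growth
        log_R[t + 1, 0] = change
        log_R[t + 1] -= np.logaddexp.reduce(log_R[t + 1])      # normalise
        cp_prob[t] = np.exp(log_R[t + 1, 0])
        # update sufficient statistics for every possible run length
        alphas = np.concatenate(([alpha0], alphas + x))
        betas = np.concatenate(([beta0], betas + 1.0))
    return cp_prob

# toy usage: a user who posts ~1/day, then ~6/day after day 60
rng = np.random.default_rng(0)
daily = np.concatenate([rng.poisson(1, 60), rng.poisson(6, 30)])
scores = bocpd_poisson(daily)
change_days = np.where(scores > 0.5)[0]   # candidate centres of the 7-day extraction windows
```

In this sketch a day is flagged as a candidate change-point when the run-length-zero probability exceeds 0.5; the threshold is an illustrative choice, not a value reported in the paper.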
Extracting timelines around change-points
Upon detecting candidate MoC as change-points in posting frequency, we generated candidate timelines for annotation by extracting all of the user's posts within a seven-day window around each change-point. We controlled for timeline length (between 10 and 150 posts, set empirically) so that timelines were long enough to enable annotators to notice a change but not so long as to hinder effective annotation. This control for timeline length means that our subsequent analysis is performed (and models are trained and evaluated) on time periods during which the users under study are quite active; however, the upper bound of 150 posts in 15 days set for each timeline also ensures that we do not bias (or limit) our analysis towards extremely active users. Finally, to ensure linguistic diversity in our dataset, 500 timelines extracted in this way were chosen for annotation at random, each corresponding to a different individual. The resulting dataset consists of 18,702 posts (µ=35, SD=22 per timeline; range of timeline length=[10,124]; see Fig. 2(a)).

Annotation Interface
An annotation interface was developed to allow efficient viewing and annotation of a timeline (see snippet in Fig. 3). Each post in a timeline was accompanied by its timestamp, the user's self-assigned emotion and any associated comments (colour-coded, to highlight recurrent users involved within the same timeline). Given the context of the entire timeline, annotations for MoC are performed at the post level: if an annotator marks a post as a MoC, then they specify whether it is (a) the beginning of a Switch or (b) the peak of an Escalation (i.e., the most positive/negative post of the Escalation). Finally, the range of posts pertaining to a MoC (i.e., all posts in the Switch/Escalation) needs to be specified.

Data annotation
After a round of annotations for guideline development with PhD students within the research group (co-authors of the paper), we recruited three external annotators to manually label the 500 timelines. They all have university degrees in humanities disciplines and come from three different countries; one of them is a native English speaker. Annotators were provided with a set of annotation guidelines containing specific examples, which were enriched and extended during iterative rounds of annotation. Annotators completed two hands-on training sessions with a separate set of 10 timelines, where they were able to ask questions and discuss opinions to address cases of disagreement. Following the initial training phase, we performed spot checks to provide feedback and answer any questions while they labelled the full dataset (n=500 timelines). Annotators were encouraged to take breaks whenever needed, due to the nature of the content. On average, each annotator spent about five minutes annotating a single timeline.
The annotation of MoC is akin to the assessment of anomaly detection methods, since MoC (Switches and Escalations) are rare, with the majority of posts not being annotated (label 'None'). Measuring agreement in such settings is therefore complex, as established metrics such as Krippendorff's Alpha and Fleiss' Kappa would generally yield a low score. This is due to the unrealistically high expected chance agreement (Feinstein and Cicchetti, 1990), which is not offset by the fact that annotators do agree on the majority of the annotations (especially on the 'None' class). For this reason, we use per-label positive agreement as the main indicator, computed as the ratio of the number of universally agreed-upon instances (the intersection of posts associated with that label) over the total number of instances (the union of posts associated with that label). As highlighted in Table 1, while perfect agreement for 'None' is at 69%, perfect agreement on Escalations and Switches is at 19% and 8%, respectively. However, if instead of perfect agreement we consider majority agreement (where two out of three annotators agree), these numbers drastically increase (30% for Switches and 50% for Escalations). Moreover, by examining the systematic annotation preferences of our annotators, we have observed that the native speaker marked almost double the number of Switches compared to the other two annotators, in particular by spotting very subtle cases of mood change. We have thus decided to generate a gold standard based on majority decisions, comprising only cases where at least two out of three annotators agree on the presence of a MoC. The rare cases of complete disagreement have been labelled as 'None'. We thus have 2,018 Escalations and 885 Switches out of an overall 18,702 posts (see Fig. 2(b) for the associated lengths in #posts). In future work we plan to consider aggregation methods based on all annotations or approaches for learning from multiple noisy annotations (Paun and Simpson, 2021).
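For illustration, the snippet below computes the per-label positive agreement described above as an intersection-over-union of post sets across annotators, together with the majority-agreement variant. The data structures are hypothetical and the snippet is not the authors' code.

```python
def positive_agreement(annotations, label, min_agree=None):
    """annotations: list of dicts, one per annotator, mapping post_id -> label.
    Returns |posts all (or >= min_agree) annotators gave `label`| / |posts any annotator gave `label`|."""
    sets = [{p for p, l in ann.items() if l == label} for ann in annotations]
    union = set.union(*sets)
    if not union:
        return 1.0
    if min_agree is None:                       # perfect agreement: intersection over union
        agreed = set.intersection(*sets)
    else:                                       # e.g. min_agree=2 for majority agreement
        agreed = {p for p in union
                  if sum(p in s for s in sets) >= min_agree}
    return len(agreed) / len(union)

# toy usage with three annotators over five posts
a1 = {1: "IS", 2: "O", 3: "IE", 4: "O",  5: "IE"}
a2 = {1: "IS", 2: "O", 3: "IE", 4: "IE", 5: "O"}
a3 = {1: "O",  2: "O", 3: "IE", 4: "IE", 5: "O"}
print(positive_agreement([a1, a2, a3], "IE"))               # perfect agreement
print(positive_agreement([a1, a2, a3], "IE", min_agree=2))  # majority agreement
```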
Our aim is to detect and characterise the types of MoC based on a user's posting activity. We therefore treat this problem as a supervised classification task (both at the post level and in a sequential/timeline-sensitive manner, as presented in §4.2) rather than an unsupervised task, even though we also consider baselines with unsupervised components (FSD, SCD in §4.2). Contrary to traditional sentence- or document-level NLP tasks, we incorporate timeline-sensitive evaluation metrics that account for the sequential nature of our model predictions (§4.1). Given a user's timeline, the aim is to classify each post within it as belonging to a "Switch" (IS), an "Escalation" (IE), or "None" (O). At this point we do not distinguish between beginnings of Switches/peaks of Escalations and other posts in the respective ranges. While the task is sequential by definition, we train both models operating at the post level in isolation and sequential models at the timeline level (i.e., accounting for a user's posts over time), as detailed in §4.2. We contrast model performance using common post-level classification metrics as well as novel timeline-level evaluation approaches (§4.1). This allows us to investigate the impact of (a) accounting for severe class imbalance and (b) longitudinal modelling. We have randomly divided the annotated dataset into five folds (each containing posts from 100 timelines) to allow reporting results on all of the data through cross-validation.

Post-level
We first assess model performance on the basis of standard evaluation metrics at the post level (Precision, Recall, F1 score). These are obtained per class and macro-averaged, to better emphasise performance on the two minority class labels (IS & IE). However, post-level metrics are unable to show: (a) the expected accuracy at the timeline level (see example in Fig. 4) and (b) model suitability in predicting regions of change. These aspects are particularly important since we aim to build models capturing MoC over time.

Timeline-level
Our first set of timeline-level evaluation metrics is inspired by work in change-point detection (van den Burg and Williams, 2020) and mirrors the post-level ones, albeit operating on a window and timeline basis. Specifically, working on each timeline and label type independently, we calculate Recall and Precision within a window of $w$ posts:

$R_w^{(l)} = \frac{TP_w}{|GS^{(l)}|}, \qquad P_w^{(l)} = \frac{TP_w}{|M^{(l)}|},$

where $TP_w$ denotes the true positives that fall within a range of $w$ posts and $M^{(l)}$/$GS^{(l)}$ are the predicted/actual labels for $l$, respectively. Note that each prediction can only be counted once as "correct". $R_w^{(l)}$ and $P_w^{(l)}$ are calculated per timeline and macro-averaged.

The second set of our timeline-level evaluation metrics is adapted from the field of image segmentation (Arbelaez et al., 2010). Here we aim at evaluating model performance based on its ability to capture regions of change (e.g., in Fig. 4, 'GS' shows a timeline with three such regions of Escalations and two of Switches). For each such true region we find the predicted region with the highest overlap (intersection over union) and weight the overlaps by region length, yielding a recall-oriented coverage C_r; swapping the roles of true and predicted regions yields a precision-oriented coverage C_p. The coverage metrics are calculated on a per-timeline basis and macro-averaged, similarly to $R_w^{(l)}$ and $P_w^{(l)}$.

Figure 4: A timeline's gold standard (GS) and the predictions of two models (M1, M2). Although M2 provides a more faithful 'reconstruction' of the user's mood over time (the predictions are identical but shifted slightly in time), all post-level evaluation metrics for M1 are greater than or equal to those obtained by M2 for the two minority classes (IE and IS).

Using a set of evaluation metrics, each capturing a different aspect of the task, ensures that model performance is assessed from many different angles.
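The sketch below shows one way the window-based recall/precision and the coverage metrics could be computed for a single timeline. The matching and averaging conventions (greedy one-to-one matching within a window of w posts, length-weighted best overlap for coverage) are assumptions for illustration and may differ in detail from the authors' implementation.

```python
def windowed_pr(gold, pred, label, w):
    """Window-based recall/precision for one timeline and one label.
    gold, pred: lists of labels (e.g. 'IS', 'IE', 'O'), one per post."""
    gold_idx = [i for i, l in enumerate(gold) if l == label]
    pred_idx = [i for i, l in enumerate(pred) if l == label]
    matched_gold, tp = set(), 0
    for p in pred_idx:                      # each prediction counts at most once
        hit = next((g for g in gold_idx
                    if abs(g - p) <= w and g not in matched_gold), None)
        if hit is not None:
            matched_gold.add(hit)
            tp += 1
    recall = tp / len(gold_idx) if gold_idx else 1.0
    precision = tp / len(pred_idx) if pred_idx else 1.0
    return recall, precision

def coverage(regions_a, regions_b):
    """Length-weighted best-overlap coverage of regions_a by regions_b.
    Regions are (start, end) post indices, end inclusive."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union
    total = sum(e - s + 1 for s, e in regions_a)
    if total == 0:
        return 1.0
    return sum((e - s + 1) * max((iou((s, e), r) for r in regions_b), default=0.0)
               for s, e in regions_a) / total

# usage: C_r = coverage(gold_regions, pred_regions); C_p = coverage(pred_regions, gold_regions)
```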
We have considered different approaches to addressing our task:
(i) Naïve methods, specifically a Majority classifier (always predicting "None") and a "Random" predictor, picking a label based on the overall label distribution in the dataset. It has been shown that comparisons against such simple baselines are essential to assess performance in computational approaches to mental health (Tsakalidis et al., 2018).
(ii) Post-level classifiers, such as a Random Forest (RF) and BERT fine-tuned either with the standard cross-entropy loss (BERT(ce)) or with the focal loss (BERT(f)) (Lin et al., 2017), which is more appropriate for imbalanced datasets.
(iii) Emotion Classification. We used DeepMoji (EM-DM) (Felbo et al., 2017) and Twitter-roBERTa-base (EM-TR) from TweetEval (Barbieri et al., 2020), operating at the post level, to generate softmax probabilities for each emotion (64 for EM-DM, 4 for EM-TR). These provide meta-features to a BiLSTM to obtain timeline-sensitive models for identifying MoC.
(iv) First Story Detection (FSD). We have used two common approaches for comparing a post to the n previous ones: representing the previous posts as (a) a single centroid or (b) the nearest neighbour to the current post among them (Allan et al., 1998; Petrović et al., 2010). In both cases, we calculate the cosine similarity of the current and previous posts. The scores are then fed into a BiLSTM as meta-features for a sequential model. Results are reported for the best method only.
(v) Semantic Change Detection (SCD). Instead of the standard task of comparing word representations in consecutive time windows, we consider a user being represented via their posts at particular points in time. We follow two approaches. The first is an Orthogonal Procrustes approach (Schönemann, 1966) operating on post vectors (SCD-OP). Our aim here is to find the optimal transformation across consecutive representations, with higher errors being indicative of a change in the user's behaviour. In the second approach (SCD-FP) a BiLSTM is trained on the user's k previous posts in order to predict the next one (Tsakalidis and Liakata, 2020). Errors in prediction are taken to signal changes in the user. In both cases, we calculate the dimension-wise difference between the actual and the transformed/predicted representations (post vectors) and use this as a meta-feature to a BiLSTM to obtain a time-sensitive model.
(vi) Timeline-sensitive. From our (ii) post-level classifiers, BERT(f) tackles the problem of imbalanced data but fails to model the task in a longitudinal manner. To remedy this, we employ BiLSTM-bert, which treats a timeline as a sequence of posts to be modelled, each being represented via the [CLS] representation of BERT(f). To convert the post-level scores/representations from (iii)-(v) above into time-sensitive models we use the same BiLSTM as in (vi), operating at the timeline level. Details for each model and associated hyperparameters are in the Appendix.

Table 2 summarises the results of all models; Fig. 5 further shows the P_w/R_w metrics for IE/IS for the best-performing models. BiLSTM-bert outperforms all competing models in terms of post-level macro-F1. It provides an 8.6% relative improvement (14% for the IS/IE labels) over the second-best performing model (BERT(f)). Furthermore, it achieves a good balance between precision- and recall-oriented timeline-level metrics, being consistently the second-best performing model on these. This performance is largely attributed to two factors, which are studied further below: (a) the use of the focal loss on BERT, generating [CLS] representations that are much more focused on the minority classes (IE/IS), and (b) its longitudinal aspect.

Post-level models
The BERT variants perform better than the rest on all metrics. Their coverage metrics, though, suggest that while they manage to predict regions better than most timeline-level methods (i.e., high C_r), they tend to predict more regions than needed (i.e., low C_p), partially due to their lack of contextual (temporal) information. Finally, as expected, BERT(f) achieves much higher recall for the minority classes (IE/IS), in exchange for a drop in precision compared to BERT(ce) and in recall for the majority class (O).

Models from Related Tasks
EM-DM achieves very high precision (P, P_w) for the minority classes, showing a clear link between the tasks of emotion recognition and detecting changes in a user's mood; indeed, emotionally informed models have been successfully applied to post-level classification tasks in mental health (Sawhney et al., 2020a). However, both EM models achieve low recall (R, R_w) for IE/IS compared to the rest. For the SCD-inspired models, SCD-FP outperforms SCD-OP on most metrics. This is largely due to the fact that the former uses the previous k=3 posts to predict the next post in a user's timeline (instead of aligning it based on the previous post only). Thus SCD-FP benefits from its longitudinal component, a finding consistent with work in semantic change detection (Tsakalidis and Liakata, 2020).

While BiLSTM-bert yields the highest macro-F1 and the most robust performance across all metrics, it is not clear which of its components contributes the most to our task.
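To make the timeline-sensitive setup of (vi) concrete, the sketch below assembles a BiLSTM that takes one pre-computed representation per post (e.g., the 768-dimensional [CLS] vector of a fine-tuned BERT) and predicts a label per post. It is a simplified, single-layer illustration; the layer sizes, padding scheme and training settings are assumptions (the hyperparameter ranges actually searched are listed in the Appendix).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_POSTS, DIM, N_CLASSES = 124, 768, 3   # longest timeline, [CLS] size, {O, IE, IS}

def build_timeline_bilstm():
    """Timeline-level model: a BiLSTM over one fixed representation per post,
    predicting a label per post."""
    inputs = layers.Input(shape=(MAX_POSTS, DIM))
    x = layers.Masking(mask_value=0.0)(inputs)          # timelines are zero-padded
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy")
    return model

# usage with pre-computed post representations X: (n_timelines, MAX_POSTS, DIM)
# and per-post labels y: (n_timelines, MAX_POSTS) with 0=O, 1=IE, 2=IS
model = build_timeline_bilstm()
X = np.random.randn(4, MAX_POSTS, DIM).astype("float32")
y = np.random.randint(0, N_CLASSES, size=(4, MAX_POSTS))
model.fit(X, y, batch_size=2, epochs=1, verbose=0)
```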
To disentangle these contributions, we compare against the exact same BiLSTM, albeit fed with different input types: (a) average word embeddings, as in BiLSTM-we, (b) Sentence-BERT representations (Reimers and Gurevych, 2019) and (c) fine-tuned representations from BERT(ce). As shown in Table 3, fine-tuning with BERT(ce) outperforms Sentence-BERT representations. While the contextual nature of all of the BERT-based models offers a clear improvement over the static word embeddings, it becomes evident that the use of the focal loss when training the initial BERT(f) is vital, offering a relative improvement of 6% in post-level macro-F1 (13.7% for IS/IE). Calibrating the parameters of the focal loss could provide further improvements for our task in the future (Mukhoti et al., 2020).

The importance of longitudinal modelling is shown via the difference between the BERT and BiLSTM variants when operating on single posts vs. at the timeline level (e.g., see the post-level results of BERT(ce)/Word emb. in Table 3 vs. BERT(ce)/BiLSTM-we in Table 2, respectively). We further examine the role of longitudinal modelling in the rest of our best-performing models from Table 2. In particular, we replace the timeline-level BiLSTM in EM-DM and SCD-FP with a two-layer feed-forward network, operating on post-level input representations, i.e., treating each post in isolation. The differences across all pairwise combinations with and without the longitudinal component are shown in Fig. 6. Timeline-level models achieve much higher precision (6.1%/6.9%/11.1% for P/P_1/C_p, respectively) in return for a small sacrifice in the timeline-level recall-oriented metrics (-2.8%/1.9%/2.3% for R/R_1/C_r), further highlighting the longitudinal nature of the task.

Here we analyse the cases of Switches/Escalations identified or missed by our best-performing model (BiLSTM-bert). Switches (IS) are the most challenging to identify, largely due to being the smallest class with the lowest inter-annotator agreement. However, the EM-based models achieve high levels of precision on Switches, even during post-level evaluation (see Table 2). We therefore employ EM-TR (Barbieri et al., 2020), assigning probability scores for anger/joy/optimism/sadness to each post, and use them to characterise the predictions made by BiLSTM-bert. Fig. 7 and Table 4 show that our model predicts a 'Switch' more often (in most cases, correctly) when the associated posts express positive emotions (joy/optimism), but misses the vast majority of cases when these emotions are absent. The reason for this is that TalkLife users discuss issues around their well-being, with a negative mood prevailing. Therefore, BiLSTM-bert learns that the negative tone forms the users' baseline and that deviations from this constitute cases of 'Switches' (see example in Table 5). We plan to address this in the future by incorporating transfer learning approaches into our model (Ruder et al., 2019). Escalations (IE) are better captured by our models. Here we examine more closely the cases of 'Peaks' in the escalations (i.e., the posts indicating the most negative/positive state of the user within an escalation; see §3.3). As expected, the post-level recall of BiLSTM-bert in these cases is much higher than its recall for the rest of the IE cases (.557 vs. .408). In Fig. 8 we analyse the recall of our model in capturing posts denoting escalations, in relation to the length of escalations. We can see that our model is more effective in capturing longer escalations.
As opposed to the Switch class, we found no important differences in the expressed emotion between TP and FN cases. By carefully examining the cases of Peaks in isolation, we found that the majority of them express very negative emotions, very often including indications of self-harm. A Logistic Regression trained on bigrams at the post level to distinguish between identified vs. missed cases of Peaks showed that the most positively correlated features for the identified cases were directly linked to self-harm (e.g., "kill myself", "to die", "kill me"). However, this was not necessarily the case with missed cases. Nevertheless, there were several cases of self-harm ideation that were missed by BiLSTM-bert, as well as misses due to the model "ignoring" the user's baseline, as is the case with Switches (see Table 6). Transfer learning and domain adaptation strategies, as well as self-harm detection models operating at the post level, could help in mitigating this problem.

We present a novel longitudinal dataset and associated models for personalised monitoring of a user's well-being over time based on linguistic online content. Our dataset contains annotations for: (a) sudden shifts in a user's mood (switches) and (b) gradual mood progression (escalations). The proposed methods are inspired by state-of-the-art contextual models and longitudinal NLP tasks. Importantly, we have introduced temporally sensitive evaluation metrics, adapted from the fields of change-point detection and image segmentation. Our results highlight the importance of considering the temporal aspect of the task and the rarity of mood changes. Future work could follow four main directions: (a) integrating longitudinal models for detecting changes with post-level models for emotion and self-harm detection (see §5.2); (b) incorporating transfer learning methods (Ruder et al., 2019) to adapt more effectively to unseen users' timelines; (c) adjusting our models to learn from multiple (noisy) annotators (Paun and Simpson, 2021); and (d) calibrating the parameters of the focal loss and testing other loss functions suited to heavily imbalanced classification tasks (Jadon, 2020).

Ethics
Institutional review board (IRB) approval was obtained from the corresponding ethics board of the University of Warwick prior to engaging in this research study. Our work involves ethical considerations around the analysis of user-generated content shared on a peer support network (TalkLife). A license was obtained to work with the user data from TalkLife and a project proposal was submitted to them in order to embark on the project. The current paper focuses on the identification of moments of change (MoC) on the basis of content shared by individuals. These changes involve recognising sudden shifts in mood (switches) or gradual mood progression (escalations). Annotators were given contracts and paid fairly in line with University payscales. They were alerted about potentially encountering disturbing content and were advised to take breaks. The annotations are used to train and evaluate natural language processing models for recognising moments of change as described in our detailed guidelines. Working with datasets such as TalkLife and data on online platforms where individuals disclose personal information involves ethical considerations (Mao et al., 2011; Keküllüoglu et al., 2020). Such considerations include careful analysis and data sharing policies to protect sensitive personal information.
The data has been de-identified both at the time of sharing by TalkLife and by the research team, to make sure that no user handles or names are visible. Any examples used in the paper are either paraphrased or artificial. Potential risks from the application of our work in being able to identify moments of change in individuals' timelines are akin to those in earlier work on personal event identification from social media and the detection of suicidal ideation. Potential mitigation strategies include restricting access to the code base and annotation labels used for evaluation.

Limitations
Our work in this paper considers moments of change as changes in an individual's mood, judged on the basis of their self-disclosure of their well-being. This is subject to two limiting factors: (a) users may not be self-disclosing important aspects of their daily lives and (b) other types of changes related to their mental health (other than their mood/emotions, such as important life events, symptoms, etc.) may be taking place. Though our models could be tested in cases of non-self-disclosure (given the appropriate ground truth labels), the analysis and results presented in this work should not be used to infer any conclusions about such cases. The same also holds for other types of 'moments of change' mentioned in §2 (e.g., transition to suicidal thoughts), as well as other types of changes, such as changes in an individual in terms of discussing more about the future, studied in Althoff et al. (2016), or changes in their self-focus (Pyszczynski and Greenberg, 1987) over time, which we do not examine in this current work.

References
Topic detection and tracking pilot study final report
Large-scale analysis of counseling conversations: An application of natural language processing to mental health
Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence
TweetEval: Unified benchmark and comparative evaluation for tweet classification
Bergin and Garfield's handbook of psychotherapy and behavior change
Multitask learning for mental health conditions with limited social media data
Random forests. Machine Learning
Latent suicide risk detection on microblog via suicide-oriented word embeddings and layered attention
Methods in predictive techniques for mental health status on social media: a critical review
SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions
Quantifying mental health signals in Twitter
Predicting depression via social media
Discovering shifts to suicidal ideation from mental health content in social media
BERT: Pre-training of deep bidirectional transformers for language understanding
High agreement but low kappa: I. The problems of two paradoxes
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
Variability in language used on social media prior to hospital visits
Diachronic word embeddings reveal statistical laws of semantic change
Bidirectional LSTM-CRF models for sequence tagging
Deep learning for depression detection of Twitter users
A survey of loss functions for semantic segmentation
Detection of mental health from Reddit via deep contextualized representations
Analysing privacy leakage of life events on Twitter
Adam: A method for stochastic optimization
Detecting and explaining crisis
Focal loss for dense object detection
Overview of eRisk at CLEF 2020: Early risk prediction on the internet
Silja Vocks, Dietmar Schulte, and Armita Tschitsaz-Stucki. 2013. The ups and downs of psychotherapy: Sudden gains and sudden losses identified with session reports
CLPsych 2018 shared task: Predicting current and future psychological health from childhood essays
Loose tweets: An analysis of privacy leaks on Twitter
SNAP-BATNET: Cascading author profiling and social network graphs for suicide ideation detection on social media
Calibrating deep neural networks using focal loss
State of the field of mental health apps
Social data: Biases, methodological pitfalls, and ethical boundaries
Overview of eRisk at CLEF 2021: Early risk prediction on the internet
Aggregating and learning from multiple annotators
Streaming first story detection with application to Twitter
Moments of change: Analyzing peer-based cognitive support in online mental health forums
Self-regulatory perseveration and the depressive self-focusing style: A self-awareness theory of reactive depression
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Transfer learning in natural language processing
Causal factors of effective psychosocial outcomes in online mental health communities
Deep attentive learning for stock movement prediction from social media text and company correlations
PHASE: Learning emotional phase-aware representations for suicide ideation detection on social media
A time-aware transformer based model for suicide ideation detection on social media
A generalized solution of the orthogonal Procrustes problem
A meta-analysis of sudden gains in psychotherapy: Outcome and moderators
A computational approach to understanding empathy expressed in text-based mental health support
A prioritization model for suicidality risk assessment
Sequential modelling of the evolution of word representations for semantic change detection
Can we assess mental health through social media and smart devices? Addressing bias in methodology and evaluation
An evaluation of change point detection algorithms
Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances
The effect of moderation on online mental health conversations
The WHO special initiative for mental health (2019-2023): Universal health coverage for mental health
Depression and self-harm risk assessment in online forums
Bayesian on-line change-point detection: Spatio-temporal point processes. Bachelor's thesis
CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts

Acknowledgements
This work was supported by a UKRI/EPSRC Turing AI Fellowship to Maria Liakata (grant EP/V030302/1) and the Alan Turing Institute (grant EP/N510129/1).
The authors would like to thank Dana Atzil-Slonim, Elena Kochkina, the anonymous reviewers and the meta-reviewer for their valuable feedback on our work, as well as the three annotators for their invaluable efforts in generating the longitudinal dataset.

Appendix: Model Hyperparameters
Here we provide details on the hyperparameters used by each of our models, presented in §4.2:
• RF: Number of trees: [50, 100, 250, 500].
• BiLSTM-we: Two hidden layers ([64, 128, 256] units), each followed by a drop-out layer (rate: [.25, .5, .75]) and a final dense layer for the prediction. Trained for 100 epochs (early stopping if no improvement over 5 consecutive epochs) using the Adam optimizer (lr: [0.001, 0.0001]), optimising the Cross-Entropy loss with batches of size [128, 256], limited to modelling the first 35 words of each post.
• BiLSTM-bert: Two hidden layers ([64, 128, 256] units).
• FSD: Same architecture as BiLSTM-bert. For the FSD part, we experimented with word embeddings (en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl) and representations from Sentence-BERT. We extract features either by considering the nearest neighbour or by considering the centroid, on the basis of the previous [1, 2, ..., 10] posts, as well as on the basis of the complete timeline preceding the current post (11 features overall). The two versions (nearest neighbour, centroid) were run independently of each other.
• SCD-OP & SCD-FP: We experimented with average post-level word embeddings and representations from Sentence-BERT (results are reported for the latter, as it performed better). For SCD-FP, we stacked two BiLSTM layers (128 units each), each followed by a dropout (rate: 0.25), and a final dense layer for the prediction, with its size being the same as the desired output size (300 for the case of word embeddings, 768 for Sentence-BERT). We train in batches of 64, optimising the cosine similarity via the Adam optimizer with a learning rate of .0001, and employing an early stopping criterion (5 epochs patience). The final model (i.e., after the SCD part) follows the exact same specifications as BiLSTM-bert, operating on the outputs from the SCD components.
• BERT(ce) & BERT(f): We used BERT-base (uncased) as our base model and added a Dropout layer (rate: .25) operating on top of the [CLS] output, followed by a linear layer for the class prediction. We trained our models for 3 epochs using Adam (learning rate: [1e-5, 3e-5]) and performed five runs with different random seeds (0, 1, 12, 123, 1234). Batch sizes of 8 are used for the train/dev/test sets. For the alpha-weighted focal loss in BERT(f), we used gamma = 2 and alpha_t = 1/p_t, where p_t is the probability of class t in our training data. Results reported in the paper (as well as the results for BiLSTM-bert) are averaged across the five runs with the different random seeds.

We trained each model on five folds and selected the best-performing combination of hyperparameters on the basis of macro-F1 on a dev set (33% of the training data) for each test fold. The code for the experiments is written in Python 3.8 and relies on the following libraries: keras (2.7.0), numpy (1.19.5), pandas (1.2.3), scikit-learn (1.0.1), sentence_transformers (1.1.0), spacy (3.0.5), tensorflow (2.5.0), torch (1.8.1), transformers (4.5.1). All experiments were conducted on virtual machines (VMs) deployed on the cloud computing platform Microsoft Azure.
We have used two different VMs in our work:
• the experiments that involved the use of BERT were run on a Standard NC12_Promo, with 12 CPUs, 112 GiB of RAM and 2 GPUs;
• all other experiments were run on a Standard F16s_v2, with 16 CPUs and 32 GiB of RAM.
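For reference, the snippet below is a minimal sketch of the alpha-weighted focal loss described above for BERT(f), with gamma = 2 and alpha_t = 1/p_t. It is an illustrative PyTorch version rather than the exact implementation used in the experiments, and the toy class distribution is assumed.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, class_freqs, gamma=2.0):
    """Alpha-weighted focal loss (Lin et al., 2017) with alpha_t = 1/p_t,
    where p_t is the relative frequency of class t in the training data.
    logits: (N, C) raw scores; targets: (N,) class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-prob of the true class
    pt = log_pt.exp()
    alpha_t = 1.0 / class_freqs[targets]                            # up-weight rare classes
    loss = -alpha_t * (1.0 - pt) ** gamma * log_pt
    return loss.mean()

# toy usage: three classes (O, IE, IS) with the O class dominating
logits = torch.randn(8, 3)
targets = torch.tensor([0, 0, 0, 0, 0, 1, 2, 1])
class_freqs = torch.tensor([0.85, 0.10, 0.05])   # assumed training-set label distribution
print(focal_loss(logits, targets, class_freqs))
```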