key: cord-0225829-6retgml9
authors: Ahne, Adrian; Khetan, Vivek; Tannier, Xavier; Rizvi, Md Imbessat Hassan; Czernichow, Thomas; Orchard, Francisco; Bour, Charline; Fano, Andrew; Fagherazzi, Guy
title: Identifying causal relations in tweets using deep learning: Use case on diabetes-related tweets from 2017-2021
date: 2021-11-01
sha: e0104ea94a45880fc6a22310491a7f089ba2e129
doc_id: 225829
cord_uid: 6retgml9

Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect associations in patient-reported, diabetes-related tweets and to provide a tool to better understand the opinions, feelings and observations shared within the diabetes online community from a causality perspective.

Materials and Methods: More than 30 million diabetes-related tweets in English were collected between April 2017 and January 2021. Deep learning and natural language processing methods were applied to focus on tweets with personal and emotional content. A cause-effect-tweet dataset was manually labeled and used to train 1) a fine-tuned BERTweet model to detect causal sentences containing a causal association and 2) a CRF model with BERT-based features to extract possible cause-effect associations. Causes and effects were clustered in a semi-supervised approach and visualised in an interactive cause-effect network.

Results: Causal sentences were detected with a recall of 68% in an imbalanced dataset. A CRF model with BERT-based features outperformed a fine-tuned BERT model for cause-effect detection with a macro recall of 68%. This led to 96,676 sentences with cause-effect associations. "Diabetes" was identified as the central cluster, followed by "Death" and "Insulin". Insulin pricing-related causes were frequently associated with "Death".

Conclusions: A novel methodology was developed to detect causal sentences and to identify both explicit and implicit, single- and multi-word causes and corresponding effects as expressed in diabetes-related tweets, leveraging BERT-based architectures and visualised as a cause-effect network. Extracting causal associations on real-life, patient-reported outcomes in social media data provides a useful complementary source of information in diabetes research.

Diabetes distress (DD) refers to psychological factors such as emotional burden, worries, frustration or stress in the day-to-day management of diabetes. 1-3 Diabetes distress is associated with poorer quality of life, 4 higher A1C levels 5,6 and lower medication adherence. 7 Reducing diabetes distress may improve hemoglobin A1c and reduce the burden of disease among people with diabetes. 8

Social media is a useful observational resource for patient-reported diabetes issues and, given the active online diabetes community, could contribute directly to public and clinical decision making from a patient's perspective. 9,10 Identifying causal relations in text shared on social media might help to discover unknown etiological findings, specifically causes of health problems, concerns and symptoms. To intervene and potentially prevent diabetes distress, it is necessary to understand its causes from a patient's perspective, i.e. how patients see their disease. Causal relation extraction from natural language text has gained popularity in clinical decision-making, biomedical knowledge discovery and emergency management. 11
In particular, causal relations on Twitter have been examined for diverse factors causing stress and relaxation, 12 adverse drug reactions 13 or causal associations related to insomnia or headache. 14 In this paper, we aim to extract spans of text as two distinct events from diabetes-related tweets, such that one event directly or indirectly impacts the other. We categorized these events as cause-event and effect-event depending on the expressed context of each tweet. This work is realised within the framework of the World Diabetes Distress Study (WDDS), which aims to analyze what is shared on social media worldwide to better understand what people with diabetes and diabetes distress are experiencing. 15,16

Most approaches examine explicit causality in text, 14,17,18 where cause and effect are explicitly stated, for instance through connective words (e.g. so, hence, because, lead to, since, if-then). 11,19 An example of an explicit cause-effect pair is "diabetes causes hypoglycemia". Implicit causality, in contrast, is more complicated to detect, as in "I reversed diabetes with lifestyle changes" with cause "lifestyle changes" and effect "reversed diabetes". Machine and deep learning models have also been applied to extract causal relations. They are able to capture implicit relations and provide better generalisation than rule-based approaches. 11,20,21,22 An interesting approach leveraging the transfer learning paradigm and addressing both explicit and implicit cause-effect extraction is provided by Khetan et al. 23 They fine-tuned pre-trained transformer-based BERT language models 24,25 to detect "Cause-Effect" relationships using publicly available datasets such as the adverse drug effect dataset. 26

In a similar spirit, the objective of the present work is to identify both explicit and implicit multi-word cause-effect relations in noisy, diabetes-related tweets, to aggregate the identified causes and effects into clusters, and ultimately to visualise these clusters in an interactive cause-effect network. Starting from diabetes-related tweets, we first preprocessed the tweets to focus only on personal, non-joke and emotional content; secondly, we identified tweets in which causal information (opinion, observation, etc.) is communicated, also referred to as causal tweets or causal sentences; in a third step, causes and their corresponding effects were extracted. Lastly, those cause-effect pairs were aggregated, described and visualised. The entire workflow is illustrated in Figure 1.

The dataset consists of diabetes-related tweets in English collected between April 2017 and January 2021 and is an extended version of the one used in earlier works. 9 All data collected in this study were publicly posted on Twitter. Therefore, according to the privacy policy of Twitter, users agree to have this information available to the general public. 27 In this work, we applied a preprocessing pipeline, similar to earlier works, 9 to focus on tweets with personal content and remove institutional tweets (organisations, advertisement, news, etc.), to identify and exclude jokes, and to filter tweets containing emotional elements in order to adjust the scope of the tweets towards diabetes distress. In addition, questions were removed. This led to 562,013 tweets containing personal, non-joke and emotional content. More details on the preprocessing pipeline are summarized in Multimedia Appendix 2. In order to identify causal tweets and cause-effect associations, 5,000 randomly chosen diabetes-related tweets were manually labeled.
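The preprocessing steps above (personal content only, jokes excluded, emotional content retained, questions removed) are described at a high level here and in Multimedia Appendix 2. As a rough illustration of how such a filtering cascade can be chained, the sketch below assumes three hypothetical, pre-trained tweet classifiers exposed as callables (clf_personal, clf_joke, clf_emotion) together with a crude question heuristic; these names and the heuristic are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the tweet-filtering cascade; the three classifiers are
# assumed to exist already (the paper's actual models are not reproduced here).

def looks_like_question(text: str) -> bool:
    # Crude heuristic: treat tweets ending with "?" as questions.
    return text.strip().endswith("?")

def keep_tweet(text: str, clf_personal, clf_joke, clf_emotion) -> bool:
    """Return True if a tweet passes all filters applied before causality analysis."""
    if not clf_personal(text):      # drop institutional content (news, ads, organisations)
        return False
    if clf_joke(text):              # drop jokes
        return False
    if not clf_emotion(text):       # keep only tweets with emotional content
        return False
    if looks_like_question(text):   # drop questions
        return False
    return True

# Example: filtered = [t for t in tweets if keep_tweet(t, clf_personal, clf_joke, clf_emotion)]
```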
We did not restrict ourselves to a specific area of diabetes-related causal relationships and potentially included all types. Labeling cause-effect pairs is a complex task. To verify the reliability of the labeling, two authors labeled 500 tweets independently and we calculated Cohen's kappa score, a statistical measure expressing the level of agreement between two annotators. 28 We obtained a score of 0.83, which is interpreted as almost perfect agreement according to Altman and Landis. 29,30 Disagreements were discussed between the two authors, and one author labeled an additional 4,500 tweets, resulting in 5,000 labeled tweets.

A first model was trained to predict whether a sentence contains a potential cause-effect association (causal sentence), and a second model extracted the specific cause and associated effect from the causal sentence. Thus, the first model acts as a barrier and filters out non-causal sentences. These sentences may contain a cause, an effect, or neither, but not both. To simplify the model training, we hypothesised that cause-effect pairs only occur within the same sentence, and we removed all sentences with fewer than 6 words due to a lack of context. For this reason we operated at the sentence rather than the tweet level. Additional challenges in our setting were that causes and effects could be multi-word entities and that the language used on Twitter is non-standard, with frequent slang and misspelled words.

The identification of causal sentences is a binary classification task, for which the pre-trained language model BERTweet was fine-tuned. 31 The small number of examples for each cause-effect pair, due to the fact that causes and effects could potentially be related to any concept in the diabetes domain, drove us to adopt an active learning approach to increase the training data. Active learning is a sample selection approach aiming to minimize the annotation cost while maximising the performance of ML-based models. 32 It has been widely applied to textual data. 33,34 The training data was increased over several iterations, as illustrated in Figure 3.

After having trained the causal sentence classifier to detect sentences with causal information, we identified the specific cause-effect pairs in the causal sentences. The identification of cause-effect pairs was cast as an event extraction, or named entity recognition, task, i.e. assigning the label cause or effect to a sequence of words. The manually labeled causes and effects were encoded in an IO tagging format based on the common BIO tagging format (Beginning, Inside, Outside) introduced by Ramshaw and Marcus. 35 Here, "I-C" denotes inside the cause and "I-E" inside the effect. These two tags were complemented by the outside tag "O", indicating that a word is neither cause nor effect.

The following results were obtained from 482,583 sentences, derived by splitting the 562,013 personal, emotional, non-joke tweets into sentences, excluding questions, and keeping only sentences with more than 5 words. Causal sentences were detected with a recall of 68% on the imbalanced dataset. For the extraction of cause-effect pairs, the CRF model with BERT-based features outperformed the fine-tuned BERT model, reaching a macro recall of 68% and yielding 96,676 sentences with cause-effect associations. The semi-supervised clustering led to 1,751 clusters. To remove noisy clusters resulting from potential misclassifications, only clusters with at least 10 cause/effect occurrences were considered for the following analyses, resulting in 763 clusters. Note that the order of documents might affect the results, as different clusters might have been created.
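To make the causal-sentence classification step described above more concrete, the following is a minimal sketch of fine-tuning BERTweet as a binary classifier with the Hugging Face transformers library. The checkpoint vinai/bertweet-base is the public BERTweet release; the toy examples, hyperparameters and training setup are illustrative assumptions rather than the configuration actually used in the paper.

```python
# Minimal sketch: fine-tuning BERTweet as a binary causal-sentence classifier.
# Toy data and hyperparameters are illustrative, not the paper's configuration.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base",
                                                           num_labels=2)

sentences = ["I reversed diabetes with lifestyle changes",     # causal (label 1)
             "Just had my yearly diabetes check-up today"]     # non-causal (label 0)
labels = [1, 0]

class CausalSentenceDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="causal-bertweet", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=CausalSentenceDataset(sentences, labels)).train()
```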
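The active-learning loop itself is only summarised in the text (Figure 3 is not reproduced here). The sketch below shows one common form of such a loop, in which the current classifier is retrained, the unlabeled sentences predicted most confidently as causal are sent for manual review, and the corrected examples are added back to the training set; the selection rule and the placeholder functions train_fn, score_fn and review_fn are assumptions, informed by the paper's remark that only positive samples were corrected.

```python
# Sketch of an active-learning loop for growing the causal-sentence training set.
# train_fn fine-tunes a classifier, score_fn returns the predicted probability of
# the "causal" class, review_fn asks a human annotator for the corrected label.
def active_learning_loop(labeled, unlabeled, train_fn, score_fn, review_fn,
                         n_iterations=5, batch_size=200):
    for _ in range(n_iterations):
        model = train_fn(labeled)                       # retrain on current labels
        ranked = sorted(unlabeled, key=lambda s: score_fn(model, s), reverse=True)
        batch, unlabeled = ranked[:batch_size], ranked[batch_size:]
        labeled = labeled + [(s, review_fn(s)) for s in batch]  # human-corrected labels
    return labeled
```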
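The IO tagging scheme can be illustrated with the implicit-causality example from the introduction ("I reversed diabetes with lifestyle changes", cause "lifestyle changes", effect "reversed diabetes"). The snippet below shows the encoding; whitespace tokenisation is a simplification of the actual tokeniser. In the paper, such tag sequences are then predicted by a CRF model using BERT-based token features.

```python
# IO encoding: "I-C" marks tokens inside the cause, "I-E" tokens inside the effect,
# and "O" everything else. Whitespace tokenisation is a simplification.
def io_encode(tokens, cause_tokens, effect_tokens):
    tags = []
    for tok in tokens:
        if tok in cause_tokens:
            tags.append("I-C")
        elif tok in effect_tokens:
            tags.append("I-E")
        else:
            tags.append("O")
    return tags

tokens = "I reversed diabetes with lifestyle changes".split()
print(list(zip(tokens, io_encode(tokens, {"lifestyle", "changes"}, {"reversed", "diabetes"}))))
# [('I', 'O'), ('reversed', 'I-E'), ('diabetes', 'I-E'), ('with', 'O'),
#  ('lifestyle', 'I-C'), ('changes', 'I-C')]
```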
Please refer to Multimedia Appendix 4 for an overview of the 100 largest clusters (automatically added clusters have "Other" as "Parent cluster"). Table 4 provides an overview of the largest clusters, containing either causes or effects, on the left side, and the most frequent cause-effect associations on the right side, excluding the largest cluster "Diabetes", which is studied separately. The cluster "Diabetes" is the largest one, with 66,775 occurrences of "Diabetes" as either cause or effect (e.g. #diabetes, diabetes). The most frequent cause-effect pair is "unable to afford insulin" causing "death", expressed in 1,246 cases, followed by "insulin" causing "death" with 1,156 cases and "type 1 diabetes" causing "fear" with 1,054 cases.

Table 4: The left column shows the most frequent clusters (causes and effects) with their number of occurrences. The last column shows the most frequent cause-effect relationships, excluding the cluster "Diabetes". *OGTT: Oral glucose tolerance test

The largest cluster "Diabetes" mainly occurs as a cause, and its most frequent effects ("Death", "fear", "sick") are visualised in Figure 5. Of the 30 most frequent effects for "Diabetes", 6 were related to "Nutrition", 5 to "Complications & comorbidities" and 3 to each of "Diabetes distress", "Emotions" and "Healthcare system". The interactive visualisation in D3 with filter options was published at https://observablehq.com/@adahne/cause-and-effect-associations-in-diabetes-related-tweets. We invite the interested reader to explore the graph to enhance understanding. Figure 6 provides an example of this visualisation, showing only cause-effect relationships with at least 250 occurrences to ensure readability. It is striking that "death" plays such a central role as an effect, with various causes ("unable to afford insulin", "rationing insulin", "finance", "insulin", "Type 1 diabetes (T1D)", "overweight") pointing to it. Other central nodes are "Type 1 diabetes", acting as a cause for "insulin pump", "insulin", "hypoglycemia (hypo)", "sickness", "finance" and the emotions "anger" and "fear", where the latter has the strongest association, and the node "Insulin", mostly relating as a cause to "sickness", "medication", "finance", "death", "hypoglycemia", "fear" and "anger".

Our findings suggest that it is feasible to extract both explicit and implicit causes and associated effects from diabetes-related Twitter data. We demonstrated that, by adopting the transfer learning paradigm and fine-tuning a pre-trained language model, we were able to detect causal sentences. Moreover, we have shown that simply fine-tuning a BERT-based model does not always outperform more traditional methods such as conditional random fields, as in the case of cause-effect pair detection. The precision, recall and F1 scores, given the challenging task and the imbalanced dataset, were satisfactory. The semi-supervised clustering and interactive visualisation enabled us to identify "Diabetes" as the largest cluster, acting mainly as the cause of "Death" and "fear". Besides, "Death" was identified as a central cluster acting as an effect of various causes related to insulin pricing, a link already detected in earlier works. 9 From a patients' perspective, we were able to show that a main fear is insulin pricing, expressed in the most frequent cause-effect relationships "unable to afford insulin" causing "death" and "rationing insulin" causing "death".
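To give a concrete picture of how a cause-effect network like the one in Figure 6 can be assembled from the clustered pairs, the sketch below counts (cause cluster, effect cluster) pairs and keeps only edges with at least 250 occurrences, the readability threshold mentioned above. The use of networkx is an assumption made for illustration (the published interactive graph is built with D3), and the example pairs reuse the counts reported in the results.

```python
# Build a weighted, directed cause-effect network and keep only frequent edges
# (>= 250 occurrences, the threshold used for the example graph in Figure 6).
from collections import Counter
import networkx as nx

def build_network(pairs, min_count=250):
    """pairs: iterable of (cause_cluster, effect_cluster) string tuples."""
    graph = nx.DiGraph()
    for (cause, effect), n in Counter(pairs).items():
        if n >= min_count:
            graph.add_edge(cause, effect, weight=n)
    return graph

# Example pairs mirroring counts reported in the results section.
example_pairs = ([("unable to afford insulin", "death")] * 1246 +
                 [("insulin", "death")] * 1156 +
                 [("type 1 diabetes", "fear")] * 1054)
print(build_network(example_pairs).edges(data=True))
```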
A related study extracted health-related concepts such as "stress", "insomnia" or "headache" as effects and identified their causes using manually crafted patterns and rules. 14 However, it focused only on explicit causality and excluded causes and effects encoded in hashtags as well as synonymous expressions. 14 In contrast, we tackled both explicit and implicit causality, included causes and effects expressed in hashtags, and exploited synonymous expressions through the use of word embeddings. Kayesh et al. proposed a novel technique based on neural networks which uses common-sense background knowledge to enhance the feature set, but they focused on the simpler case of explicit causality in tweets. 18 Bollegala et al. developed a causality-sensitive approach for detecting adverse drug reactions from social media using lexical patterns, consequently also targeting explicit causality. 39 Dasgupta et al. proposed one of the few deep learning approaches (few exist due to the unavailability of appropriate training data), leveraging a recursive neural network architecture to detect cause-effect relations from text, but also addressed only explicit causality. 40 A BERT-based approach tackling both explicit and implicit causality is provided by Khetan et al., who used existing labeled corpora not based on social media data. 23 Recently, they further extended their work on explicit and implicit causality understanding in single and multiple sentences, but in clinical notes. 41 To the best of our knowledge, this is the first paper investigating both explicit and implicit cause-effect relationships in diabetes-related Twitter data.

The present work demonstrates various strengths. First, by leveraging powerful language models we were able to identify a large number of tweets containing cause-effect relationships, which enabled us to detect cause-effect associations in 20% (96,676 / 482,583) of the sentences, in contrast to other approaches which identified causality in less than 2% of tweets. 14 Second, contrary to most previous work, we tackled both explicit and implicit causal relationships, an additional explanation for the higher number of cause-effect associations we obtained compared to other studies focusing only on explicit associations. 14 Third, relying fully on automatic machine learning algorithms meant we did not have to define manually crafted patterns to detect causal associations. Fourth, operating on social media data that is expressed spontaneously and in real time offers the opportunity to gain knowledge from an alternative data source, and in particular from a patients' perspective, which might complement traditional epidemiological data sources.

A strong limitation is that the cause-effect relations expressed in tweets cannot be used for causal inference, as the reliability of the Twitter data source is uncertain and the information shared may be opinion or observation. Another shortcoming is that the performance of our algorithms for detecting cause-effect pairs is not perfect, but the overall process and the vast amount of data minimize this issue: the lack of recall is counterbalanced by the sheer amount of data, and the lack of precision is counterbalanced by the clustering approach, in which infrequent causes or effects are discarded. 42 Labeling causes and effects in a dataset is a highly complicated task, and we would like to emphasize that mislabelings may occur in the dataset. Enhancing data quality is certainly a strong point to address to further improve performance.
The causal association structures learnt by the model from the training set might not generalise completely when applied to the large amount of Twitter data. In addition, the active learning strategy certainly added noise to the model, as only positive samples were corrected, which could be improved in future investigations. Moreover, we would like to highlight that the diabetes-related information shared on Twitter may not be representative of all people with diabetes. For instance, we observed a bigger cluster of causes/effects related to type 1 diabetes than to type 2 diabetes, which is contrary to real-world prevalence. 43 A potential explanation for this is the age distribution of Twitter users. 44 Nevertheless, due to the large number of tweets analyzed, a significant variability in the tweets could be observed.

In this work, we developed an innovative methodology to identify possible cause-effect relationships in diabetes-related tweets. This task was challenging because it addressed both explicit and implicit causality, multi-word entities, the fact that a word could be either cause or effect, the open domain of causes and effects, the biases occurring during the labeling of causality, and a relatively small dataset for such a complex task. We overcame these challenges by augmenting the small dataset via an active learning loop. The feasibility of our approach was demonstrated using modern BERT-based architectures in the preprocessing and causal sentence detection. A combination of BERT features and a CRF layer was leveraged to extract causes and effects in diabetes-related tweets, which were then aggregated into clusters in a semi-supervised approach. The visualisation of the cause-effect network based on Twitter data can deepen our understanding of diabetes by directly capturing patient-reported outcomes from a causal perspective. The fear of death due to the inability to afford insulin was among the main concerns expressed.

References
When Is Diabetes Distress Clinically Meaningful?: Establishing cut points for the Diabetes Distress Scale
Understanding the sources of diabetes distress in adults with type 1 diabetes
Emotional Regulation and Diabetes Distress in Adults With Type 1 and Type 2 Diabetes
The differential associations of depression and diabetes distress with quality of life domains in type 2 diabetes
Regimen-Related Distress, Medication Adherence, and Glycemic Control in Rural African American Women With Type 2 Diabetes Mellitus
Predicting diabetes distress in patients with Type 2 diabetes: a longitudinal study
Disease-related distress, self-care and clinical outcomes among low-income patients with diabetes
Systematic review and meta-analysis of psychological interventions in people with diabetes and elevated diabetes-distress
Insulin pricing and other major diabetes-related concerns in the USA: a study of 46 407 tweets between
The diabetes online community: The importance of forum use in parents of children with type 1 diabetes
A Survey on Extraction of Causal Relations from Natural Language Text. arXiv:2101.06426 [cs]
How Do You #relax When You're #stressed? A Content Analysis and Infodemiology Study of Stress-Related Tweets
Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts
Extracting health-related causality from twitter messages using natural language processing
Challenges and perspectives for the future of diabetes epidemiology in the era of digital health and artificial intelligence
Étude mondiale de la détresse liée au diabète : le potentiel du réseau social Twitter pour la recherche médicale
Extracting causal knowledge from a medical database using graphical patterns
On Event Causality Detection in Tweets. arXiv:1901.03526 [cs]
The Semantics of Relationships: An Interdisciplinary Perspective. Information Science and Knowledge Management
Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths
Relation Classification via Multi-Level Attention CNNs
Event-Related Features in Feedforward Neural Networks Contribute to Identifying Causal Relations in Discourse
Causal BERT: Language Models for Causality Detection Between Events Expressed in Text
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]
Attention Is All You Need. arXiv:1706.03762 [cs]
Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports
Twitter Privacy Policy
A Coefficient of Agreement for Nominal Scales
Practical Statistics for Medical Research
The measurement of observer agreement for categorical data
BERTweet: A pre-trained language model for English Tweets. arXiv:2005.10200 [cs]
Active Learning Literature Survey
Active Discriminative Text Representation Learning. arXiv:1606.04212 [cs]
Support vector machine active learning with applications to text classification
Text Chunking using Transformation-Based Learning. arXiv:cmp-lg/9505040
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
A Fast Implementation of Conditional Random Fields (CRFs)
Causality Patterns for Detecting Adverse Drug Reactions From Social Media: Text Mining Approach
Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks (SIGdial 2018)
MIMICause: Defining, identifying and predicting types of causal relationships between biomedical concepts from clinical notes. arXiv:2110.07090 [cs]
NLP-driven Data Journalism: Time-Aware Mining and Visualization of International Alliances
International Diabetes Federation. IDF Diabetes Atlas
Percentage of U.S. adults who use Twitter as of