key: cord-0173333-slb3bdkd
authors: Wührl, Amelie; Klinger, Roman
title: Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR)
date: 2022-04-21
journal: nan
DOI: nan
sha: d23fd0226c4355fc954aacfc7036d88b5d8a384e
doc_id: 173333
cord_uid: slb3bdkd

Text mining and information extraction for the medical domain have focused on scientific text generated by researchers. However, researchers' direct access to individual patient experiences or patient-doctor interactions can be limited. Information provided on social media, e.g., by patients and their relatives, complements the knowledge in scientific text. It reflects the patient's journey and their subjective perspective on the process of developing symptoms, being diagnosed and offered a treatment, being cured, or learning to live with a medical condition. The value of this type of data is therefore twofold: firstly, it offers direct access to people's perspectives; secondly, it might cover information that is not available elsewhere, including self-treatment or self-diagnoses. Named entity recognition and relation extraction are methods to structure information that is available in unstructured text. However, existing medical social media corpora have focused on a comparably small set of entities and relations and on particular domains, rather than putting the patient at the center of the analysis. With this paper we contribute a corpus with a rich set of annotation layers, following the motivation to uncover and model patients' journeys and experiences in more detail. We label 14 entity classes (incl. environmental factors, diagnostics, biochemical processes, patients' quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation classes (e.g., prevents, influences, interactions, causes), most of which have not been considered before for social media data. The publicly available dataset consists of 2,100 tweets with approx. 6,000 entity and 3,000 relation annotations. In a corpus analysis we find that over 80 % of documents contain relevant entities. Over 50 % of tweets express relations which we consider essential for uncovering patients' narratives about their journeys.

On social media, doctors, patients, concerned relatives, and other laypeople frequently discuss medical information. Twitter posts, for example, contain opinions and recommendations about treatments, recounts of medical experiences, or hypotheses and assumptions about medical issues, as in Figure 1. This information is by design centered around the patient. It is shaped by the patient's journey and their subjective perspective on processes like developing symptoms, being diagnosed and offered a treatment, being cured or learning to live with a disease. This data offers direct access to people's perspectives and covers information that is not available elsewhere, e.g., aspects that might not be considered important or that are difficult to assess in clinical settings. This includes, e.g., assessments of a patient's quality of life. In contrast, text mining for the medical domain has largely focused on scientific and biomedical text generated by researchers. Such texts seldom focus on individual patients' experiences or patient-doctor interactions, which makes the information and knowledge they contain distant by nature. While scientific resources contain high-quality information, many studies struggle with gender biases and population imbalance (Weber et al., 2021), which leads to blind spots in the literature.
Furthermore, the time-consuming nature of clinical studies causes delays until information becomes available to practitioners. Both limitations can be mitigated by accessing social media data. Duh et al. (2016) in fact find that social media can lead to earlier detection of adverse drug reactions. While social media data has come more into focus recently, existing corpora are limited with respect to the types of entities and relations they cover. Most commonly, biomedical entity corpora focus on diseases, symptoms, and drugs (Jimeno-Yepes et al., 2015; Alvaro et al., 2017, i.a.). With regard to relation detection, work on Twitter is limited to causal relations (Doan et al., 2019) or a very small number of relation classes (i.e., reason-to-use, outcome-negative, outcome-positive) (Alvaro et al., 2017). This leaves a gap with respect to medical information needs. As described above, content from social media holds this type of information. Extracting it is required if we want to uncover more fine-grained aspects of patients' medical journeys, complementary to the knowledge in scientific text.

To facilitate research in this area, we contribute a corpus of medical tweets annotated with a fine-grained set of medical entities and relations between them. For the BEAR Corpus of Biomedical Entities And Relations on Twitter, we annotate 14 entity and 20 relation classes. Entities include environmental factors, diagnostics, biochemical processes, quality-of-life assessments, and pathogens, as well as more established entity classes such as medical conditions and treatments. Relation classes model how entities prevent, influence, interact with, cause, or worsen other entities, or how they relate to each other as a symptom, side effect, or diagnosis. The dataset consists of 2,100 tweets with roughly 6,000 entities and 3,000 relations. To the best of our knowledge, the majority of those classes which are centered around patient journeys have not been considered before. The dataset is available at https://www.ims.uni-stuttgart.de/data/bioclaim.

Biomedical natural language processing (BioNLP) is an established field in computational linguistics, with a rich set of shared tasks including BioCreative and the competitions organized by the BioNLP workshop series (bio, 2021; Ben Abacha et al., 2021). Research topics include automatic information extraction from clinical reports, discharge summaries, or life science articles, e.g., in the form of entity recognition for diseases, proteins, drug and gene names (Habibi et al., 2017; Giorgi and Bader, 2018; Lee et al., 2019, i.a.). A task subsequent to entity recognition is relation extraction, which covers clinical relations (Uzuner et al., 2011; Wang and Fan, 2014; Sahu et al., 2016; Lin et al., 2019; Akkasi and Moens, 2021) or biomedical relations/interactions (e.g., drug-drug interactions) between entities (Lamurias et al., 2019; Sousa et al., 2021, i.a.). While scientific resources contain high-quality information, studies might not be fully representative regarding population groups or gender (Weber et al., 2021), which leads to blind spots in the literature: the general population can barely be captured in such studies. In addition, clinical studies or reports are time-consuming, which inevitably leads to delays, e.g., with regard to indications of adverse drug events. Both limitations can be mitigated by accessing social media data. Duh et al. (2016) in fact find that social media can lead to earlier detection of adverse drug reactions.
This is why biomedical NLP also works with social media texts and online content (Wegrzyn-Wolska et al., 2011; Yang et al., 2016; Sullivan et al., 2016, i.a.), including established shared tasks (Magge et al., 2021a). A major focus has been to inform pharmacovigilance by identifying and extracting mentions of adverse drug reactions (Nikfarjam et al., 2015; Cocos et al., 2017; Magge et al., 2021b). Additionally, the community has explored leveraging social media postings to monitor public health (Paul and Dredze, 2012; Choudhury et al., 2013; Sarker et al., 2016; Stefanidis et al., 2017) and to detect personal health mentions (Yin et al., 2015; Klein et al., 2017; Karisani and Agichtein, 2018). A few studies compare biomedical information in scientific documents with social media: Thorne and Klinger (2017) explore how disease names are referred to across both domains, while Seiffe et al. (2020) look into laypersons' medical vocabulary. A related task is entity normalization, which links a given mention of an entity to the respective concept in a formalized medical ontology. Limsopatham and Collier (2016) and later Basaldella et al. (2020) explore this task for medical entities on social media, showcasing the difficulties in mapping laypeople's health terminology to structured medical knowledge bases. The ongoing COVID-19 pandemic has sparked BioNLP research to leverage or contextualize information about the disease and virus from social media. A number of studies explore detecting COVID-19-related misinformation and fact-checking (Hossain et al., 2020; Chen and Hasan, 2021; Mattern et al., 2021; Saakyan et al., 2021, i.a.). Others have looked into monitoring information surrounding the virus using social media (Cornelius et al., 2020; Hu et al., 2020).

Early contributions on biomedical information extraction from Twitter aimed at the extraction of adverse drug reactions from social media, a fundamentally different use case from scientific text analytics. The goal is to provide access to information even before it becomes available to doctors or researchers. This work includes corpus creation efforts on dedicated platforms like AskAPatient (Karimi et al., 2015) and Twitter (Nikfarjam et al., 2015; Magge et al., 2021b). With a similar motivation, Jimeno-Yepes et al. (2015) created Micromed, a Twitter corpus annotated with disease names, drug names, and symptom mentions. Further, TwiMed (Alvaro et al., 2017) is a dataset which combines social media and scientific text with annotations of diseases, symptoms, and drug names to study drug reports across both sources. Annotated with the same entity classes, the MedRed dataset consists of Reddit posts (Scepanovic et al., 2020) labeled via crowdsourcing. In addition to identifying entities, there has also been some work on linking them to existing databases. To facilitate this task for social media, Limsopatham and Collier (2016) contribute a Twitter corpus in which entities are linked to the SIDER 4 (Kuhn et al., 2016) database of drug profiles. Basaldella et al. (2020) subsequently introduce COMETA, a Reddit corpus in which entities are linked to SNOMED CT. With regard to the groups of entities considered (phenotype, disease, anatomy, molecule (incl. drugs, toxins, nutrients, etc.), gene/DNA/RNA, device, procedure), this is similar to our contribution. Existing resources do not cover enough entities to extract patient narratives from social media.
They do not yet allow us to access the fine-grained information that social media content holds and that would allow us to fill the information gap in scientific text. Relation extraction contextualizes entities with each other. Medical relation extraction resources for social media are rare. Existing studies have focused on causal relations (Doan et al., 2019) or a small number of relation classes (i.e., reason-to-use, outcome-negative, outcome-positive) (Alvaro et al., 2017). With regard to scientific text, and specifically clinical relation extraction, closest to our annotation scheme are the approaches by Uzuner et al. (2011) and Wang and Fan (2014). The classes in both works describe relations between treatments and medical conditions, or between two treatments, medical conditions, or diagnoses (e.g., treatment caused medical problem, treatment improved or cured medical problem, test revealed medical problem in Uzuner et al. (2011), or treats, prevents, has symptom, contraindicates in Wang and Fan (2014)). However, both work with clinical and scientific texts. Medical relation extraction on social media is understudied and lacks resources that facilitate extracting patients' experiences and opinions towards entities of their medical history, which would allow us to recover their medical narratives.

We collect English tweets between January 1 and November 2, 2021 using the official keyword-based Twitter API. The list of keywords to retrieve the data stems from three different sources. Refer to Table 6 for examples from each source.
1. DrugBank: DrugBank is a database for drugs which provides molecular information about drugs, their mechanisms, interactions, and targets (Wishart et al., 2018). We use generic and brand/product names, which allows us to collect tweets discussing treatments or descriptions of off-label drug use.
2. MeSH: Medical Subject Headings is a controlled vocabulary thesaurus used for indexing articles in PubMed. We use terms from the subcategories disease and therapeutics to collect tweets that address specific diseases and therapeutic measures. We use all terms that appear with a frequency ≥ 1000 in PubMed articles, hypothesizing that the distribution of those terms mirrors their usage on Twitter.
3. Manual: MeSH and DrugBank mostly contain scientific terms (see Table 6), so we also query with a manually compiled list of medical terms. Partly, those relate to 10 medical conditions (COVID-19, Alzheimer's disease, borderline personality disorder, cancer, depression, irritable bowel syndrome, measles, multiple sclerosis, post-traumatic stress disorder, stroke). This is to collect tweets that either use Twitter-specific hashtags, abbreviations, or community-based terms related to a condition, or mention terms generally related to the medical domain.
All terms combined result in a list of 22,874 keywords. From this list, 10,599 terms return results from Twitter during a test crawl. We remove unproductive terms and use a final list of 7,358 keywords from DrugBank, 3,120 from MeSH, and 121 from the manually compiled list. (The lists we used to collect and filter the data are available in the supplementary material together with the corpus.) We acknowledge that with this approach we cannot sample tweets with incorrectly spelled mentions of drug or disease names. We only keep non-duplicate tweets (based on the tweet ID) which do not contain a URL, due to their increased probability of containing advertisements. Further, we only keep tweets which contain a relational term. Examples include words like treats, prescribed, or diagnosed (and variations thereof). From the resulting collection of tweets, we draw a sample balanced across the three keyword sources.
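As a minimal sketch of this filtering step: assuming tweets are available as dictionaries with id and text fields, and using an illustrative (not exhaustive) set of relational terms, the de-duplication, URL, and relational-term filters could look as follows.

```python
import re

# Illustrative subset only; the full list of relational terms used for
# filtering is larger (e.g., "treats", "prescribed", "diagnosed" and
# variations thereof).
RELATIONAL_TERMS = {
    "treat", "treats", "treated", "treating",
    "prescribe", "prescribed", "diagnose", "diagnosed", "diagnosis",
}

URL_PATTERN = re.compile(r"https?://\S+")


def keep_tweet(tweet: dict, seen_ids: set) -> bool:
    """Keep a tweet if it is not a duplicate (by tweet ID), contains no URL,
    and mentions at least one relational term."""
    if tweet["id"] in seen_ids:
        return False
    seen_ids.add(tweet["id"])
    if URL_PATTERN.search(tweet["text"]):
        return False
    tokens = re.findall(r"\w+", tweet["text"].lower())
    return any(token in RELATIONAL_TERMS for token in tokens)


def filter_tweets(tweets: list) -> list:
    """Apply the filters to a crawled collection of tweets."""
    seen_ids: set = set()
    return [tweet for tweet in tweets if keep_tweet(tweet, seen_ids)]
```

The balanced sample across the three keyword sources would then be drawn from the collection that passes these filters.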
We subsequently annotate 700 tweets per data source (350 per MeSH subcategory), which amounts to a total of 2,100 tweets.

We label entity and relation classes that allow us to include individual aspects of people's disease-treatment cycles. Classes cover information concerning developing symptoms, being diagnosed and offered a treatment, being cured, or learning to live with a medical condition. They allow us to model statements about how people self-diagnose or treat a particular condition themselves, and to capture how people perceive risk factors. For both annotation tasks, we therefore follow the central paradigm of labeling entities and relations the way the tweet's author intends or understands them. A mention like UV radiation could either be intended as an environmental factor (High UV radiation causes skin cancer.) or as a treatment (UV radiation will help with my low vitamin D levels.).

We label seven groups of entities. Each group contains a respective label or subset of labels which the annotators use to label the text. We visualize the entities in Figure 2 and depict which entity pairs can be related. Each entity group is briefly described in the following section. Table 1 additionally provides fully annotated examples from the dataset, to which we refer in the following descriptions. Medical Conditions: all mentions of diseases, symptoms, side effects, and medical events or descriptions thereof.

Each relation is directed and connects two entities (see Figure 2 for a depiction of which entities can be related and Table 1 for examples). We annotate the following entity pairs with relations (± indicates that a relation has a positive and a negative variant, e.g., (does not) treat):
treat → medC: ±treats, worsens, ±prevents, ±causes, contraindicates, prescribed, ±influences
medC → treat: side effect of
env/pathogen/biochem → medC: ±causes, ±influences, ±prevents
medC → medC/biochem: has symptom, ±causes, is similar to
treat → treat: ±interaction, is similar to
diag → medC/pathogen: ±diagnoses
pathogen → biochem: ±causes
medC/treat/env/diag → qol: ±causes, ±influences
general: type of, other

We measure the agreement between annotations by calculating the inter-annotator F1. Specifically, we treat one annotator's labels as the gold annotations and consider the other annotator's labels as predictions (Hripcsak and Rothschild, 2005). We report the agreement for varying levels of strictness. We consider entity span (S) and type (T) as follows:
S1T1: The two spans and the types of the entities are entirely identical.
S0T1: The two spans overlap by at least one token, and the entity type is identical.
S0T0: The two spans overlap by at least one token; the entity type is ignored in the comparison.
When evaluating the annotated relation (R) between two entities, we consider two modes:
R1: Relation type and direction are identical.
R0: Relation type and direction are ignored in the comparison.
On the entity level, comparing S1T1 to S0T1 shows to what extent the span of an entity influences the annotation task. Comparing S0T1 to S0T0 indicates the impact of assigning a label on the difficulty of the task.
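A minimal sketch of this entity-level agreement computation, assuming entities are represented as (start, end, type) with inclusive token offsets; this representation and the function names are illustrative and not part of the released corpus format.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start token, end token, entity type), inclusive offsets


def entities_match(gold: Entity, pred: Entity, mode: str) -> bool:
    """Decide whether two entity annotations agree under the S/T modes above."""
    overlap = gold[0] <= pred[1] and pred[0] <= gold[1]  # share at least one token
    if mode == "S1T1":
        return gold[0] == pred[0] and gold[1] == pred[1] and gold[2] == pred[2]
    if mode == "S0T1":
        return overlap and gold[2] == pred[2]
    if mode == "S0T0":
        return overlap
    raise ValueError(f"unknown mode: {mode}")


def inter_annotator_f1(gold: List[Entity], pred: List[Entity], mode: str) -> float:
    """Treat one annotator's labels as gold and the other's as predictions
    and compute F1 (Hripcsak and Rothschild, 2005)."""
    if not gold and not pred:
        return 1.0
    matched_gold = sum(any(entities_match(g, p, mode) for p in pred) for g in gold)
    matched_pred = sum(any(entities_match(g, p, mode) for g in gold) for p in pred)
    recall = matched_gold / len(gold) if gold else 0.0
    precision = matched_pred / len(pred) if pred else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Averaging such scores per class would give values in the spirit of the macro F1 numbers reported in the following sections.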
Analyzing the relation annotation follows the same objectives with respect to the entities, but adds the impact of the relation assignment. R1S1T1 is the strictest evaluation mode. The comparison to both R1S0T1 and R1S0T0 helps in understanding how the entity annotation influences the relation annotation task. R0S0T0 captures the most general level of agreement, indicating how well the annotators can identify the fact that any two entities are somehow related. Comparing this to R1S0T0, we can conclude how difficult it is to identify relation types (see the sketch below).

We work with two in-house annotators (A1, A2) to label the tweets with entities and relations. Both annotators are female, aged 20 to 25 and 25 to 30, respectively. Their backgrounds are in linguistics and computational linguistics. They have no medical training. We iteratively train the annotators over the course of three months. In each training iteration, all annotators label a small set of instances independently, following our annotation guidelines. Subsequently, we discuss each set within the group. In addition, we calculate the inter-annotator F1 for each round of training annotations (refer to Section 3.2.3 for an explanation of the evaluation metrics used) and adapt the guidelines with findings from the discussions and analysis to clarify the annotation tasks further. The training instances are not part of the final corpus. The final version of the guideline document is available in the supplementary material.

Table 2 shows the development of the inter-annotator F1 over the training iterations. For each round we report the macro F1 score across all entity/relation classes in the different evaluation settings. We find that the agreement increases for both the entity and the relation annotation over time. The agreement also increases as we allow less precise matches to be counted as true positive instances. By the end of the training period, annotators agreed with .53 F1 on exact entity types and boundaries (S1T1). Comparing the impact of each subtask in the last round, we observe that agreeing on the entity type is more challenging than identifying the entity span (decrease of .25 F1 between S0T0 and S0T1 vs. .02 F1 decrease between S0T1 and S1T1). Evaluating the relation type strictly (R1S0T0 vs. R0S0T0), the agreement drops by .07 F1, which indicates that the relation type is fairly ambiguous and therefore hard to agree upon. The strictest evaluation measures (S1T1, R1S1T1) show that the task remains challenging even after substantial annotator training, which we attribute to the diverse nature of the text in tweets. Presumably, this is also why the agreement fluctuates over the training rounds.

We provide an adjudicated version of the dataset which combines both annotators' results. In case of disagreement on entity spans between the annotators, we choose the longest overlapping sequence between the two instances. We further prefer more frequent entity and relation classes over less frequent ones, and choose more general concepts over more specific ones. Generally, our aggregation strategy is motivated by a high-recall approach to ensure that we lose as few of the nuances from the individual annotations as possible. We aggregate in two steps: we first align the entity annotations and then aggregate the relations. Please refer to Section 6 for more details. The annotators labeled the final corpus over the course of four months.
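Extending the entity-level sketch above, the following is a hedged sketch of how the relation-level modes (R1/R0) can be combined with the entity modes; representing a relation as (head entity, tail entity, type) is an assumption made for illustration only.

```python
from typing import Tuple

# Reuses Entity and entities_match from the entity-level sketch above.
Relation = Tuple[Entity, Entity, str]  # (head entity, tail entity, relation type)


def relations_match(gold: Relation, pred: Relation, rel_mode: str, ent_mode: str) -> bool:
    """Under R1, both argument entities must match in order (which fixes the
    direction) under the chosen entity mode (S1T1, S0T1, or S0T0), and the
    relation type must be identical. Under R0, type and direction are ignored:
    the argument entities only need to match in either order."""
    in_order = (entities_match(gold[0], pred[0], ent_mode)
                and entities_match(gold[1], pred[1], ent_mode))
    if rel_mode == "R1":
        return in_order and gold[2] == pred[2]
    if rel_mode == "R0":
        swapped = (entities_match(gold[0], pred[1], ent_mode)
                   and entities_match(gold[1], pred[0], ent_mode))
        return in_order or swapped
    raise ValueError(f"unknown mode: {rel_mode}")
```

Plugging relations_match into the same F1 computation as above yields agreement values in the spirit of the modes R1S1T1 through R0S0T0 discussed in the following.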
Since both sets of annotations provide unique perspectives on the data, we release the individual annotations along with an aggregated version. We evaluate the annotations using the inter-annotator F1 scores as described in Section 3.2.3 and provide scores for the full dataset as well as individual scores for each sampling method in the following. Figure 3 shows the inter-annotator F1 scores for each subsample of the corpus, evaluated with descending strictness. For the final corpus we find that the annotators are fairly synchronized in identifying entities in tweets (.67 F1, S0T0). Agreeing on the entity type is more challenging than identifying the same entity span (.07 F1 decrease between S0T1 and S1T1 vs. .23 F1 decrease between S0T0 and S0T1). This is also the case for the relation agreement. Labeling the relation type is by far the most difficult task. When we compare the agreement levels in R0S0T0 with R1S0T0, we report a difference of .13 F1, which showcases how ambiguous the relations are. We observe a slight decrease in agreement compared to the last training round. We attribute this to the fact that annotators continue to be faced with novel variations of entities and relations because of Twitter's diverse nature.

Agreement across sources. Across all evaluation modes, tweets from the subsample Manual show the strongest agreement, followed by the subsamples MeSH and DrugBank. The results indicate that tweets from the Manual category are easier to annotate than the other documents, presumably because they mostly use laypeople's vocabulary. Due to the nature of the DrugBank database, tweets from this set might be more scientific, making them more difficult to annotate.

Agreement across entities. Table 4 reports the inter-annotator F1 score (iaa) for each entity class (evaluation mode: S1T1). A1 and A2 agree most strongly on instances of medC and treat drug (.73 and .74 F1, respectively). We observe the lowest agreement for mentions of biochem process (.05 F1). We observe that the agreement for highly frequent classes is stronger than for less frequent ones. Presumably, this is because these classes are also the most concrete and therefore easier to detect. Less frequent classes (e.g., env or qol) could be considered more abstract or vague. At the same time, we presume that seeing a certain type of entity more often acts like a training effect for the annotators.

Agreement across relations. Table 5 reports the inter-annotator F1 score for each relation class (evaluation mode: R1S0T0). Across all classes, we report a macro F1 score of .35. has symptom and does not prevent are the classes with the highest agreement (.59 F1 each), followed by treats (.58 F1), and may diagnose and prevents (.56 F1 each). We observe no agreement for is contraindicated, may not diagnose, and pos/neg interaction.

The final corpus contains 2,100 tweets with labels for medical entities and the relations connecting them. Table 3 lists the number of documents with and without entities and relations. The majority of documents in the dataset contain entities: 86.2 % of all documents are labeled with at least one entity. Slightly more than half of all documents containing entities also express a relevant relation (56.5 %). The corpus consists of 93,258 words (17,559 of which are unique). The longest tweet consists of 114 words; the two shortest tweets contain 4 words each (see Table 7). On average, a tweet in our corpus has a length of 44.41 words.
There is no substantial difference between tweets from the different sampling sources.

Table 3: Number of documents with and without entities (ent) and relations (rel) for both annotators (A1, A2) and the aggregated dataset (agg). Values in parentheses report the respective percentages. For relations, this is w.r.t. all instances which contain entities.

The following sections describe our dataset in more detail. We present corpus statistics regarding the entity and relation class distribution. Note that we describe the aggregated version of the dataset. Table 4 shows the number of instances per entity class. We include the statistics for both annotators (A1, A2) and for the adjudicated dataset. Additionally, we report the statistics for the whole corpus (full) and divided by the method the documents were sampled with (DrugBank, MeSH terms, Manual). The dataset contains 6,324 entities. The largest entity class is medical conditions (3,553 instances), followed by mentions of treat drug (1,240). The remaining entity classes are substantially less frequent. env pollution has the smallest number of instances (5). Annotators label approx. 3.01 entities per document.

Entities across sources. Mentions of medical conditions are more frequent in tweets from the subsamples MeSH and Manual (1,458 and 1,367, respectively) than in the DrugBank sample (728). Tweets from the DrugBank set exhibit the majority of mentions of treat drug as well as biochem substance entities (1,035 and 163, respectively). Notably, mentions of the second treatment-related entity class, treat therapy, are more frequent in tweets from the MeSH and Manual samples. These results confirm that tweets in the DrugBank sample more frequently discuss treatments and therefore exhibit a high number of drug and biochemical entities. treat therapy captures more general treatment descriptions than specific mentions of drugs. Regarding the subsample Manual, we presume that the high frequency of therapy mentions indicates that laypeople speak in more general terms about treatments.

Table 5 reports the number of annotated relations for each class. We calculate the statistics for both annotators (A1, A2) and for the adjudicated data. We report the numbers of relations for the full corpus as well as for each of the three subsamples (DrugBank, MeSH, Manual). In total, the corpus contains 2,959 relations. The cause of relation is the most frequent (983), followed by treats (500), is type of (336), and pos influence (263). worsens is the class with the lowest frequency (1 instance). For relations which can be either positive or negative, the negative relations are always less frequent. On average, a document in our dataset contains 1.41 relations.

Relations across sources. While documents from the subsamples DrugBank and MeSH show relatively equal numbers of total relations (1,043 and 1,081, respectively), the Manual subsample has the fewest relations (835). cause of relations are most frequent in the subsamples MeSH (407) and Manual (331). In the DrugBank set, treats is the most prevalent relation class (277). Notably, for the Manual set, we find that cause of is by far more frequent than any other relation. All other classes count (mostly substantially) fewer than 100 instances each.

We introduce and describe BEAR, a corpus of 2,100 medical tweets annotated with a detailed set of biomedical entities and the relations connecting them.
Both the entity and relation classes are motivated by the need to capture fine-grained aspects of patients' medical journeys. In our annotation study, we show that tweets hold this type of information and that non-expert annotators can detect it reasonably well. With this dataset, we lay the groundwork to develop entity and relation extraction systems that give medical professionals access to patient narratives which are not covered in scientific texts. This includes quality-of-life assessments, perceptions of risk factors, unconventional treatments, or self-diagnoses that people might feel uncomfortable sharing with their doctors or consider irrelevant. Such systems could help answer detailed questions like "How does chemotherapy affect the social life of breast cancer patients?" or "Which habits serve as coping mechanisms for people suffering from depression?".

Table 5: Number of annotated relations and inter-annotator F1 (iaa) per class. We report the statistics across the whole corpus (full) as well as divided by the method the documents were sampled with (DB = DrugBank, MeSH = Medical Subject Headings, Manual = manually researched medical keywords). Within each sampling method we report the statistics for annotator 1 and 2 and for the aggregated dataset. Reported agreement scores (iaa, evaluation mode R1S0T0) are for all instances across the full corpus.

We provide an aggregated version of the dataset which adjudicates both annotators' results. In general, our strategy is motivated by a high-recall approach to ensure we do not lose any annotated perspectives on the data. When combining the annotations, we choose the longest overlapping sequence between two instances. We prefer more frequent entity and relation classes over less frequent ones, and choose more general concepts over more specific ones. We aggregate in two steps by first aligning the entity annotations, followed by aggregating the relations.

Entities. With regard to the entity span, we use the longest overlapping span between A1's and A2's annotations. In cases in which they disagree on the entity type, we choose the more frequent class. Exceptions are the entity classes treatment and biochem. For those classes, one subgroup is more general than the other. If both annotators agree on the major class (treat) but disagree on the subtype (drug vs. therapy), we aggregate to the more general one, i.e., treat therapy or biochem substance. For cases in which one annotator labeled an entity as other while the second annotator chose a different entity class, we aggregate to the more frequent entity class. However, if the annotator used other to model a relation, we keep the entity as other to keep the relation intact and valid. If one annotator labeled an entity but the other one did not, we generally follow a high-recall approach and add this entity to the aggregated document. However, we additionally check if the annotator who marked the entity used it to model a relation. If the relation is valid (i.e., the involved entities are allowed to be connected), we keep the entity; otherwise it is dropped.

Relations. To adjudicate the relation annotation, we identify cases in which both annotators agreed on the fact that there is any type of relation between a given entity pair. First, we check if the relation tags are valid (i.e., the involved entities are allowed to be connected). If one of them is invalid, we choose the valid one for the aggregated version. If both are invalid, the relation is dropped.
If they are both valid, we choose the more frequent relation class. One exception to this rule concerns cases in which one annotator identified an other relation while the second annotator chose a different relation class. Here, the tag other indicates a vague relation, which is not in line with our aim of adjudicating to the more specific class. Therefore, we cannot resolve this by simply assigning the more frequent label, because some of the small relation classes are less frequent than the class other. A1 and A2 consequently revisit those cases (11 instances) and decide jointly which relation type should be added to the aggregated version. For annotations in which A1 and A2 only agreed on one of the involved entities, we follow a high-recall approach and keep both relations for the adjudicated version of the data as long as the relations are valid. Finally, we consider cases in which one annotator did not label any relation while the other identified one. For those, we hypothesize that they are ambiguous and that the missing relation reflects that (i.e., the relation marked by one of the annotators might be covering a political claim about a medical topic). In an effort not to lose these borderline cases, we add them to the aggregation as long as the relation is valid.
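A minimal sketch of these aggregation rules, assuming token-offset spans, interpreting "longest overlapping sequence" as the union of two overlapping spans (an assumption), and showing only an illustrative subset of the entity-pair constraints from Figure 2:

```python
from collections import Counter
from typing import Optional, Tuple

# Illustrative subset of the schema from Figure 2: which relation classes are
# allowed between which (head, tail) entity types. The full mapping is larger
# and also contains the negative variants of the relations.
ALLOWED_RELATIONS = {
    ("treat", "medC"): {"treats", "worsens", "prevents", "causes",
                        "contraindicates", "prescribed", "influences"},
    ("medC", "treat"): {"side effect of"},
    ("diag", "medC"): {"diagnoses"},
}


def merge_spans(span_a: Tuple[int, int], span_b: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """If two annotated spans overlap, keep the longest overlapping sequence,
    here taken to be the union of both spans (high-recall aggregation)."""
    if span_a[0] <= span_b[1] and span_b[0] <= span_a[1]:
        return (min(span_a[0], span_b[0]), max(span_a[1], span_b[1]))
    return None  # no overlap: both annotations are handled separately


def pick_entity_type(type_a: str, type_b: str, class_counts: Counter) -> str:
    """On a type disagreement, prefer the more frequent entity class."""
    if type_a == type_b:
        return type_a
    return type_a if class_counts[type_a] >= class_counts[type_b] else type_b


def relation_is_valid(head_type: str, tail_type: str, relation: str) -> bool:
    """Check whether the involved entity types are allowed to be connected."""
    return relation in ALLOWED_RELATIONS.get((head_type, tail_type), set())
```

The special cases described above (the treat and biochem subtypes, entities labeled as other, and relations annotated by only one annotator) would sit on top of these helpers.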
References

- Causal relationship extraction from biomedical text using deep neural models: A comprehensive survey
- TwiMed: Twitter and PubMed comparable corpus of drugs, diseases, symptoms, and their relations
- COMETA: A corpus for medical entity linking in the social media
- Overview of the MEDIQA 2021 shared task on summarization in the medical domain
- Navigating the kaleidoscope of COVID-19 misinformation using deep learning
- Social media as a measurement tool of depression in populations
- Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in twitter posts
- COVID-19 Twitter monitor: Aggregating and visualizing COVID-19 related trends in social media
- Extracting health-related causality from twitter messages using natural language processing
- Can social media data lead to earlier detection of drug-related adverse events? Pharmacoepidemiology and drug safety
- Transfer learning for biomedical named entity recognition with neural networks
- Deep learning with word embeddings improves biomedical named entity recognition
- COVIDLies: Detecting COVID-19 misinformation on social media
- Agreement, the f-measure, and reliability in information retrieval
- Weibo-COV: A large-scale COVID-19 social media dataset from Weibo
- Identifying diseases, drugs, and symptoms in twitter
- Cadec: A corpus of adverse drug event annotations
- Did you really just have a heart attack? Towards robust detection of personal health mentions in social media
- Detecting personal medication intake in Twitter: An annotated corpus and baseline classification system
- The SIDER database of drugs and side effects
- BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining
- A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction
- Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
- DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter
- FANG-COVID: A new large-scale benchmark dataset for fake news detection in German
- Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features
- A model for mining public health topics from Twitter
- COVID-fact: Fact extraction and verification of real-world claims on COVID-19 pandemic
- Relation extraction from clinical texts using domain invariant convolutional neural network
- Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter
- Extracting medical entities from social media
- From witch's shot to music making bones - resources for medical laymen to technical language and vice versa
- Using Neural Networks for Relation Extraction from Biomedical Literature
- Finding potentially unsafe nutritional supplements from user reviews with topic modeling
- Towards confidence estimation for typed protein-protein relation extraction
- 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text
- Medical relation extraction with manifold models
- Gender-related data missingness, imbalance and bias in global health surveys
- Social media analysis for ehealth and medical purposes
- Mining health social media with sentiment analysis
- A scalable framework to detect personal health mentions on Twitter

Acknowledgments

This research has been conducted as part of the FIBISS project, which is funded by the German Research Council (DFG, project number: KL 2869/5-1). We thank our annotators for their hard work and tireless attention to detail.