key: cord-0774766-eipbbons
authors: Dey, Vishal; Krasniak, Peter; Nguyen, Minh; Lee, Clara; Ning, Xia
title: A Pipeline to Understand Emerging Illness via Social Media Data Analysis: A Case Study on Breast Implant Illness
date: 2020-08-25
journal: JMIR Med Inform
DOI: 10.2196/29768
sha: fac717d3b6d7b3e6de76aa4f8532ff57ad4eda50
doc_id: 774766
cord_uid: eipbbons

Background: A new illness can first come to public attention over social media before it is medically defined, formally documented or systematically studied. One example is a phenomenon known as breast implant illness (BII), which has been extensively discussed on social media but only vaguely defined in the medical literature. Objectives: The objective of this study is to construct a data analysis pipeline to understand emerging illnesses using social media data, and to apply the pipeline to understand key attributes of BII. Methods: We constructed a pipeline for social media data analysis using Natural Language Processing (NLP) and topic modeling. We extracted mentions related to signs/symptoms, diseases/disorders and medical procedures from social media data using the Clinical Text Analysis and Knowledge Extraction System (cTAKES). We mapped the mentions to standard medical concepts. We summarized the mapped concepts into topics using Latent Dirichlet Allocation (LDA). Finally, we applied this pipeline to understand BII from several BII-dedicated social media sites. Results: Our pipeline identified topics related to toxicity, cancer and mental health issues that are highly associated with BII. Our pipeline also shows that cancers, autoimmune disorders and mental health problems are emerging concerns associated with breast implants based on social media discussions. The pipeline also identified mentions such as rupture, infection, pain and fatigue as common self-reported issues among the public, as well as toxicity from silicone implants. Conclusions: Our study could inspire future work studying the suggested symptoms and factors of BII. Our study provides the first analysis and derived knowledge of BII from social media using NLP techniques, and demonstrates the potential of using social media information to better understand similar emerging illnesses.

The ubiquity of social media has resulted in early descriptions of new and evolving diseases over social media platforms before they can be systematically studied [1] [2] [3] [4] [5] [6] [7], particularly during the era of medical internet. [8] [9] [10] [11] [12] [13] [14] Social media users increasingly turn to platforms like Twitter, Facebook, YouTube, etc., either to share personal experiences, including diseases and illnesses they experienced, or to seek support and resources, including health and medical resources. Recent studies showed the potential of social media in the detection of mental illness and depression [15] [16] [17], and in the early detection of food-borne illnesses [18] [19] [20] and other infectious diseases. [2, 21, 22, 23, 24] Furthermore, several studies demonstrated social media as an effective tool to disseminate information regarding symptoms, personal well-being and public health resources during multiple influenza outbreaks. [25] [26] [27] [28] During the early stages of COVID-19, studies [4, 29, 30] analyzed Sina Weibo (a major Chinese microblogging site) posts to characterize patient symptoms and public concerns in multiple provinces of China. Based on the analysis of Weibo posts, Huang et al. [30] concluded that most of the affected patients were elderly, with fever as the most common symptom. These studies demonstrate that public social media data can be leveraged to better understand emerging illnesses and to accommodate prompt responses.

One particular new illness we study in this manuscript is breast implant illness (BII). Breast implants have gained popularity over the last 20 years. [31] More than 400,000 women have breast augmentation or post-mastectomy surgeries every year in the US. [32] There was a 4% increase in the number of breast augmentation procedures during 2017-2018, and over the same period a 6% increase in breast implant removal procedures. [32] Concerns about the safety of breast implants have also arisen [33] [34] [35] [36] [37] [38] and persisted. [39] [40] [41] [42] [43] [44] [45] Although a causal link between breast implants and systemic diseases has not been definitively shown, a phenomenon called "breast implant illness", which attributes systemic symptoms to breast implants, has emerged. [46] Unlike other new medical illnesses, however, BII has been reported minimally in the medical literature and has primarily come to attention on social media. [11, 47, 48, 49, 50] For example, a recent analysis [49] demonstrated increasing public interest in BII based on Twitter and Google Trends data from February of 2018 to 2019. In an attempt to summarize the key symptoms, diseases and disorders defining BII, several cohort studies [51, 52] analyzed patient-reported outcomes before and after breast explant surgeries. These studies showed some potential relations between explant surgeries and improvement of specific symptoms in the patient population. Unfortunately, these studies were not definitive because of limitations in study design, including the lack of control groups, data collection bias, and the lack of randomization. The lack of medical knowledge about BII makes it difficult to define the condition and therefore nearly impossible to conduct rigorous epidemiological or clinical studies of it. BII is just one disease process for which the lack of medical knowledge is apparent; there are many other new illnesses for which this is the case. Any initial knowledge that is supported by a sufficient amount of social media data would be meaningful as a reference for future, formal studies, and thus techniques to discover such knowledge are highly needed.

Toward identifying and summarizing key attributes of a new illness, in this study we constructed a data analysis pipeline for social media data about BII. The pipeline incorporates Natural Language Processing (NLP) and topic modeling methods. Our primary contribution is deriving novel knowledge about BII, a medical condition that has not yet been systematically studied and defined in the medical literature, by constructing a data analysis pipeline and applying it to social media data. Given that the medical knowledge and literature on BII have not been established, and the related concepts are not well defined or accepted, using social media data to understand emerging issues could be a meaningful starting point. We applied this pipeline to better understand the symptoms and signs that are associated with BII. To the best of our knowledge, our study provides the first analysis and derived knowledge of BII from social media data. It demonstrates the potential of using social media information to better understand conditions that have primarily been reported on social media. It also demonstrates the effectiveness of our pipeline and its potential to be applied to understand other new illnesses. In the following, we discuss our pipeline within the context of BII. However, our pipeline is not specific to BII and is applicable to other illnesses.

We collected and used data from the following social media websites. These websites were selected because they are dedicated to BII discussions and information and have focused user groups with an interest in BII. Often, dedicated social media websites (e.g., forums, Twitter sites) are available for a particular illness or disease. For example, some dedicated websites [53] [54] [55] contain stories and experiences of patients fighting different cancers; there are dedicated websites [56,57] containing various posts and stories by users experiencing chronic pain and illness; others [58-60] contain stories and experiences from COVID-19 survivors. Below are the social media sources used in our study:

• https://www.breastimplantillness.com: this is a dedicated, public website with articles on BII-related topics that offers resources related to implant and explant procedures, etc. This website also allows individuals to post their experiences and concerns on breast implants and related health issues. We extracted individuals' posts from the website (up to 05/10/2019), and the resulting dataset is referred to as BIIweb.
• https://healingbreastimplantillness.com: this website contains information on post-implant disorders, post-explant healing, breast implant safety, etc. The discussion board of this website has multiple posts and comments on symptoms, signs, etc., that are experienced by individuals with a breast implant or those who have undergone an explant. The dataset extracted from the discussion board of this website (up to 05/10/2019) is referred to as HealingBII.
• https://www.instagram.com/explore/tags/breastimplantillness: this website contains the collection of publicly available Instagram posts that used "breastimplantillness" as a hashtag. We extracted the associated text of each Instagram post with a timestamp between 01/10/2012 and 09/04/2019. The dataset extracted from this site is referred to as IG-BII.

Table 1. Summary of the collected social media data.

Dataset       #posts (a)   lmax (b)   lmin (c)   lavg (d)   #words (e)
BIIweb               187        669          3        129       24,191
HealingBII         1,920      1,330          1         85      165,090
IG-BII            28,987        515          1        123    3,581,081

In this table, (a) #posts indicates the number of posts/comments in the respective dataset; (b) lmax and (c) lmin indicate the maximum and minimum length of a post in terms of words; (d) lavg indicates the average length of posts in terms of words; (e) #words indicates the total number of words in the respective dataset.

All comments/posts from the three websites are included in the corresponding datasets. Table 1 presents a summary of the collected social media data. The BIIweb dataset has only 187 posts, but its posts are longer on average (larger lavg) than those in the other two datasets. HealingBII is the second largest dataset with 1,920 posts, each with 85 words on average. IG-BII is the largest dataset with 28,987 posts and 123 words per post on average.

Figure 1 presents an overview of the pipeline. We extracted major topics of interest, primarily related to symptoms, diseases and medical procedures, from our datasets through the following three steps. Each of the steps is discussed in detail later.

Step 1. Data preprocessing: We removed all stop-words, numeric characters, hyperlinks, hashtags, etc., and converted all the remaining characters into lowercase.

Step 2. Mention extraction and concept mapping: We extracted mentions related to signs/symptoms, diseases/disorders and medical procedures using the Clinical Text Analysis and Knowledge Extraction System (cTAKES). [61] Extracted mentions are further mapped to standard medical concepts that are represented by concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS) [62] ontology.

Step 3. Topic modeling: We summarized mapped concepts to topics using Latent Dirichlet Allocation (LDA). [63] LDA is a probabilistic generative model for topic modeling. It represents each document as a mixture over latent topics, where each topic is modeled as a distribution over words.

Step 3a. Mention replacement: We replaced each extracted mention in the posts with its mapped CUIs and discarded all the other words in the posts. We will discuss this step in more detail in section 'Topic modeling' later.

Step 3b. Topic modeling using LDA: Given the corpus of mapped CUIs, LDA generates documents-topics and topics-CUIs probability distributions. We will discuss this step in more detail in section 'Topic modeling' later.

Step 3c. Analysis and evaluation: We further analyzed these distributions to derive a list of topics using most representative mentions, and summarized extracted mentions for each dataset. We will discuss this step in more detail in section 'Results: LDA topics' later. 

We used the NLTK tokenizer [64] to tokenize the raw texts of each dataset. From the obtained tokens, we removed the stop-words (the most frequently occurring function words, such as conjunctions, prepositions and determiners) using the NLTK English stop-words list. Since stop-words carry little or no information on our topics of interest in BII, they can be safely removed, as is typically done in NLP. We also removed all numeric characters, emojis, non-ASCII characters, hyperlinks, hashtags and Instagram handles using regular expression matching, and converted all the remaining tokens to lowercase to unify different cases for downstream processing.
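The preprocessing described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the study: the regular expressions, the order of operations and the tiny STOP_WORDS set are assumptions made for the example (the study used the NLTK tokenizer and the full NLTK English stop-words list).

```python
import re

# Tiny illustrative stop-word set (assumption); the study used the full
# NLTK English stop-words list instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "my", "with", "for", "on", "i", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
TAG_RE = re.compile(r"[#@]\w+")                 # hashtags and Instagram handles
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")     # emojis and other non-ASCII characters
DIGIT_RE = re.compile(r"\d+")                   # numeric characters
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")    # simple word tokenizer (assumption)


def preprocess(post: str) -> list[str]:
    """Return lowercase tokens with noise and stop-words removed."""
    text = URL_RE.sub(" ", post)
    text = TAG_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For instance, a post containing a year, an emoji, a hashtag, a handle and a link would be reduced to its lowercase content words only.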

Mention extraction refers to the extraction of words/phrases that convey a medical concept. We used the cTAKES tool for mention extraction. cTAKES is an open-source NLP tool for clinical information extraction from unstructured clinical texts. It extracts mentions (i.e., words/phrases that convey a medical concept) from posts and maps those mentions to standard medical concepts. While extracting the mentions, cTAKES also automatically classifies each of them into one of five cTAKES categories: sign/symptom, disease/disorder, medication, procedure and anatomy. For example, in the sentence "Over the years my tinnitus, has become worse to almost debilitating levels", cTAKES extracts "tinnitus" as a mention of the sign/symptom category. Below, we discuss in detail how we configured cTAKES.

We used the fast-dictionary-lookup annotator in cTAKES to extract mentions from the processed data. This annotator identifies and extracts mentions in texts and normalizes them into CUIs in the UMLS standard medical ontology. This normalization of extracted mentions into CUIs is referred to as concept mapping. Each CUI in the UMLS ontology uniquely identifies a medical concept. Hence, we represented extracted mentions using the standard medical concepts identified by the CUIs to which cTAKES maps them. We configured the annotator to use exact string matching and the all-term-persistence property. With this property, the annotator retains all terms irrespective of the semantic properties of each term. For example, for the text "back pain", the annotator annotates the generic term "pain" as well as the precise term "back pain". We chose the all-term-persistence property to retain maximum information with respect to both precise and generic medical concepts. Finally, the annotator stores the generated annotations in XMI files.

In order to obtain the annotations in a human-readable format from the XMI files, we performed the following steps, as shown in Figure 2. We used a custom interpreter to process the XMI files produced by cTAKES and to obtain the mappings between mentions and CUIs. We first looked for UmlsConcept XML identifiers in the XMI files, where each UmlsConcept XML identifier is generally grouped under the FSArray, and each FSArray is associated with a single ontology concept and the category of the concept. Each such concept is assigned one of the five cTAKES categories: sign/symptom, disease/disorder, medication, procedure and anatomy. Each ontology concept is further associated with a UMLS CUI and an ontologyConceptArr identifier. Note that a mention can be mapped to multiple CUIs. For example, the mention "allergic reaction" is categorized as sign/symptom but mapped to two different CUIs, "C1527304" and "C0020517". Then, we extracted those ontology concepts that describe any of these categories: diseases/disorders, signs/symptoms and medical procedures. Finally, we used the begin and end markers associated with each ontologyConceptArr identifier to obtain the position of the annotated mention in the input post.

In this work, we are only interested in three of the five categories (i.e., sign/symptom, disease/disorder and procedure) in order to understand BII-related issues. Hence, we only used the mentions that are categorized into any of these three categories.
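To make the exact-match lookup with all-term persistence and the subsequent category filter more concrete, the following is a toy sketch in Python. It does not call cTAKES and does not reproduce its implementation: the dictionary, the function names and the placeholder identifiers prefixed with CUI_ are hypothetical. Only the two CUIs for "allergic reaction" are taken from the text above.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    begin: int      # character offset where the mention starts
    end: int        # character offset just past the mention
    mention: str
    cui: str
    category: str


# Hypothetical mini-dictionary: term -> list of (CUI, category).
# Identifiers starting with "CUI_" are placeholders, not real UMLS CUIs.
TOY_DICTIONARY = {
    "pain": [("CUI_PAIN", "sign/symptom")],
    "back pain": [("CUI_BACK_PAIN", "sign/symptom")],
    "back": [("CUI_BACK", "anatomy")],
    "allergic reaction": [("C1527304", "sign/symptom"), ("C0020517", "sign/symptom")],
}
KEPT_CATEGORIES = {"sign/symptom", "disease/disorder", "procedure"}


def annotate(text, dictionary=TOY_DICTIONARY, max_words=3):
    """Exact-match lookup that keeps every matching span, including nested ones."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    annotations = []
    for i in range(len(words)):
        for j in range(i, min(i + max_words, len(words))):
            begin, end = words[i][0], words[j][1]
            span = text[begin:end]
            for cui, category in dictionary.get(span, []):
                annotations.append(Annotation(begin, end, span, cui, category))
    return annotations


def keep_relevant(annotations):
    """Keep only the three categories of interest."""
    return [a for a in annotations if a.category in KEPT_CATEGORIES]
```

In this sketch, "back pain" yields both the precise term "back pain" and the generic term "pain", while the anatomy term "back" is found by the lookup but dropped by the category filter.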

In order to conduct topic modeling, we processed the posts as follows: we substituted each mention in the posts with its mapped CUIs and discarded all the other words in the posts, which were either considered non-medical concepts by cTAKES or not among the three categories of interest. If a mention was mapped to multiple CUIs, we replaced that mention with all of those CUIs. If multiple mentions were mapped to the same CUI, we replaced all such mentions with that CUI. In this way, each post was represented as a bag-of-CUIs, instead of a collection of mentions, as the input to the topic modeling, and our vocabulary consisted of CUIs. Upon topic modeling, we interpreted the topic-CUI distribution to derive the topics.
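The mention replacement above amounts to a small transformation, sketched below in Python. The input layout (a list of tuples per post) and the placeholder identifier CUI_JOINT_PAIN are assumptions made for illustration; only the two CUIs for "allergic reaction" come from the text.

```python
def bag_of_cuis(mentions):
    """Turn one post's annotated mentions into a bag of CUIs.

    mentions: list of (begin_offset, mention_text, [cui, ...]) tuples.
    Every non-mention word is implicitly discarded, a mention mapped to
    several CUIs contributes all of them, and different mentions mapped
    to the same CUI each contribute that CUI.
    """
    bag = []
    for _begin, _text, cuis in sorted(mentions, key=lambda m: m[0]):
        bag.extend(cuis)
    return bag


# Hypothetical annotated post: two lexical variants share one placeholder CUI.
example = [
    (20, "painful joints", ["CUI_JOINT_PAIN"]),
    (0, "allergic reaction", ["C1527304", "C0020517"]),
    (40, "joint aches", ["CUI_JOINT_PAIN"]),
]
print(bag_of_cuis(example))
```

Because LDA treats documents as bags of words, the ordering by offset here is only for readability; the counts per CUI are what matter downstream.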

We used Latent Dirichlet Allocation (LDA) [63] to learn the topic distribution of each post and the CUI distribution of each topic. LDA is a generative probabilistic model for modeling topics within a document corpus. LDA models each document in the corpus as a mixture of latent topics, where each topic is modeled as a distribution over the words in all the documents. LDA derives the optimal distributions by maximizing the likelihood of observing the corpus under the respective distributions. A brief description of LDA is provided in the Supplementary Materials. In our experiments, a bag-of-CUIs generated as above was used as a document in LDA, and the CUIs were the words in the document. We used the lda-c software [65], a very efficient implementation of the LDA method, to conduct the topic modeling.
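As an illustration of how bags of CUIs could be handed to lda-c, the sketch below writes each document as a sparse count vector. It assumes the input format described in the lda-c README, where each line has the form "[M] [term_1]:[count] ... [term_N]:[count]" with integer term indices and M the number of unique terms in the document. The function names and toy CUIs are hypothetical, and this is not the conversion code used in the study.

```python
from collections import Counter


def build_vocabulary(bags):
    """Assign each CUI an integer index in order of first appearance."""
    vocab = {}
    for bag in bags:
        for cui in bag:
            vocab.setdefault(cui, len(vocab))
    return vocab


def to_ldac_lines(bags, vocab):
    """Encode each bag as '<#unique terms> <index>:<count> ...'."""
    lines = []
    for bag in bags:
        counts = Counter(vocab[cui] for cui in bag)
        pairs = " ".join(f"{idx}:{n}" for idx, n in sorted(counts.items()))
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return lines


# Toy corpus of two posts with placeholder CUIs.
bags = [["CUI_A", "CUI_B", "CUI_A"], ["CUI_B", "CUI_C"]]
vocab = build_vocabulary(bags)
for line in to_ldac_lines(bags, vocab):
    print(line)
```

Keeping the vocabulary mapping alongside the encoded file makes it possible to translate topic-term indices back to CUIs, and from there to mentions, after training.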

When LDA is used for topic modeling of general documents (e.g., news, scientific literature), the words in the documents and their frequencies are used in LDA. However, in our analysis, we aim to understand the medical concepts related to BII from social media texts. Different words may indicate the same medical concept. For example, joint aches, painful joints, arthralgia and aching joints all indicate joint pain and are associated with a single medical concept represented by a single CUI. Therefore, instead of using words, we used the medical concepts, which are represented by CUIs, in our LDA analysis. Since multiple words indicating the same medical concept can be mapped to the same CUI, using CUIs also aggregates and strengthens the information from those words, compared to using the words themselves, which may be sparse and thus difficult to learn topics from.

Table 2 presents the summary statistics on the annotated mentions and their mapped CUIs by cTAKES. In BIIweb, cTAKES extracted 2,186 mentions and mapped them to 475 unique CUIs. In HealingBII, cTAKES extracted 11,080 mentions and mapped them to 1,177 unique CUIs. In the largest dataset, IG-BII, cTAKES extracted 5,530 unique mentions and mapped them to 2,871 unique CUIs. Note that the same mention can be mapped to multiple CUIs and can have multiple categories (each CUI has only one category). For example, the mention "flashes" is mapped to two different CUIs and thus to two different categories: diseases and medical procedures. Table 2 also presents the statistics of each category of extracted mentions. For each dataset, most of the extracted mentions are categorized as signs/symptoms by cTAKES.

In order to determine whether cTAKES can sufficiently extract relevant mentions, we performed a manual annotation and compared the two lists of extracted mentions: one from cTAKES and the other from the manual annotation. We randomly sampled 50 posts from each of the three datasets and manually annotated those posts. In the manual annotation, we extracted mentions (words/phrases) that convey concerns and experiences of social media users involving BII-related symptoms, diseases and medical procedures. Given the high overlap between the manual annotation and the cTAKES results across the datasets used in our study, it is reasonable to assume that cTAKES is a reasonable surrogate for manual annotation when studying BII through social media data.

In order to identify the best topic models, we used grid search over the Dirichlet prior, with values in {0.01, 0.05, 0.1, 0.5, 1, 1.5, 2, 5, 10, 15, 20, 25}, and the number of topics, with values in {3, 4, 5, 10, 15, 20}. In order to evaluate the topic models, we analyzed each LDA topic modeling result for every combination of Dirichlet prior and number of topics corresponding to low perplexity scores. [63, 66, 67] For each topic modeling result, we analyzed the document-topic and topic-CUI probability distributions to derive topics and their respective top-10 representative mentions. The top-10 representative mentions for a given topic are the most frequent mentions corresponding to the top-10 CUIs with the highest probabilities of belonging to the topic. Note that multiple mentions can be mapped to a given CUI (Table 2). We only presented the most frequent mention because all the mentions mapped to the same CUI have similar semantics. We further evaluated the quality of a topic model based on how well the derived topics summarize the most representative mentions. Among all the combinations, we chose the one for which the derived topics were distinct and best summarized the most representative mentions. Finally, we identified distinct and meaningful topics using (i) 4 topics and a Dirichlet prior of 10 for BIIweb, (ii) 5 topics and a Dirichlet prior of 10 for HealingBII and (iii) 5 topics and a Dirichlet prior of 1.5 for IG-BII. We observed that with higher values, the most representative mentions were similar across topics. Hence, the derived topics were not distinct and were difficult to interpret.

Example: "Even more heartbreaking and discouraging, has been the emotional pain of not being able to freely play with her on the floor due to hip and knee pain, along with leg and foot spasms… but I struggle with many feelings of failure as a wife and mother due to physical limitations."

Tables 3, 4 and 5 present the top-10 representative mentions, the frequencies of the CUIs corresponding to the mentions (in %), and the interpretations of the topics indicated by the mentions (e.g., common signs/symptoms). Note that the frequencies of CUIs are computed among all the posts, not only among those posts with the highest probability of belonging to a certain topic. We presented these frequencies because each post has a certain probability of belonging to each topic, and thus frequencies among all the posts should better represent the topic information across all the posts. These tables also present examples of posts that have high probabilities of belonging to the respective topic.
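The parameter grid and the derivation of representative mentions described above can be sketched as follows. The grid values are those stated in the text; the function name, the data layout (one dictionary of CUI probabilities per topic, and mention counts per CUI) and the toy inputs are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# Grid stated in the text: 12 Dirichlet prior values x 6 topic counts.
PRIORS = [0.01, 0.05, 0.1, 0.5, 1, 1.5, 2, 5, 10, 15, 20, 25]
NUM_TOPICS = [3, 4, 5, 10, 15, 20]
GRID = list(product(NUM_TOPICS, PRIORS))


def representative_mentions(topic_cui_probs, mentions_by_cui, top_n=10):
    """Most frequent mention for each of the top-n CUIs of one topic.

    topic_cui_probs: dict mapping CUI -> probability under the topic.
    mentions_by_cui: dict mapping CUI -> Counter of mention strings.
    """
    top_cuis = sorted(topic_cui_probs, key=topic_cui_probs.get, reverse=True)[:top_n]
    return [mentions_by_cui[cui].most_common(1)[0][0] for cui in top_cuis]


# Toy inputs with placeholder CUIs.
probs = {"CUI_A": 0.5, "CUI_B": 0.3, "CUI_C": 0.2}
mentions = {
    "CUI_A": Counter({"joint pain": 5, "arthralgia": 2}),
    "CUI_B": Counter({"fatigue": 9}),
    "CUI_C": Counter({"anxiety": 3, "anxious": 1}),
}
print(representative_mentions(probs, mentions, top_n=2))
```

The returned list is ordered by topic-CUI probability, which matches how the mentions are ordered in the topic tables, whereas the percentages reported there are corpus-wide CUI frequencies and would be computed separately.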

In the examples, the mentions that have high probabilities of belonging to the corresponding topics are underlined. Note that we used CUIs in LDA to derive the topic and word distributions (as discussed in the section 'Methods - Topic modeling'), but in these tables we present the most frequent mentions (with clear semantics) that were mapped to the respective CUIs (which are identifiers without semantics). The mentions in these tables are sorted based on the probabilities of their corresponding CUIs belonging to the respective topics. Please note that these probabilities are not presented in the tables (they are not the frequencies presented in the tables). Therefore, each topic is represented by its most representative mentions and thus summarizes such mentions. For example, we interpret a topic as pain and other signs if there is a significant number of mentions related to pain, such as neck pain, chest pain, headache, etc. Please note that the topics are not sorted and the first columns in Tables 3-5 are only nominal identifiers. Below, we discuss the topics derived by LDA for the BIIweb and HealingBII datasets from the original posts. Note that two topics can still share the same representative mention with different probabilities in LDA.

Table 3 presents the topics in the BIIweb dataset. Although BIIweb is the smallest dataset (Table 1), we were still able to identify four distinct topics with the most representative mentions such as fatigue, infection, toxicity and anxiety.

Table 4 presents the topics in the HealingBII dataset. HealingBII shares some topics and representative mentions with BIIweb. For example, pains, cancers and toxicity are common across these two datasets. However, a focused topic unique to HealingBII is surgeries and procedures, where people (mostly patients) discuss the procedures among themselves and share related experiences. Another topic unique to HealingBII is mental health.

In addition to physical symptoms, individuals reported significant emotional and mental difficulties such as depression and expressed their serious symptoms in social media. Table 5 presents the topics in dataset IG-BII. IG-BII is the largest dataset (Table 1 ) and has significantly more posts than the other two. We observed that cancers, mental health and toxicity emerged as significant topics in this large dataset, quite consistently with those in HealingBII. In IG-BII, people also discussed their recovery process from issues/events associated with breast implant illness. We identified from these three datasets frequent mentions of rupture, pains and fatigue.

We also identified mentions of cancer, lupus and autoimmune disorders. Please note that Table 3 has four topics for BIIweb, but Tables 4 and 5 have five topics for HealingBII and IG-BII, respectively. This is because the number of topics is determined by how distinct the topics are, not by a pre-specified number of topics.

Table 6 presents the top-10 representative mentions, the frequencies of the CUIs corresponding to the mentions (in %), and the interpretations of the topics on the unified dataset, which combines all three datasets BIIweb, HealingBII and IG-BII. We obtained the unified dataset by combining all the posts from the three datasets into one corpus. In order to perform topic modeling, we processed the posts in the unified dataset in the same way as we processed the posts in the individual datasets (discussed in the section 'Methods - Topic modeling'). Upon topic modeling, we identified five distinct topics using 5 topics and a Dirichlet prior of 1.5. We observed that physical health, cancers, mental health, toxicity and common disorders emerged as significant topics in the unified dataset, quite consistently with those in IG-BII. This is because IG-BII is the largest of the three datasets and comprises more than 90% of the unified dataset. We also identified common concerns such as pain, allergy, depression, weight gain, cancer, inflammation and toxicity issues from the individual and the unified datasets. This implies that the above mentions are frequently associated with breast implant illness.

Table 7 presents the percentage of posts per topic, where a post is considered as belonging to a topic if, among all the topics that the post has, that topic has the highest probability. Although the distributions are not completely consistent across datasets, toxicity remains a notable topic in all the datasets. This indicates the common issues significantly associated with breast implant illness. Also, pains, cancers, mental health and other disorders are substantially associated with breast implants.
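The assignment rule behind the percentage of posts per topic can be illustrated with a short Python sketch. The function name and the toy document-topic matrix are hypothetical; ties are broken in favor of the lower topic index, which is an arbitrary choice made for this example.

```python
from collections import Counter


def posts_per_topic(doc_topic_probs):
    """Percentage of posts whose most probable topic is each topic.

    doc_topic_probs: list of rows, one per post, each row holding the
    post's probability for every topic.
    """
    n_topics = len(doc_topic_probs[0])
    winners = Counter(
        max(range(n_topics), key=row.__getitem__) for row in doc_topic_probs
    )
    total = len(doc_topic_probs)
    return [100.0 * winners[k] / total for k in range(n_topics)]


# Toy document-topic distributions for four posts and three topics.
toy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]
print(posts_per_topic(toy))
```

Note that this hard assignment discards the remaining probability mass of each post, which is one reason the corpus-wide CUI frequencies are reported alongside it.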

In order to understand signs, symptoms and diseases/disorders associated with breast implant illness, a condition reported primarily in social media rather than medical reports, we collected social media posts and analyzed them using NLP and topic modeling. We extracted mentions related to signs/symptoms, diseases/disorders and medical procedures using cTAKES, mapped them to standard medical concepts, and summarized the mapped concepts to topics using LDA. We found that mentions such as rupture, infection, inflammation, pains and fatigue were common self-reported issues. We also found that mental health related concerns such as stress, anxiety and depression as well as diseases like cancers and autoimmune disorders were common concerns. Note that cTAKES is able to extract medication and anatomy information as well, but they were not used in our LDA analysis given that the objective of our study is not to study medications used or anatomy related to BII.

In our method, we relied on cTAKES and the rich UMLS dictionary to extract all relevant mentions including their lexical variants (synonyms, abbreviations, paraphrases). In order to determine if cTAKES can sufficiently extract relevant mentions, we performed a manual annotation to extract all relevant mentions and compared those with extracted mentions out of cTAKES. We found that cTAKES can sufficiently capture relevant medical concepts, quite comparable to the manual annotation. It is worth noting that we did not evaluate the performance of our mention extraction module on all the posts of each dataset, which is typically done using precision and recall metrics when there are ground-truth labels associated with each mention. However, in order to have such labels, it requires careful manual annotations based on domain knowledge on breast implant illness. Unfortunately, such domain knowledge on complications, symptoms and other issues associated with/caused by breast implant illness is not fully available. Actually, our goal in this study is to provide useful information from social media data that could help complement what we currently know. Therefore, in this preliminary study, we use all the annotated mentions, assuming that cTAKES enables high-quality annotations.

We acknowledge that cTAKES might not be able to extract all relevant mentions from our social media datasets. This is because cTAKES was originally designed for the extraction of medical entities from clinical notes, which have very different wording and writing styles compared to social media data. As social media data comprise informal phrases, short ambiguous texts, emoticons and a wide range of lexical variants corresponding to a single concept, cTAKES may not work flawlessly on social media data, although we observed reasonable output from cTAKES. We also observed that cTAKES often associates a single mention with multiple CUIs belonging to the same category. We think this is due to the presence of multiple mappings for a given mention in the UMLS Metathesaurus. Regardless, the extracted mentions and the mapping of mentions to UMLS CUIs as generated by cTAKES were used for topic modeling without any manual verification or evaluation. In future research, we will develop a detailed guideline to further evaluate extracted mentions before using them in topic modeling.

Our study also has some limitations. First, LDA is an unsupervised learning technique in which the number of topics is assumed to be known a priori. However, it is difficult to accurately estimate the number of topics for a given dataset. In our study, we used grid search to try different values. Even so, without full domain knowledge, it remains non-trivial to evaluate the LDA results for each value. In our study, we selected the topics based on the values of the number of topics and the Dirichlet prior. We did not use perplexity [63, 66, 67], a widely used metric in topic modeling, to select topics because, as studied in the literature (e.g., Chang et al. [68]), perplexity often does not correlate well with topic interpretability, and in our case the lowest perplexity does not always yield intuitive or meaningful topics. In future research, we will develop more rigorous ways to select the number of topics and to evaluate topic modeling results. Second, in our current study we did not perform a sentiment analysis on the posts to understand the positive or negative opinions expressed in them. We plan to include this process before topic modeling so as to generate a cleaner dataset for topic modeling.

It is also worth noting that social media data can be of variable quality (e.g., misspellings, misconceptions, biased opinions), particularly compared to medical literature data. Anyone can post to social media, and so the derived content may come from individuals who have other implant-specific issues such as capsular contracture or implant infection. Thus, understanding diseases, disorders, symptoms, signs, etc., associated with a drug, disease or medical procedure from social media data always runs the risk of confounders or errors. However, given that the medical knowledge and literature on breast implant illness have not been well established, and the related concepts are not well defined or well accepted, using social media data to understand emerging issues could be a meaningful starting point. Still, any findings from social media data require rigorous evaluation and validation based on medical and biological knowledge, experiments and clinical practice, etc. In addition, we only analyzed three websites dedicated to BII discussions, though they are the most relevant and prolific ones. Additional, more comprehensive analysis of social media data at a much larger scale would be beneficial to better understand BII from a larger, more diverse population. Sentiment analysis over social media data could be another valuable analysis to enable more insights into the health experience of users/patients and their emotions/feelings. We will consider sentiment analysis in our future research when BII is better understood and we can accurately annotate social media data.

This study has important implications for future methodological work and clinical research. Future methodological research on NLP could include causal inference between breast implant illness and the symptom/sign mentions on social media, in order to understand how they are related.

Our findings could provide the relevant domains for clinical research studies that seek to develop measures of BII and to identify its causes. More specifically, our results provide a patient-derived definition of BII, which can help clinicians treating patients with BII concerns to use patient-centered language. The methods and informatics strategies applied in this study also provide working examples for analyzing other emerging but not well-defined illnesses from social media data.

Our analysis of social media data identifies mentions such as rupture, infection, inflammation, pain and fatigue as common self-reported issues on social media sites dedicated to BII. In addition, our analysis shows that a significant number of the user comments and posts are also concerned with mental and physical health, and with toxicity issues after breast implants. The findings from our study could be used to further the scientific study of BII as well as the care of patients presenting with the described symptoms, by allowing clinicians to develop a patient-centered language to better approach patients with concerns. Our study provides the first analysis and derived knowledge of BII from social media using NLP techniques, and demonstrates the potential of using social media information to better understand emerging illnesses.

The application of internet-based sources for public health surveillance (infoveillance): Systematic review

From "Infodemics" to Health Promotion: A Novel Framework for the Role of Social Media in Public Health

Modeling spatiotemporal pattern of depressive symptoms caused by COVID-19 using social media data mining

Using Reports of Symptoms and Diagnoses on Social Media to Predict COVID-19 Case Counts in Mainland China: Observational Infoveillance Study

Social media- and internet-based disease surveillance for public health

Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data

Detecting influenza epidemics using search engine query data

Naturally occurring peer support through social media: The experiences of individuals with severe mental illness using YouTube

Mining of textual health information from Reddit: Analysis of chronic diseases with extracted entities and their relations

Tweet classification toward twitter-based disease surveillance: New data, methods, and evaluations

Twitter Social Media is an Effective Tool for Breast Cancer Patient Education and Support: Patient-Reported Outcomes by Survey

Understanding patient anxieties in the social media era: Qualitative analysis and natural language processing of an online male infertility community

Requests for Diagnoses of Sexually Transmitted Diseases on a Social Media Platform

Perceptions of infertility information and support sources among female patients who access the Internet

Detecting depression and mental illness on social media: an integrative review. Current Opinion in Behavioral Sciences

Screening internet forum participants for depression symptoms by assembling and enhancing multiple NLP methods

A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data

Potential of social media as a tool to combat foodborne illness

Health department use of social media to identify foodborne illness

Early Detection of Foodborne Illnesses in Social Media

Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data

Health information on social media helps mitigate Crohn's disease symptoms and improves patients' clinical course

Social Media Based Analysis of Opioid Epidemic Using Reddit

Dengue prediction by the web: Tweets are a useful tool for estimating and forecasting Dengue at country and city level

The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic

National and local influenza surveillance through twitter: An analysis of the 2012-2013 influenza epidemic

Google flu trends spatial variability validated against emergency department influenza-related visits

Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet

Data Mining and Content Analysis of the Chinese Social Media Platform Weibo During the Early COVID-19 Outbreak: Retrospective Observational Infoveillance Study

Mining the characteristics of COVID-19 patients in China: Analysis of social media posts

Long-term health outcomes in women with silicone gel breast implants

Silicone breast implants and the risk of autoimmune/rheumatic disorders: A real-world analysis

Cutaneous hypersensitivity-like reactions associated with breast implants: A review

Risk factor analysis for capsular contracture: A 10-year Sientra study using round, smooth, and textured implants for breast augmentation

Silicone Implant Illness: Science versus Myth? Plastic and Reconstructive Surgery

US FDA Breast Implant Postapproval Studies: Long-term Outcomes in 99,993 Patients

Risk of connective-tissue diseases and other disorders after breast implantation

An outcome analysis of 100 women after explantation of silicone gel breast implants

Meta-analyses of the relation between silicone breast implants and the risk of connective-tissue diseases

A prospective analysis of patients undergoing silicone breast implant explantation

Infectious complications following breast reconstruction with expanders and implants. Plastic and Reconstructive Surgery

Analysis of local complications following explantation of silicone breast implants

Prospective cohort study of breast implants and the risk of connective-tissue diseases

Breast Implant Illness: Symptoms, Patient Concerns, and the Power of Social Media. Plastic and Reconstructive Surgery

Facebook Facts: Breast Reconstruction Patient-Reported Outcomes Using Social Media. Plastic and Reconstructive Surgery

Breast Implant Illness: A Way Forward. Plastic and Reconstructive Surgery

Breast Implant Illness: Are Social Media and the Internet Worrying Patients Sick? Plastic and Reconstructive Surgery

En Bloc Capsulectomy for Breast Implant Illness: A Social Media Phenomenon?

Understanding Breast Implant Illness, Before and After Explantation: A Patient-Reported Outcomes Study. Annals of Plastic Surgery

Plastic and Reconstructive Surgery - Global Open

Pain Management - Symptoms, Causes, Treatments, Medications for Chronic Pain

COVID-19 Patient Stories | Johns Hopkins Medicine

Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications

The Unified Medical Language System (UMLS): Integrating biomedical terminology

Latent Dirichlet allocation

Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics

C implementation of variational EM for latent Dirichlet allocation (LDA)

Probabilistic latent semantic indexing

Correlated topic models

Reading tea leaves: How humans interpret topic models

Latent Dirichlet allocation

Xia Ning conceived the research, obtained funding for the research, and supervised Vishal Dey; Peter Krasniak, Minh Nguyen and Clara Lee provided substantial medical background and insights; Vishal Dey and Xia Ning conducted the research, including data curation, methodology design and implementation, and analysis; Vishal Dey drafted the original manuscript; Vishal Dey and Xia Ning edited the manuscript; Peter Krasniak, Minh Nguyen and Clara Lee reviewed the manuscript and provided constructive suggestions and feedback.

The authors claim no conflict of interests.

Here, we briefly describe Latent Dirichlet Allocation (LDA) [1]. LDA is a generative probabilistic model that discovers latent topics in a document corpus. LDA assumes that a document of $N$ words $\mathbf{w} = \{w_1, w_2, \cdots, w_N\}$ is generated as follows: 1) a per-document distribution over topics $\theta \in \mathbb{R}^K$ is first generated from a Dirichlet distribution $\text{Dirichlet}(\alpha)$, where $\alpha \in \mathbb{R}^K$ is the Dirichlet prior, $\alpha_k \geq 0$ ($k = 1, \cdots, K$) and $K$ is the given number of topics; 2) for each word $w_i$ in the document, a topic $z_i$ is generated from a multinomial distribution $\text{Mult}(\theta)$; 3) a word distribution $\phi_{z_i} \in \mathbb{R}^V$ over topic $z_i$ is generated from a Dirichlet distribution $\text{Dirichlet}(\beta)$, where $\beta \in \mathbb{R}^V$ is the Dirichlet prior, $\beta_j \geq 0$ ($j = 1, \cdots, V$) and $V$ is the number of words in the vocabulary; 4) given $\phi_{z_i}$, word $w_i$ is generated from a multinomial distribution $\text{Mult}(\phi_{z_i})$. LDA assumes all the words in a document are independent given their topics $z_i$, and all the documents in the corpus are independent. Estimation of $\theta$ and $\phi$ via maximum likelihood methods will enable document topics and the most probable words over the topics.
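The four-step generative process described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the implementation used in this study: the names `sample_dirichlet`, `generate_corpus`, `alpha`, `beta`, `theta` and `phi` are hypothetical, the topic-word distributions are drawn once and shared across the corpus, and the prior entries are assumed to be strictly positive because `random.gammavariate` requires it.

```python
import random


def sample_dirichlet(prior, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma(prior_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in prior]  # requires every prior entry > 0
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, n_words, alpha, beta, seed=0):
    """Generate a toy corpus following the LDA generative process.

    alpha : Dirichlet prior over topics (length = number of topics)
    beta  : Dirichlet prior over words  (length = vocabulary size)
    """
    rng = random.Random(seed)
    n_topics, vocab_size = len(alpha), len(beta)
    # Step 3: one word distribution per topic, shared by all documents (assumption)
    phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Step 1: per-document distribution over topics
        theta = sample_dirichlet(alpha, rng)
        words, topics = [], []
        for _ in range(n_words):
            # Step 2: draw a topic for this word position
            z = rng.choices(range(n_topics), weights=theta)[0]
            # Step 4: draw the word from the chosen topic's word distribution
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            topics.append(z)
            words.append(w)
        corpus.append({"theta": theta, "topics": topics, "words": words})
    return corpus, phi


# Toy usage: 3 documents of 5 words each, 2 topics, 6-word vocabulary
corpus, phi = generate_corpus(n_docs=3, n_words=5, alpha=[0.5, 0.5], beta=[0.5] * 6)
```

Fitting LDA reverses this process: given only the observed words, it estimates the per-document topic distributions and the per-topic word distributions.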