Machine Learning Evaluation of the Echo-Chamber Effect in Medical Forums
Victoria Bobicev, Marina Sokolova
2020-10-19

We propose an Echo-Chamber Effect (ECE) assessment of an online forum. Sentiments perceived by the forum readers are at the core of the analysis; a complete message is the unit of study. We build 14 models and apply them to represent discussions gathered from an online medical forum. We use four multi-class sentiment classification applications and two Machine Learning algorithms to evaluate the prowess of the assessment models.

Sentiment propagation and sentiment influence have been subjects of Sentiment Analysis, an advanced field of Natural Language Processing. High sentiment correlation was found in messages posted in online forum discussions (Weroński et al, 2012). Studies of hyperlink connections in blogs have shown that connections are strongly influenced by immediate posts but that further influence steeply declines (Miller et al, 2011). Zafarani et al. (2010) studied sentiment propagation in a LiveJournal dataset. They concluded that sentiment propagation in a user's network positively correlates with the number of friends and negatively correlates with the number of posts and the prolificacy of the user's friends. Persistence of positive and negative expressions may vary considerably: rapidly-fading information contains significantly more words related to negative emotions, whereas persistent information contains more words related to positive emotions (Wu et al, 2011; Hansen et al, 2011).

We propose ECE models that use three categories of message parameters: i) sentiments perceived by the message annotators; ii) the author's activity in the discussion and on the forum; and iii) the message's posting within the discussion (Fig 2). Noticeably, our models do not use the textual content of the messages.

Preliminary studies of the data identified peer-to-peer support, patient empowerment, and patients' uncertainty as the major reasons for participation in the discussions. Manual annotation of the IVF data set resulted in 5 sentiment categories: 'confusion', 'encouragement', 'gratitude', 'factual', and 'endorsement', a transitional category between 'factual' and 'encouragement'. Annotators used the reader perception model; (Sokolova & Bobicev, 2013) provide details of the annotation process of the IVF data. Those sentiments cover, albeit not exclusively, complex psychological factors of infertility (Hocaoglu, 2018). For the current study, each post was annotated by two independent raters using the pre-defined categories 'confusion', 'encouragement', 'gratitude', 'factual', and 'endorsement'. The annotators reached a strong agreement, with Fleiss Kappa = 0.737. Posts assigned two different categories were considered 'ambiguous'.
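The agreement statistic above can be reproduced with standard tooling. A minimal sketch using statsmodels, assuming the two raters' labels are available as parallel lists (the data shown are hypothetical):

```python
# Fleiss' kappa for the two-rater annotation; the label lists are hypothetical.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

CATEGORIES = ["confusion", "encouragement", "gratitude", "factual", "endorsement"]

rater_a = ["gratitude", "factual", "encouragement", "factual"]
rater_b = ["gratitude", "factual", "confusion", "factual"]

# Rows: posts; columns: the two raters' category indices.
coded = [[CATEGORIES.index(a), CATEGORIES.index(b)]
         for a, b in zip(rater_a, rater_b)]

# Convert (posts x raters) codes into a (posts x categories) count table.
counts, _ = aggregate_raters(coded, n_cat=len(CATEGORIES))
print(f"Fleiss Kappa = {fleiss_kappa(counts):.3f}")
```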
We looked at three factors of an author's activity on the forum: 1) initiation of new discussions; 2) the total number of messages posted by the author; 3) contribution to discussions initiated by other authors. The authors who start discussions (aka first authors) actively participated in the discussions they initiated and guided them in the direction they needed. Only in 10% of discussions did the first author not participate after initiating it. On average, 25% of messages per discussion were posted by the author of the first post.

We identified a group of more active, or prolific, authors who posted approximately 7 times more posts than an average author. We estimated the "prolificity" of an author as the ratio of the author's total number of posts to the total number of posts of the most prolific author in the studied data (Patil et al, 2013). Thus, prolificity ranges over [0, 1], and the participant with the greatest number of posts has prolificity equal to 1. In our data, the average prolificity of the prolific authors is 0.44, while the overall average prolificity is 0.06.

We hypothesize that sentiments of messages posted by an interlocutor already involved in a discussion can be identified more easily than sentiments in a message posted by a person just joining the thread. Thus, we introduced a category of discussion newcomers, i.e., authors contributing to a discussion for the first time.

Our ECE models factor in the message's posting within the discussion, as we hypothesize that the posting position can affect readers' perception of the message sentiment. For example, readers may not strongly perceive sentiments expressed at the beginning of a discussion and then better "tune" to sentiments as the discussion progresses. In earlier work, we reported that annotators assigned different sentiments to 26% of the first posts and to 16% of the last posts, whereas only 13% of all the posts were labeled ambiguous, i.e., were annotated with different sentiments. Our current models test this hypothesis by differentiating among three types of postings: beginning (1st post), end (the last post), and middle of the discussion (all the other posts).

Our first goal was to demonstrate that there are sentiment patterns in forum discussions and that the sentiments mutually influence each other. Hence, we built a representation which reflected sentiment transitions in discussions. Having two annotation labels for each post, we decided to use them both as parameters rather than merge them. This allowed us to disambiguate the ambiguous label, which appeared when the two annotators selected different sentiment labels for a post:
- Model I: 4 categorical parameters. We represented each post through the two labels (one per annotator) assigned to the previous post and the two labels assigned to the following post. For the first and last posts, we used the value "none" as a proxy for absent labels.
- Model II: 4 categorical parameters + 3 binary parameters = 7 parameters, where the four categorical parameters are the annotation labels (Model I), and the three binary parameters show whether the previous, current, and next messages are first, middle, or last ones.
- Model III: 4 categorical parameters + 2 binary parameters + 1 numerical parameter = 7 parameters, where the four categorical parameters are the annotation labels (Model I) and the three other parameters represent the author's activity (author's parameters): a binary parameter f indicating whether the author of the post is the one who started the discussion; a binary parameter n indicating whether the author posted in this discussion for the first time; and a numerical parameter pr containing the author's "prolificity", calculated as described in Section 4.2 (a computational sketch follows this list). Note that these parameters are independent; f and n can simultaneously be true.
- Model IV: 4 categorical parameters + 3 binary parameters + 3 author's parameters = 10 parameters, where the four categorical parameters are the annotation labels (Model I), the three binary parameters represent the message posting options (i.e., first, middle, or last) (Model II), and the three author's parameters are the same as in Model III.
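A minimal sketch of how the three author's parameters (f, n, pr) could be derived from a post stream; the Post structure and its field names are hypothetical stand-ins for the forum data:

```python
# Computing the author-activity parameters f, n, pr of Model III;
# the Post structure is a hypothetical stand-in for the forum data.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Post:
    discussion_id: int
    author: str

def author_parameters(posts):
    """Yield (f, n, pr) for each post, taken in posting order."""
    totals = Counter(p.author for p in posts)          # posts per author
    max_total = max(totals.values())                   # the most prolific author
    initiator = {}                                     # discussion -> first author
    seen = set()                                       # (discussion, author) pairs
    for p in posts:
        initiator.setdefault(p.discussion_id, p.author)
        f = p.author == initiator[p.discussion_id]     # started this discussion
        n = (p.discussion_id, p.author) not in seen    # newcomer to the thread
        seen.add((p.discussion_id, p.author))
        pr = totals[p.author] / max_total              # prolificity in [0, 1]
        yield f, n, pr
```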
We were interested in the impact of longer sequences of sentiment transitions on the post's sentiment. To assess this impact, we built the following models:
- Model V: 8 categorical parameters. We represented the post by the four labels assigned by the annotators to the two previous messages and by the four labels assigned to the two following messages.
- Model VI: 8 categorical parameters + 3 binary parameters = 11 parameters, which combines the sentiment labels of Model V and the message posting options within the discussion.
- Model VII: 8 categorical parameters + 3 author's activity parameters = 11 parameters, which combines the sentiment labels of Model V and the author's activity indicators.
- Model VIII: 8 categorical parameters + 3 binary parameters + 3 author's parameters = 14 parameters. This model combines all three characteristics: the sentiment labels (Model V), the message posting indicators, and the author's activity indicators.

The following models use three parameters each. Here, we eschew the sentiment labels; thus, we remove the annotators' bias and rely only on the message posting options and the author's activity indicators:
- Model IX: 3 binary parameters that represent the message's posting options (described in Model II).
- Model X: 3 author's parameters, as described in Model III.
- Model XI: 3 binary parameters + 3 author's parameters = 6 parameters.

Next, we assume that the author's activity can affect annotators as well (e.g., they recognize messages of active authors). Thus, the following three models probe a possible bias of the sentiment labels through the author's parameters. We added each of the author's parameters to Model I in turn:
- Model XII: 4 categorical parameters + 1 binary parameter showing whether the message author is the first author who started the discussion = 5 parameters.
- Model XIII: 4 categorical parameters + 1 binary parameter indicating that the author has just joined the discussion = 5 parameters.
- Model XIV: 4 categorical parameters + 1 numerical parameter representing the author's prolificity = 5 parameters.

Table 1 lists the models and the number of parameters each model uses.

Table 1: The ECE models and their parameters (counts reconstructed from the model descriptions above).
Model:             I   II  III  IV   V   VI  VII VIII  IX   X   XI  XII XIII XIV
Sentiment labels:  4    4    4   4   8    8    8    8   0   0    0    4    4   4
Posting (binary):  0    3    0   3   0    3    0    3   3   0    3    0    0   0
Author activity:   0    0    3   3   0    0    3    3   0   3    3    1    1   1
Total:             4    7    7  10   8   11   11   14   3   3    6    5    5   5

We apply four settings of multi-class classification of the IVF data set. In those ML settings, the data is classified into six, five, four, and three sentiment categories respectively; Sec. 4.1 provides the details. In every multi-class classification setting, each model defines one multi-class sentiment classification task. As a result, we have 14 x 4 = 56 supervised Machine Learning (ML) tasks to evaluate the performance of the ECE models. We apply each model to all messages in the discussion threads. We then separately use Support Vector Machine (SVM) and Conditional Random Fields (CRF), with 10-fold cross-validation, to classify the sentiments of every message. Thus, we conduct 112 experiments. Precision(macro) serves as the model prowess criterion. It is the average of the Precisions calculated for each sentiment category, where all the categories are treated equally:

Precision(macro) = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k},

where TP_k indicates the number of correctly recognized messages labeled with the sentiment category k, FP_k is the number of messages incorrectly assigned to the sentiment category k, and K is the number of sentiment categories. It signifies the per-class agreement of the data sentiment labels with those obtained by the combination of an ECE model and a classifier.
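A minimal sketch of this measure, directly following the definition above (the label lists are hypothetical):

```python
# Precision(macro): per-category Precision averaged over the K categories.
from collections import Counter

def macro_precision(gold, predicted):
    categories = sorted(set(gold) | set(predicted))
    tp = Counter()  # correctly recognized messages, per category
    fp = Counter()  # messages incorrectly assigned to a category
    for g, p in zip(gold, predicted):
        if g == p:
            tp[p] += 1
        else:
            fp[p] += 1
    # Every category contributes equally, regardless of its frequency;
    # a category never predicted contributes a Precision of 0.
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in categories]
    return sum(per_class) / len(per_class)

print(macro_precision(["factual", "gratitude", "factual"],
                      ["factual", "factual", "factual"]))  # prints 0.333...
```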
We work with data obtained from an In Vitro Fertilization (IVF) website. The data were introduced in (Sokolova and Bobicev, 2013) and are available on demand. We analyze discussions from the IVF Ages 35+ sub-forum.
- The data set consists of 1321 messages written by 359 female authors and posted in 80 discussions. The average discussion length is 16.5 posts (s.t.d. = 9.6); the average number of participants in one discussion comes to 9-10 persons (s.t.d. = 4). The average post had 750 characters and 5-10 sentences.
- Discussants form a relatively narrow group: the 359 female authors contributed to the 80 topics we were working with. We denoted 15 authors as the most prolific; they posted almost 45% of the posts. 172 persons posted only one message each.
- There were 73 first authors in the 80 annotated threads, 5 of whom started 2 topics and one started 3. The first author who started a thread was usually rather active in the initiated topic and posted on average 4 messages in the started thread.
- In each thread, on average 9.5 posts were written by participants who joined the discussion for the first time. These messages formed around 64% of a topic's posts.

We constructed four multi-class categorizations of the data set, with six, five, four, and three sentiment categories respectively. We apply Support Vector Machines (SVM; the logistic model, exhaustive search among normalized poly kernels 1-5 and soft margins 1-5, WEKA toolkit) and Conditional Random Fields (CRF; Maximum Likelihood, Mallet toolkit). SVM has shown reliable performance in sentiment analysis of social networks. At the same time, we expect CRF to benefit from feature sets that are sequences of possibly dependent random variables. The best classifiers were found through 10-fold cross-validation.

We provide the results obtained with the Bag-of-Words (BOW)/unigram model as our benchmark. The total unigram count was 7787 unique words. Words with occurrence 1 and 2 were mostly out-of-vocabulary; we did not use them in the text representation. As a result, we used the 3302 unigrams with occurrence ≥ 3 to represent the messages.

We compare the model performance through ranking. For each classification task, we rank the obtained Precisions in descending order. The ECE model with the lowest total rank shows the most reliable performance. Model IX provided the best ECE assessment when the data sets were classified by SVM. The model's total rank was 8: rank 1 in the 6-class, rank 3 in the 5-class, and rank 2 in both the 4-class and 3-class classifications. Model IX has 3 binary parameters. Models I and XII provided a tied-rank assessment when the data sets were classified by CRF. Their total ranks were 7: for Model I, rank 2 in the 6-class, rank 3 in the 5-class, and rank 1 in both the 4-class and 3-class classifications; for Model XII, rank 1 in the 6-class and 5-class, rank 2 in the 4-class, and rank 3 in the 3-class classifications. Model I has 4 categorical parameters; Model XII has 4 categorical parameters and 1 binary parameter.
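A minimal sketch of this rank-based comparison; the Precision values below are hypothetical placeholders, and ties receive averaged ranks as in the tables that follow:

```python
# Ranking ECE models by Precision(macro) across classification tasks;
# the Precision values are hypothetical placeholders.
from scipy.stats import rankdata

tasks = {
    "6-class": {"I": 0.41, "IX": 0.43, "XII": 0.40},
    "5-class": {"I": 0.50, "IX": 0.49, "XII": 0.52},
}

models = sorted({m for scores in tasks.values() for m in scores})
total_rank = dict.fromkeys(models, 0.0)
for scores in tasks.values():
    # Rank in descending order of Precision: rank 1 is the best model;
    # method="average" assigns tied models the mean of their ranks.
    ranks = rankdata([-scores[m] for m in models], method="average")
    for m, r in zip(models, ranks):
        total_rank[m] += r

best = min(total_rank, key=total_rank.get)  # the lowest total rank wins
print(total_rank, "->", best)
```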
Below, Table 2 and Fig. 2 report the Precision provided by the ECE models in SVM classification, and Table 3 and Fig. 3 report the ECE model ranking based on SVM. Table 4 and Fig. 4 report the Precision provided by the ECE models in CRF classification, and Table 5 and Fig. 5 report the ECE model ranking based on CRF.

Our ECE analysis is based on three factors: who expressed the message sentiments (the authors and their activity), why those sentiments appear in the message (the sentiments of adjoining messages as perceived by readers), and where the sentiment-bearing message appears (the message's posting option in the discussion). Whereas many sentiment analytics studies rely on textual features (Poria et al, 2020), some studies eschew them, e.g., Liu et al (2017). Our models do not use the textual content of the messages. This lexicon-independent approach makes the models suitable for the online environment, where lexical aspects significantly depend on the sociodemographic profile of the message author (Hilte et al, 2020).

Selection of performance evaluation metrics is an essential part of Machine Learning applications, where measures can enhance the study's impact or diminish it (Flach, 2019). We use Precision as the main evaluation measure. Whereas Recall captures, for each sentiment category, the proportion of its messages that were found, Precision captures the proportion of messages assigned to a category that actually belong to it. A higher Precision means that the model identifies the target sentiment with fewer errors, making Precision more useful than Recall when we assess ECE. We provide Precision, Recall, and F-score results in the Appendix.

We compare sentiment classification results provided by multiple models (14 models and a benchmark model) on four data sets. Omnibus statistical tests that compare the performance of multiple algorithms on multiple data sets, e.g., the Friedman test and ANOVA, rely on the number of data sets being larger than the number of classifiers (García et al, 2010; Stapor, 2017). Thus, we provide a relative comparison of the ECE models by ranking the classification results obtained with the use of those models. Model IX, with 3 binary parameters of the message's posting options, achieved the best ranking in SVM classification. Model I, with 4 categorical sentiment parameters, and Model XII, with 4 categorical sentiment parameters and 1 binary parameter of the author's activity, share the best ranking in CRF classification.

We have focused on sentiments shared among the discussion participants. At the same time, sharing of sentiments is only a part of the Echo-Chamber Effect. Selective usage of information and information omission also contribute to it. Avoidance of opposing information can be achieved through physical distancing, inattention, biased interpretation, forgetting, and self-handicapping (Golman et al, 2017). Investigation of physical avoidance, forgetting, and self-handicapping may be unfeasible: to study those phenomena, e.g., who does not participate in the forum and why, we would have to conduct extensive and rather expensive surveys. At the same time, inattention and biased interpretation can be addressed by extending our analysis to the textual message content and applying advanced semantic and pragmatic tools to it.

We assess the Echo-Chamber Effect by applying supervised multi-class sentiment analysis.
We list the novelty factors of our study: i) it extends ECE analysis to online forums, thus going beyond Twitter data; ii) we assess the ECE models under four choices of multi-class sentiment classification, thus providing an evaluation of every model under different sentiment settings; iii) our study contributes to the expansion of ECE analysis in the domain of online medical discussions.

We proposed an ECE assessment of online forums that relies on readers' perception of the expressed sentiments. Using data sets collected from a medical forum, we compared the performance of SVM-based and CRF-based classifiers on 14 ECE models. We empirically evaluated those models in 6-, 5-, 4-, and 3-class classification tasks. Our empirical results show that ECE can be reliably evaluated without a tedious lexical study of the message content. The best obtained Precision, with the benchmark Precision in brackets: in SVM classification, 0.425 for 6 classes (0.415), 0.520 for 5 classes (…).

For future work, we plan to collect data from another medical forum and replicate the study on the new data set. Also, we applied a supervised learning approach which relies on the data set being fully manually annotated. The vast and ever-growing volume of messages posted on social media makes the availability of fully annotated data unrealistic. To reduce the dependency on manual annotation, we plan to transition to semi-supervised learning. In future work, we also plan to address information avoidance and its tactics, such as inattention, biased interpretation, and omission, by including textual content analysis in our research.

For the SVM-based classification, the best F-scores slightly outperformed BOW's F-score for 6, 5, and 4 classes, and came almost equal for 3 classes (0.633 vs 0.634). For CRF-based classification, BOW's results were significantly lower than those of SVM; CRF's best results considerably outperformed those obtained on the BOW representation. Tables A2-A7 report the results obtained on the models introduced in Section 3. Tables A2-A4 report SVM's performance, and Tables A5-A7 report CRF's.

For SVM, the best F-score is obtained when the model parameters include the sentiment labels, the author activity information, and the information about the post's position in the discussion: for the 6-class, 5-class, and 4-class classifications, the best F-scores were achieved when all the parameters were used (Model VIII); for the 3-class classification, the best F-score is obtained without the post's position parameters (Model VII). Leaving out the sentiment labels considerably decreased the classification results (Table A3). The best F-score = 0.253 was computed on the three author's activity parameters (Model X); adding the message position parameters (Model XI) actually decreased the results (F-score = 0.271). Enhancing the sentiment labels with singular indicators of the author's activity (Table A4) showed inconclusive results when compared with the sentiment labels enhanced with the three author's activity parameters (Model III). When the sentiment features were augmented with the "first author" indicator (Model XII), the F-score improved for all the problems. CRF-based classification obtained a higher F-score than SVM on all four classification tasks. However, the CRF results show that its performance is highly volatile and depends on messages being represented through the sentiment labels of the immediately preceding and following messages.
For 6, 5, and 3 classes, the overall best F-score and the 2nd best F-score are achieved either on the sentiment labels augmented with the "first author" indicator (Model XII, Table A7) or on the four sentiment labels alone (Model I, Table A5). For 4 classes, the best and second-best F-scores are achieved on the sentiment labels augmented with the "first author" and the "newcomer" indicators respectively (Models XII and XIII, Table A7). Representing messages without sentiment labels decreased CRF's ability (Table A6): for 6 classes, the best F-score = 0.249 (Model XI) is lower than any other F-score for that classification problem; the same holds for 5, 4, and 3 classes.

Table A7: Classification results for CRF on Models XII-XIV. For each task, the highest measure is in bold and the lowest in italic; * indicates CRF's best F-score for the problem.

References
Longman grammar of spoken and written English.
What Goes Around Comes Around: Learning Sentiments in Online Medical Forums.
No Sentiment is an Island.
Thumbs up and down: Sentiment analysis of medical online forums.
Echo chambers on social media: A comparative analysis.
Echo Chamber or Public Sphere?
Predicting Political Orientation and Measuring Political Homophily in Twitter Using Big Data (2014).
Falling into the echo chamber: the Italian vaccination debate on Twitter. AAAI Conference on Web and Social Media.
On the contribution of discourse structure on text complexity assessment.
The echo chamber is overstated: the moderating effect of political interest and diverse media. Information, Communication & Society.
Are online news comments like face-to-face conversation? A multidimensional analysis of an emerging register.
Performance Evaluation in Machine Learning: The Good, the Bad, the Ugly, and the Way Forward.
Quantifying Controversy in Social Media.
Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power.
Information avoidance.
Good friends, bad news: affect and virality in Twitter.
Lexical Patterns in Adolescents' Online Writing: The Impact of Age.
The psychosocial aspect of infertility.
Spreading (dis)trust: Covid-19 misinformation and government intervention in Italy.
Digital literacy linked to engagement and psychological benefits among breast cancer survivors in Internet-based peer support groups.
Grounded emotions.
Quantification of Echo Chambers: A Methodological Framework Considering Multi-party Systems.
Algorithmic detection and analysis of vaccine-denialist sentiment clusters in social networks.
Predicting Group Stability in Online Social Networks.
Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research.
The social sharing of emotion (SSE) in online social networks: A case study in LiveJournal.
Learning Relationship between Authors' Activity and Sentiments: A case study of online medical forums.
What sentiments can be found in medical forums? RANLP.
Evaluating and comparing classifiers: Review, some recommendations and limitations.
SemEval-2007 task 14: Affective text. SemEval-2007.
Emotional and psychosocial risk associated with fertility treatment.
Discourse structure and computation: past, present and future.
Emotional analysis of blogs and forums data.
Does bad news go away faster?
Giving and receiving emotional support online: Communication competence as a moderator of psychosocial benefits for women with breast cancer.
Sentiment Propagation in Social Networks: A Case Study in LiveJournal.
Debunking in a world of tribes.

Appendix

We report the macro measures Precision, Recall, and F-score obtained in the ECE empirical assessment. All the obtained F-scores outperformed the majority-class baselines; refer to Sec. 4.1 for the details. To put the classification results in perspective, we use Bag-of-Words (BOW), i.e., the unigram representation, as our benchmark. The total unigram count was 7787 unique words. Words with occurrence 1 and 2 were mostly out-of-vocabulary (OOV) words, e.g., misspellings, typos, and non-standard abbreviations. We removed the OOV words while building the BOW. As a result, our BOW used the 3302 unigrams with occurrence ≥ 3 to represent the messages. Table A1 reports the classification results. Note that the SVM results on BOW significantly outperform CRF (P = 0.0031).
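A minimal sketch of such a BOW benchmark in scikit-learn terms (the paper itself used the WEKA and Mallet toolkits); the function and its inputs are hypothetical, and min_df approximates the occurrence cut-off via document frequency:

```python
# BOW/unigram benchmark: rare unigrams dropped, 10-fold CV, macro Precision.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def bow_benchmark(messages, labels):
    """Mean 10-fold macro Precision of a poly-kernel SVM over unigram counts."""
    pipe = make_pipeline(
        CountVectorizer(min_df=3),     # drop rare, mostly OOV unigrams
        SVC(kernel="poly", degree=2),  # one point of the kernel grid in Sec. 4
    )
    scores = cross_val_score(pipe, messages, labels, cv=10,
                             scoring="precision_macro")
    return scores.mean()
```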