key: cord-0649569-wf8y55xk authors: Abhari, Rod; Vincent, Nicholas; Dambanemuya, Henry K.; Bodon, Herminio; Horvát, Emőke-Ágnes title: Twitter Engagement with Retracted Articles: Who, When, and How? date: 2022-03-08 journal: nan DOI: nan sha: 08daff95234d2a800b06c8136035c00c3deaa9fa doc_id: 649569 cord_uid: wf8y55xk

Retracted research discussed on social media can spread misinformation, yet we lack an understanding of how retracted articles are mentioned by academic and non-academic users. This is especially relevant on Twitter due to the platform's prominent role in science communication. Here, we analyze the pre- and post-retraction differences in Twitter engagement metrics and content of mentions for over 3,800 retracted English-language articles alongside comparable non-retracted articles. We subset these findings according to the five user types detected by our supervised learning classifier: members of the public, scientists, bots, practitioners, and science communicators. We find that retracted articles receive greater overall engagement than non-retracted articles, especially among members of the public and bot users, with the majority of engagement happening prior to retraction. Our results highlight non-scientists' involvement in retracted article discussions and suggest an opportunity for Twitter to include a retraction notice feature.

These findings improve our understanding of how retracted articles are mentioned and engaged with on Twitter, highlighting the potential of the platform to host critical discussions about problematic scientific findings. At the same time, the prevalence of non-scientists in retracted article discussions signifies the broader reach of retracted articles on digital media, and thus the importance of paying greater attention to how retractions are discussed online. Finally, they invite considerations about how design choices on platforms like Twitter can foster positive scientific exchanges, e.g., by incorporating a retraction notice feature. Ultimately, given the high stakes of scientific misinformation in a time of global pandemics, conscious attention must be paid to the dissemination of retracted findings on social media.

In this section we provide a literature survey describing the characteristics and consequences of science communication on social media. Social media have emerged as key platforms for the public to access and discuss scientific information [10]. However, it is unclear whether media designed to be broadly accessible and immediately gratifying can support the nuance necessary to communicate science [11, 12]. The limited attention economy of social media often produces overly simplified or misleading interpretations of source material [13]. Additionally, individuals in polarized "echo chamber" environments pay selective attention to belief-confirming science and make judgements of scientific credibility based on how well the science conforms to their prior partisan beliefs [14, 15]. Thus, in a health crisis characterized by unreliable information, online social media have become rife with scientific misinformation [16]. Social media platforms and scientists alike have responded to these problems in various ways. Digital literacy efforts such as Google's "Be Internet Awesome" help users to spot inaccurate or misleading science, while a recent book teaches readers the literacy skills necessary to spot problematic claims, including pseudoscience [17, 18].
Finally, emerging practices in journal publishing, including translational abstracts and public significance statements, help individuals interpret science without relying on third-party sources. Yet it remains an open question whether these efforts can address the fundamental reasons for science misinformation, including brevity, sensationalism, and partisan motivations, without changes to how science is communicated [19].

Retractions became an academic convention in the 18th century, at a time when science was produced and shared among a small community of sympathetic elites [20]. Accordingly, there appears to have been no serious concern that exposing the seamy underbelly of science would damage public trust in the scientific enterprise. This assumption may need to be reconsidered for the broad public forums of social media. Although retraction notices were designed to correct the public record, they can also give visibility to flawed research that was previously out of view. A recent study of cross-platform attention to retractions found that most retractions are issued after uncritical attention to the article's findings has been exhausted, limiting the retraction's corrective potential [21]. Further, gaps in both scientific and media literacy make it more difficult for individuals not only to recognize retracted research, but to know how to process it when it is explicitly categorized as such [22]. There is also the possibility that retraction notices may cause individuals invested in an article's original findings to "double down" on their beliefs once faced with the retraction [23]. If the discussion around a retracted article centers partisan perspectives, individuals may interpret the information in ways which confirm their existing worldviews [24]. Addressing the root causes of scientific misinformation on social media requires a closer look at the contexts in which science, including retracted science, is shared and discussed.

Since 2009, the number of retractions has roughly tripled, and today, about 4 in 10,000 articles are retracted [25, 26]. While the reasons for retraction vary, in every case, retractions indicate that an article should no longer be considered legitimate research [27]. In the scientific community, this belief is expressed through the taboo against citing retracted research. However, no clear norm exists for online mentions, which may be used to indicate agreement or disagreement, ask for greater clarity, or simply promote the article. Research on non-retracted articles reveals a wide range of purposes and user types for science-mentioning tweets. One survey found that the majority of article mentions on Twitter come from bot accounts affiliated with science organizations that exist to disseminate new science quickly and, typically, without commentary [28]. However, the presence of bots varies substantially by discipline: bots produce 64% of tweets in the natural sciences but only 20% in the humanities and social sciences. Another survey found that over 80% of tweets are purely descriptive and express no discernible stance towards the mentioned article, though the user types in this survey were unspecified [29]. The presence of extra-topical factors also influences the amount of Twitter attention an article receives. Prior research has found that retractions based on research misconduct attract substantially more online attention than retractions due to error, while the opposite is true of post-retraction journal citations [30, 31].
However, neither study performed user or content analysis of the tweets. Although the non-academic public may be more attentive to the controversy generated by retractions due to misconduct, it is also plausible that even academic users are more likely to tweet about controversial research. Thus the relationship between online article mentions and perceptions of scientific credibility is largely unknown. Non-academic users and users from different disciplines bring a diversity of knowledge that can help identify errors or inconsistencies which may otherwise be missed [32, 33]. This possibility is supported by recent research, which demonstrates how early critical discussions of two COVID-19 articles on Twitter detected problems which were later cited as reasons for retraction [1]. When problems are discovered, Twitter allows users to bring them directly to relevant entities, such as journal publishers, by mentioning or "tweeting at" the user accounts of these entities [34]. However, a comprehensive study on retractions found that retractions are generally issued after the initial uncritical attention to an article has been exhausted [21]. If retracted article discussion is limited to bot activity or tweets with no further engagement, then the effectiveness of Twitter discussions as a quality control measure will be limited. These discussions show that a greater understanding of how online audiences interpret and share retracted articles is necessary to evaluate Twitter's contribution to the diffusion of retracted science. With the prevalence of questionable science online, understanding the role of different platforms in disseminating retracted findings becomes increasingly important. Our research contributes to this effort by tracking the recorded engagements with over 3,800 research articles on Twitter with respect to the intersection of three factors: whether they occur before or after retraction, whether they contain retraction-related content, and what types of users produce the posts. Ultimately, this knowledge can inform efforts to curb the spread of science misinformation.

To understand how retracted articles are discussed on Twitter, we collected a set of tweets which mention retracted articles. To do so, we utilized two databases: Retraction Watch's retracted article database and the Altmetrics database of social media posts, which contains unique identifiers of online references to academic articles. We also used the Twitter API to obtain more complete tweet-level metadata based on the tweet ids included in the Altmetrics data set. The Retraction Watch database contains article-level information about retracted articles, including the journal, year, and DOI (Digital Object Identifier) of each article. Also included is the date of the corresponding retraction notice and the reason for retraction [35]. From the Altmetrics data that contains tweets published between June 6, 2011 and October 8, 2019, we selected those that mention a retracted article recorded in the Retraction Watch database [36]. We then used the Twitter API to obtain complete tweet information for each tweet that was still public in September 2021, meaning it had not been removed by the user or Twitter. Using this process, we obtained complete tweet information for tweets mentioning 3,847 articles (a little over half of the 6,868 retracted articles identified in the Retraction Watch data set).
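As an illustration of this hydration step, the following is a minimal sketch, assuming the tweet IDs extracted from the Altmetrics data are available as a list of strings and that a Twitter API v2 bearer token has been provisioned; the endpoint and field names follow the public v2 tweets-lookup interface, while the helper name and batching choices are ours.

```python
import requests

API_URL = "https://api.twitter.com/2/tweets"  # Twitter API v2 tweets lookup
BEARER_TOKEN = "YOUR_BEARER_TOKEN"            # placeholder credential

def hydrate_tweets(tweet_ids, batch_size=100):
    """Fetch full tweet objects for the given string IDs; tweets that have
    been deleted or made private are simply absent from the response."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    hydrated = []
    for start in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[start:start + batch_size]
        params = {
            "ids": ",".join(batch),
            "tweet.fields": "created_at,lang,public_metrics,referenced_tweets",
            "expansions": "author_id",
            "user.fields": "description",
        }
        resp = requests.get(API_URL, headers=headers, params=params)
        resp.raise_for_status()
        hydrated.extend(resp.json().get("data", []))
    return hydrated
```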
Of the 66,447 tweets, 657 included more than one article link in the tweet text, and thus were counted according to the number of unique mentions of articles in our data. With these tweets accounted for, our final sample contained 67,124 tweets about retracted articles.

Non-retracted Control Articles: In order to establish a baseline with which to compare retracted articles, we applied the same tweet retrieval method described above to a control set of non-retracted articles from a recent study on retractions [21]. Control articles were chosen with the following matching procedure: for each retracted article, Peng et al. calculated the time between publication and retraction and the number of tweets at the time of retraction. They then searched for a non-retracted article from the same journal with a comparable number of tweets after the same amount of post-publication time had passed. After running this procedure for tweets between 2011 and 2018, the control set consisted of 2,085 articles, the majority of which were exact matches in terms of number of tweets, i.e., each non-retracted article received the same number of tweets as one retracted article. With a total of 25,997 tweets, our control set is large enough to provide important context for many of our key results. Combined with the tweets mentioning retracted articles, this brought the overall number of tweets analyzed to 93,127, which includes tweets that mention at least one retracted or non-retracted control article.

Our research team exercised an abundance of caution when working with these digital trace data. We designed our research methods in accordance with our institution's Institutional Review Board (IRB) and the norms expressed within the Association of Internet Researchers Ethical Guidelines [37]. From the Twitter API, we collected user information for the purpose of identifying engagement and user types and only report results at the aggregate level. We additionally maintained users' privacy preferences by only collecting data from public accounts. Any user information that could possibly be identifying, namely the users' Twitter ID, their profile description, and the content of their shared tweets, has been stored within a secure cloud environment.

Engagement Categories: We used Twitter's API to get detailed engagement metrics about the number of times a tweet was "liked", "retweeted", and received a "reply" or "quote tweet". We then added the "reply" and "quote tweet" counts to obtain a total "response count" for each tweet. This gives us three distinct engagement variables per tweet observation: like count, retweet count, and response count. We consider each of these as a separate engagement metric for the purposes of our study and compare the pre- and post-retraction engagement for each metric separately.

Language: 74% of our tweets, replies, and quote tweets were reported by Twitter as being primarily written in English. The remaining tweets covered a wide variety of languages. For engagement dimensions, we considered all tweets; for content, we looked at various subsets based on keyword matching, so these analyses included only tweets that contained the English keywords.

We compared engagement with tweets created before a retraction occurred and those created after retraction in order to understand the use of Twitter in the dissemination of retracted articles. To do so, we used the retraction timestamp from Retraction Watch.
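As a concrete illustration of these engagement variables and the pre/post-retraction split, here is a minimal pandas sketch; the column and variable names are ours, and the toy rows stand in for the hydrated tweets and Retraction Watch dates rather than reproducing the study's data.

```python
import pandas as pd

# Toy stand-ins for the hydrated tweets (with their v2 public_metrics counts)
# and for the Retraction Watch retraction dates, keyed by DOI.
tweets = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/a", "10.1000/b"],
    "created_at": pd.to_datetime(["2017-03-01", "2018-06-15", "2016-01-10"]),
    "like_count": [4, 12, 0],
    "retweet_count": [2, 7, 1],
    "reply_count": [1, 3, 0],
    "quote_count": [0, 2, 0],
})
retraction_date = pd.Series(
    pd.to_datetime(["2018-01-20", "2019-05-02"]),
    index=["10.1000/a", "10.1000/b"], name="retracted_on",
)

# Response count = replies + quote tweets, as defined in the text.
tweets["response_count"] = tweets["reply_count"] + tweets["quote_count"]

# Label each tweet as created before or after the linked article's retraction.
tweets = tweets.join(retraction_date, on="doi")
tweets["period"] = (tweets["created_at"] >= tweets["retracted_on"]).map(
    {True: "post-retraction", False: "pre-retraction"}
)

# Mean engagement per metric, split by period.
print(tweets.groupby("period")[["like_count", "retweet_count", "response_count"]].mean())
```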
When describing attention dynamics around retraction, we performed a time-windowed analysis of engagement in fixed time frames around the retraction. Specifically, for windows of varying size k (from k = 1 day to k = 180 days), we compared engagement during the k days before a particular article was retracted to the k days after that article was retracted.

To further understand the content of tweets about retracted articles, we analyzed each tweet using an interpretable keyword look-up process. Specifically, we aimed to identify tweets that were substantively discussing an article's retraction using retraction-related keywords. To identify a list of relevant keywords, we used the "Reasons for Retraction" recorded in the Retraction Watch dataset. This list of words is shown in Table 1. We omitted the following keywords that were ambiguous: 'data', 'image', 'legal', 'salami', and 'notice'. We validated our keyword filtering approach by first dividing the dataset into tweets that were "retraction aware" (contained a retraction-related keyword) and those that were not (did not contain keywords). We then sampled 150 retraction aware tweets (representing 10% of all tweets with retraction-related keywords) and 150 tweets that were not retraction aware (representing about 1% of all tweets that do not contain retraction-related keywords). Two authors examined the English-language tweets within this sample (145/150 retraction aware and 140/150 not retraction aware tweets had English text). They labeled tweets to discern those that discussed the retraction and those that did not. The two annotators reached consensus in assigning these labels, which represent an evaluation of whether the tweet signals awareness of the retraction. In comparison to this ground truth based on human evaluation, 135 of the 145 tweets that the keyword-based matching found to be retraction aware were true positives (precision/positive predictive value = 93.1%) and 133 of the 140 tweets that did not contain retraction-related keywords were true negatives (negative predictive value = 94.8%). These measures suggest that such a keyword filtering approach can provide a reasonable proxy for which tweets are retraction aware according to humans (a minimal sketch of this filter and its validation appears below). As a proxy for substantive conversation about retractions, we measure the fraction of all tweets which mention any of the retraction-related keywords (hereafter called "keyword tweets"). Additionally, we divide the keyword-mentioning tweets into pre- and post-retraction groups. Finally, we measure the attention to keyword tweets by comparing the amount of engagement on them with the total engagement in our entire sample of tweets mentioning retracted articles (e.g., what fraction of all the likes in our dataset are attributable to tweets that mention the word "retract"). Although keyword tweets are merely a proxy for critical conversation about retracted vs. non-retracted articles, this simple approach allows us to evaluate the possibility of using Twitter to support meaningful discussion around research articles.

To classify user types, we rely on a combination of manual, rule-based, and machine learning approaches applied to 26,260 English user descriptions. Of these, 20,195 descriptions are of users who mentioned retracted articles and 6,065 are from users who mentioned control articles. To manually label user types based on their descriptions, an initial codebook with five user types (i.e., members of the public, bots, science communicators, scientists, and practitioners) was developed.
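The keyword look-up and its validation described above can be sketched as follows; the keyword list here is an abbreviated, illustrative stand-in for the full Table 1 list, and the function names are ours.

```python
# Abbreviated, illustrative subset of the retraction-related keyword list.
KEYWORDS = ["retract", "withdraw", "plagiar", "fabricat", "misconduct",
            "peer review", "error", "fake", "ethic"]

def is_retraction_aware(text: str) -> bool:
    """Flag a tweet as 'retraction aware' if its lowercased text contains
    any of the keywords (simple substring matching)."""
    text = text.lower()
    return any(kw in text for kw in KEYWORDS)

def validate(sample):
    """sample: list of (tweet_text, human_label) pairs from an annotated
    subset; returns the filter's precision and negative predictive value."""
    tp = fp = tn = fn = 0
    for text, human_says_aware in sample:
        predicted = is_retraction_aware(text)
        if predicted and human_says_aware:
            tp += 1
        elif predicted and not human_says_aware:
            fp += 1
        elif not predicted and not human_says_aware:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return precision, npv

# Illustrative usage with two made-up annotated tweets.
precision, npv = validate([
    ("this paper was just retracted for image manipulation", True),
    ("interesting new results on gut microbiota", False),
])
```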
These five user-type categories are inspired by Altmetrics' aggregate statistics about users who disseminate scientific articles online. One annotator labelled a set of 1,757 descriptions from users that tweeted about retracted articles and 525 descriptions from users that tweeted about non-retracted articles. To test the reliability of these labels, a second annotator coded 19.4% of the data. Comparing the labels on this subset resulted in a Cohen's κ of 0.789, which indicates good agreement. Additionally, a set of keywords associated with four user types was used to construct a rule-based model that assigned users to a specific type if their description contained any one of the keywords associated with that type (Table 2). The keywords, determined a priori, were among the most frequently observed keywords in the set of manually-labelled user descriptions. Instead of including the full set of frequently used keywords for each user type, we relied on conservative keyword lists so as to maximize the true positive labels (users who were correctly assigned by the rule-based method to belong to a certain type) and minimize false positive labels (users who were incorrectly assigned to belong to a certain type). This process yielded 7,757 rule-based labels for user types in the retracted group and 1,970 rule-based labels for user types in the control (i.e., non-retracted) group. Figure 1 provides summary counts of the number of users that belong to each one of the five user types for the retracted and control groups.

After combining the human-labelled and rule-based-labelled descriptions (i.e., 9,514 user descriptions in the retracted group and 2,495 user descriptions in the control group), we implemented and evaluated three supervised classifiers. Random Forests [38], Decision Trees [39], and Logistic Regression were trained on the task of classifying the five user types from a feature vector constructed from users' Twitter profile descriptions. Since a disproportionately larger number of scientist users was identified through the rule-based algorithm (see Figure 1), we under-sampled this group to maintain class balance. To create the feature vectors, we first built a vocabulary of the unique terms from all user descriptions combined. To reduce noise and the size of the vocabulary, we pre-processed the user descriptions by (1) converting all the words to lowercase, (2) removing numbers, white spaces, punctuation, and stop words, and (3) lemmatizing the remaining words to reduce their inflectional forms to a common base or dictionary form. Then we represented the pre-processed text using Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors. Specifically, we utilized the TfidfVectorizer implementation in the scikit-learn [40] Python library. The TF-IDF score of a term t in a user description d is given by Equation 1:

tfidf(t, d) = tf(t, d) · idf(t),     (1)

where tf(t, d) represents the number of times that the term t appears in the user description d, and idf(t) is measured by Equation 2:

idf(t) = log( n_d / df(d, t) ),     (2)

where n_d is the total number of user descriptions and df(d, t) is the number of user descriptions that contain the term t. Therefore, each user description is represented by a finite-length TF-IDF feature vector. We could not provide labels for users with no descriptions since their feature vectors are all zeros. Therefore, we treated all users without a description as a single, unknown user type.
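A minimal sketch of this classification pipeline, assuming lists of labelled description strings, is given below; it uses scikit-learn's TfidfVectorizer, the three model families named above, and 5-fold cross-validation, but the preprocessing is simplified (no lemmatization) and the class-balancing (under-sampling) step is omitted, so it is an outline rather than the authors' exact configuration.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def preprocess(description: str) -> str:
    """Lowercase, drop numbers and punctuation, collapse whitespace.
    (Lemmatization is omitted; stop words are handled by the vectorizer.)"""
    text = description.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def evaluate_classifiers(descriptions, labels, n_folds=5):
    """Return cross-validated accuracy for the three model families compared
    in the text, using TF-IDF features over user profile descriptions."""
    X = [preprocess(d) for d in descriptions]
    models = {
        "CART": DecisionTreeClassifier(random_state=0),
        "RF": RandomForestClassifier(random_state=0),
        "LR": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, clf in models.items():
        pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
        scores = cross_val_score(pipe, X, labels, cv=n_folds, scoring="accuracy")
        results[name] = np.mean(scores)
    return results
```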
While there is reason to believe that users without descriptions are likely new, bot, or less active accounts, we avoid making any broad claims about this group. From the feature vectors, we then evaluated each model's performance in predicting the ground truth labels using out-of-sample tests. To perform out-of-sample tests, we used 5-fold cross-validation and reported the model accuracy as the number of correct predictions divided by the total number of predictions. Finally, we used the best-performing model to classify the remaining user descriptions.

The mean engagement values are, across the board, higher among tweets that mention retracted articles (hereafter "retracted tweets") than equivalent averages from control articles' tweets ("control tweets"). For retracted tweets, the mean values for like, retweet, and response counts are 2.4, 1.66, and 0.49, whereas control tweets have a mean like count of 2.00, a retweet count of 1.43, and a response count of 0.29. We conducted a Mann-Whitney U test of independent samples for each variable, and found that the difference in retweet counts was not significant (p = 0.087), whereas the differences in like and response counts were significant (p < 0.05 for each). In other words, retracted articles have a higher expected value for engagement in general, but only by a small amount (i.e., less than one additional like on average). Additionally, all engagement metrics are correlated with each other (Pearson correlation between like and retweet counts = 0.8, like and response counts = 0.78, response and retweet counts = 0.67). However, engagement metrics have highly skewed distributions throughout, with a median value of zero for each metric. This suggests that the majority of the differences we observe come from a small subset of highly popular articles.

Next, we compared the types of tweets (reply, retweet, quote, or original tweets) between retracted and non-retracted articles and found the frequency of these types varied considerably between retracted and non-retracted control articles. Compared with the control, retracted articles featured 2% more replies, 4% more retweets, and 2% more quote tweets, but 8% fewer original tweets. In other words, while a majority of our total observations involved a user sharing an existing tweet or replying to another tweet mentioning an article, engagement with retracted articles was more likely to come from response tweets than from original tweets.
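The two-sample comparison reported above can be reproduced in outline with scipy's mannwhitneyu; the arrays below are illustrative stand-ins for per-tweet like counts, not the study's data.

```python
from scipy.stats import mannwhitneyu

def compare_engagement(retracted_values, control_values):
    """Two-sided Mann-Whitney U test between engagement counts of tweets
    mentioning retracted articles and tweets mentioning control articles."""
    stat, p_value = mannwhitneyu(retracted_values, control_values,
                                 alternative="two-sided")
    return stat, p_value

# Example with made-up like counts for a handful of tweets.
stat, p = compare_engagement([0, 0, 1, 3, 12, 40], [0, 0, 0, 2, 5, 9])
print(f"U = {stat:.1f}, p = {p:.3f}")
```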
The average retracted article in our dataset saw 8.8 tweets, 18.8 likes, 14.5 retweets, and 4.1 responses before retraction, and 3.1 tweets, 10.2 likes, 5.3 retweets, and 1.8 responses after retraction. In other words, the engagement was substantially higher before retraction along all dimensions. The control dataset is again useful as context: summing engagement for each control article across the entire data collection period (2011 to 2018), the average engagement per article was 5.6 tweets, 11.3 likes, 8.1 retweets, and 1.6 responses. This means that even before they were retracted, retracted articles saw more overall tweets, likes, retweets, and responses than control articles, and then received additional engagement after retraction. We further examined whether and when a "retraction engagement boost" occurred around retraction with a time window analysis. Figure 3 shows that when we consider tweets about an article that were posted within a fixed window around that article's retraction date, there is a period of time for which tweets about retracted articles receive more engagement after retraction compared to immediately before retraction. For instance, comparing the 10 day period after each retraction to the 10 day period before each retraction, the 1,415 tweets posted after retraction received 6,915 likes and 2,999 retweets, whereas the 1,284 tweets posted before retraction received 2,825 likes and 2,747 retweets. Even though more tweets were posted about an article before its retraction for many window sizes, the period after retraction has elevated like and retweet behavior. Figure 3 : Comparing the total amount of tweets, likes, retweets, and responses (quote tweets + retweets) in a fixed time window around each retraction. X-axis shows different window sizes. For instance, for window size 10 we count only tweets that were created 10 days before or 10 days after retraction. Y-axis shows the boost, i.e., it subtracts the engagement before retraction from engagement after retraction. Positive boost values indicate that engagement was more frequent after retraction. Indeed, despite the fact that many more tweets in our dataset were created before the corresponding article's retraction, the spike in post-retraction engagement led to total post-retraction likes being greater than pre-retraction likes for up to a 122 day window. Equivalent "intersection points" in Figure 3 occur at 25 days for total tweets, 27 days for responses, and 82 days for retweets. As described in the Data and Methods section, we used a keyword matching approach to find retraction aware tweets, i.e., those that contain retraction-related content. These keyword tweets made up 13.4% of all tweets that mentioned retracted articles. These tweets also received outsized engagement via likes: they were responsible for 22.1% of the total likes. By contrast, keyword tweets accounted for only 2.7% of the tweets mentioning control articles. This suggests that retracted articles did indeed see unique Twitter engagement compared to control articles in the form of retraction aware tweets, and these retraction aware tweets received additional engagement in form of likes. Fig. 4 shows the total contribution of likes and responses (quote tweets and replies) received by tweets containing certain retraction-related keywords, separated again based on whether they were created before or after retraction. 
Fig. 4 shows the total contribution of likes and responses (quote tweets and replies) received by tweets containing certain retraction-related keywords, separated again based on whether they were created before or after retraction.

Figure 4: Relative share of tweets, likes, and replies for tweets that contain one of the top 10 most common retraction-related keywords. The left column shows tweets created before retraction, the middle column shows the group of tweets created after the corresponding article was retracted, and the right column shows tweets about non-retracted control articles.

Keyword-mentioning tweets created before the retraction accounted for 3.8% of all tweets about retracted articles, but generated 5.9% of likes given to tweets about retracted articles. Looking at tweets created after retraction, keyword-mentioning tweets made up 9.6% of our observations but accounted for 16.2% of likes. The fact that the number of likes is higher than expected both prior to and following retraction suggests that keyword-mentioning tweets, in general, saw greater attention than tweets which do not contain these keywords (in particular, as shown in Figure 4, the word "retract"). Words other than "retract" (such as "ethic" and "peer review") contributed less to overall engagement measurements. These results also provide evidence that a subset of Twitter users discussed retraction-related topics before the linked articles were retracted, at higher rates than would be expected from our matched comparison data. Below, we further discuss how future design could respond to this phenomenon.

Using the TF-IDF feature vectors described above to represent the descriptions of users that tweeted about retracted articles, we achieved a user type classification accuracy of 0.86 with the Decision Trees (CART) classifier. The Random Forest (RF accuracy = 0.85) and Logistic Regression (LR accuracy = 0.82) classifiers had comparable performance. Across all user types but scientists, the Decision Trees classifier had lower classification errors compared to the other two models (Figure 5, right). When classifying users that tweeted about non-retracted articles, we observed similar results with the Decision Trees (CART accuracy = 0.82), Random Forest (RF accuracy = 0.80), and Logistic Regression (LR accuracy = 0.75) classifiers. Notwithstanding similarities in classification performance, users who tweet about retracted vs. control articles vary in terms of how the different machine learning models classify them. For example, we observed that the Logistic Regression model is biased towards classifying users as scientists and the Decision Trees model is biased towards classifying users as members of the public. Thus, to summarize how the different user types' tweets varied in terms of engagement (i.e., summed likes), time (i.e., before/after retraction), and mentions of retraction-related keywords, we rely on the Decision Trees inferences because of this model's relatively low misclassification rates across most user types.

It is important to note the differences between the percentages of user types who tweeted about retracted vs. control articles (Table 3). Based on user type classification labels obtained from the Decision Trees model, scientists were twice as likely to tweet about non-retracted articles (35.22%) as retracted articles (18.22%). Similarly, science communicators were four times as likely to tweet about non-retracted (21.07%) as retracted articles (5.27%).
Conversely, a three-and-a-half-times higher percentage of members of the public tweeted about retracted (23.77%) than non-retracted (6.52%) articles, indicating general public interest in retractions. In terms of which groups produced tweets that received likes, we saw large contributions from scientists (30.56% of likes) and the public (18.81% of likes). Table 3 also shows the percent contribution of pre/post-retraction tweets and keyword-mentioning tweets from each user type in the retracted group. Comparing the relative share of tweets produced by each inferred user type before and after retraction, we see a notable increase in tweets from bots (4.74% pre-retraction and 19.88% post-retraction) and a drop in tweets from scientists (19.18% vs. 15.10%). An even starker shift was noted for bots in terms of keyword mentions: bots produced a considerably higher percentage of keyword-mentioning tweets than any other group (41.24%). Other groups of users tended to contribute a higher percentage of non-keyword-mentioning than keyword-mentioning tweets (e.g., 24.92% vs. 16.31% for scientists and 8.27% vs. 3.87% for practitioners). Table 4 further shows the number of keyword-mentioning tweets per inferred user type among those who mentioned retracted articles. Scientists provided 16.06% of the keyword-mentioning tweets, including 370 tweets with the word "retract". The most striking contributors of keyword-mentioning tweets are bots, which produced 1,479 tweets mentioning the word "retract", yet virtually no mentions of the other keywords. This emphasizes the singular purpose of many bots in calling attention to the fact of retraction. Other retraction-related keywords were less frequent and had comparable overall mentions by members of the public and scientists.

Table 4: Number of keyword-mentioning tweets per inferred user type among tweets mentioning retracted articles. Each row lists the total for that keyword followed by the per-user-type counts, with bots and users with no description in the final two columns.

Keyword        Total   (per-type counts)         Bots    No description
retract        2,822   370   370   100    93     1,479   410
peer review      212    61    64     9     7         3    68
withdraw         163    50    50    10    13         2    38
fake             112    34    20     7     4         2    45
error             77    23    19     8     7         2    18

5 Discussion

Compared with non-retracted control articles, retracted articles were characterized by greater overall engagement in terms of the number of tweets, likes, retweets, and responses to mentioning tweets, even prior to retraction. A key finding of our descriptive study, however, is that this engagement varied by user types and retraction-related content within the tweets. Using retraction-related keywords, we identified a set of tweets that discuss the retraction itself. Compared with other tweets about retracted articles, keyword-mentioning tweets made up 13.4% of our observations but accounted for 22.1% of all likes. The presence of retraction-related tweets before the occurrence of a retraction appears to indicate an undercurrent of critical discussion around some articles prior to their retraction. Consistent with recent studies, this provides additional evidence that Twitter users are critical in their discussion of these articles [21, 41]. Further support for this possibility is provided by the fact that tweets about the non-retracted control articles were considerably less likely to mention retraction-related keywords, such as "retract", "ethics", and "plagiarism", than pre-retraction tweets about retracted articles. Although ours is not the first study about retracted articles to use non-retracted articles as a control set, prior work did not compare the types of attention retracted articles receive relative to non-retracted articles. By doing so, we gained two key insights.
First, we learned that retracted articles received different types of engagement than non-retracted articles, with considerably fewer original tweets but more retweets, quote tweets, and replies than control articles. One possible reason for this discrepancy is that since retractions are inherently unusual, tweets about them generate increased engagement due to human curiosity. This is further supported by the fact that tweets specifically using retraction-related keywords received a larger proportionate share of likes and replies. Second, we observed profound differences between the user types, with comparatively more public users and bots and fewer practitioners, scientists, and science communicators represented in tweets about retracted articles than about non-retracted articles. These findings represent an important contribution to the larger assessment of the role of Twitter as a forum for public science discussion.

Prior research finds that retracted articles which receive high amounts of tweets pre-retraction receive proportionately less attention following retraction than articles which are initially unpopular [41]. However, because these studies do not perform user analysis, it is unclear which user groups on Twitter are most susceptible to engaging with retracted content both before and after retraction. Our research demonstrates that retracted articles receive proportionately more tweets from groups that are less represented in non-retracted science tweets, namely members of the non-scientific public and bot users. However, these groups are represented differently depending on whether they tweet prior to or following the retraction. Bots, in particular, are represented in nearly four times as many tweets post-retraction as pre-retraction, and have a relatively high prevalence rate of 8.31% of retraction tweets, as compared with 5% overall prevalence on Twitter [42]. This supports previous research which finds that bots are prevalent in science communication on Twitter [28, 29]. At the same time, the fact that bot tweets are more likely than those of other user types to mention retraction-related keywords (41.24% prevalence) suggests that bots play an outsized role in the online correction process by producing tweets that call attention to an article's retraction.

Greater monitoring and awareness of retracted articles aligns with the concept of a "healthier" Twitter, which Twitter itself has indicated is a high priority [43]. We argue that the kind of analysis presented in this article can be useful in assessing the health of online discussions around science. For articles with significant Twitter attention, establishing baseline values for the expected proportion of retraction aware tweets (identified using keywords or more sophisticated techniques) or the proportion of tweets from academic users could contribute to developing approaches for detecting problematic sharing patterns. For instance, articles receiving significantly lower than expected proportions of keyword-mentioning tweets after retraction may indicate that the retraction itself has not been sufficiently communicated. In order to rectify this, Twitter and other digital platforms might take inspiration from scite, a recently developed tool aimed at tracking citations of articles, including retracted ones [44]. This tool has already been incorporated into citation management software like Zotero. Using scite, Zotero flags retracted articles and issues a visible warning when a user saves a retracted article.
A similar feature could be added to social media platforms, either directly or through a browser plugin that lets users opt in to receiving such notifications. Additionally, Twitter could monitor posts to ensure that retracted articles are mentioned correctly. One way this could be done is through integration with Twitter's Birdwatch program. Birdwatch, a recent effort, uses crowdsourced labels to evaluate the accuracy of claims made on the platform. We believe that a similar approach could be applied to scientific findings. Domain experts vetted by Twitter could review mentions of research advances in their respective disciplines to ensure that claims about retracted articles are accurate.

While user type and keyword analysis provide important context for the observed effects of retraction on Twitter engagement, these measures do not capture the full context of online conversations. The keywords used were selected from the "Reasons for Retraction" recorded in Retraction Watch's database. From this list, we took care not to include keywords that were too ambiguous (e.g., "data") and further broke our analysis down by keywords to confirm that this proxy analysis was informative (Figure 4). However, it is possible that some instances of the keywords were used in a niche context which we did not anticipate. Future work might develop a more elaborate list of keywords, or incorporate more expensive methods such as supervised learning. In addition, since the TF-IDF model we used for classifying user types relies on the description text, we were unable to classify users who did not provide a profile description. Since understanding how different user types engage with retracted articles on Twitter is a critical and understudied component of engagement, future efforts could complement our findings by investigating potential ways to infer user types from user profile elements other than the description text, e.g., users' membership lists, posting behavior, or social network structure. These approaches have previously been successful in identifying individual user types on Twitter, e.g., bots [45], scientists [46], or journalists [47], but were not fruitful in classifying multiple user types, as we aimed to do in our work.

Today's political environment is one where science has become a topic of widespread contention. From COVID-19 to climate change, the solution to many of today's global problems depends in part on the dissemination of reliable scientific information. While social media has the potential to bring diverse constituents to scientific discussions in an unprecedented manner, it has also facilitated the spread of unreliable scientific information, including retracted research. In this context, social media platforms are called upon to create procedures that complement, rather than undermine, the retraction process. To support these efforts, our study has helped demystify Twitter engagement with retracted articles, indicating an association between tweet engagement, academic user status, and the retraction-related content of tweets. This opens up a window for further inquiry into how social media can be used for the development of research ideas, broadening participation in scientific discussions, and engaging with publishers and policy-makers.
Against a backdrop of scientific misinformation, our research offers a note of cautious optimism, demonstrating how Twitter, a platform known for its brevity, is capable of hosting relevant conversations about academic articles both before and after a retraction. Coupled with the timeliness and popular affordances of the platform relative to traditional forms of science communication, this supports the notion that social media can serve an important deliberative role within science. For instance, scientists might use Twitter to debate whether they think an article is likely to be plagiarized or whether some data could be fraudulent. Indeed, we find evidence suggesting that this is already happening. Ultimately, these results encourage design choices on social media that amplify desirable scientific processes and include more diverse groups in scientific discussions.

References

Can tweets be used to detect problems early with scientific papers? A case study of three retracted COVID-19/SARS-CoV-2 papers
A scientist's fraudulent studies put patients at risk
RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children
Quantifying the effect of Wakefield et al. (1998) on skepticism about MMR vaccine safety in the U
Diagnosing the determinants of vaccine hesitancy in specific subgroups: The Guide to Tailoring Immunization Programmes (TIP)
Retracted Coronavirus Papers
Altmetric Top 100
Covid-19 misleading information policy (blog)
Introducing the Twitter Impact Factor: An Objective Measure of Urology's Academic Impact on Twitter
New media landscapes and the science information consumer
Identifying Platform Effects in Social Media Data
Confronting stem cell hype
The spreading of misinformation online
The evidence for motivated reasoning in climate change preference formation
Misinformation in and about science
Be Internet Awesome
Tools - How do you know a paper is legit?
The chronic growing pains of communicating science online
What did retractions look like in the 17th century?
Dynamics of cross-platform attention to retracted papers: Pervasiveness, audience skepticism, and timing of retractions
(Mis)informed about what? What it means to be a science-literate citizen in a digital world
When Corrections Fail: The Persistence of Political Misperceptions
Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence
What a massive database of retracted papers reveals about science publishing's 'death penalty'
Characteristics of retracted articles based on retraction data from online sources through
Retractions: Guidance from the Committee on Publication Ethics (COPE)
Investigating the quality of interactions and public engagement around scientific papers on Twitter
User motivations for tweeting research articles: A content analysis approach
Allegation of scientific misconduct increases Twitter attention
Retractions from altmetric and bibliometric perspectives
The case for (more) diversity in peer review
Opinion: Gender diversity leads to better science
Research Techniques Made Simple: Scientific Communication using Twitter
Random forests. Machine Learning
Classification and regression trees
Scikit-learn: Machine learning in Python
Media and social media attention to retracted articles according to Altmetric
Four truths about bots
A healthier Twitter: Progress and more to do
Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning
Detection of novel social bots by ensembles of specialized classifiers
A systematic identification and analysis of scientists on Twitter
Detecting journalism in the age of social media: Three experiments in classifying journalists on Twitter

Acknowledgments: This work was supported by the U.S. National Science Foundation under Grant No. IIS-1943506. The authors would like to thank Hao Peng and Daniel Romero for the data they shared.