key: cord-0544062-1skldirz authors: Yudhoatmojo, Satrio Baskoro; Cristofaro, Emiliano De; Blackburn, Jeremy title: "We won't even challenge their lefty academic definition of racist:" Understanding the Use of e-Prints on Reddit and 4chan date: 2021-11-03 journal: nan DOI: nan sha: ca5850fc6ef6abba0b3f7d79e309d169a457d2ad doc_id: 544062 cord_uid: 1skldirz
The dissemination and the reach of scientific knowledge have increased at a blistering pace. In this context, e-print servers have played a central role by providing scientists with a rapid and open mechanism for disseminating research without having to wait for the (lengthy) peer-review process. While helping the scientific community in several ways, e-print servers also provide scientific communicators and the general public with access to a wealth of knowledge without having to pay hefty subscription fees. Arguably, e-print servers' value has never been so evident, for better or worse, as during the COVID-19 pandemic. This motivates us to study how e-print servers are positioned within the greater Web, and how they are "used" on Web communities. Using data from Reddit (2005-2021) and 4chan's Politically Incorrect board (2016-2021), we uncover a surprisingly diverse set of communities discussing e-print papers. We find that real-world events and distinct factors influence the e-prints people are talking about. For instance, there was a sudden increase in the discussion of e-prints, corresponding to a surge in COVID-19 related research, in the first phase of the pandemic. We find a substantial difference between the conversation around e-prints and their actual content; in fact, e-prints are often being exploited to further conspiracy theories and/or extremist ideology. Overall, our work further highlights the need to quickly and effectively validate non peer-reviewed e-prints that get substantial press/social media coverage, as well as mitigate wrongful interpretations of scientific outputs.
One of the main products of scientific research is the publication of articles and papers. In fact, this is explicitly built into most academic career paths [48], as per the infamous "publish or perish" aphorism. Naturally, this translates to intense competition, not just in performing cutting-edge research, but also in disseminating findings quickly and broadly. In this context, e-print servers such as arXiv, bioRxiv, and medRxiv have played a crucial role by providing scientists with a rapid and open mechanism for effectively disseminating their work, which is increasingly often a requirement of funding agencies worldwide. Besides enabling researchers to quickly spread the word and elicit feedback about their research (or even simply "timestamp" it), e-prints are also increasingly pivotal in the context of science communication and outreach. Science journalists, policymakers, and enthusiasts can easily monitor and search for articles on a handful of e-print servers, without paywalls/subscription fees. Laypeople can do the same, possibly following links from press coverage. The other side of the coin, however, is that articles posted on e-print servers are often not peer-reviewed and, by design, there is essentially no quality control on what is published. As a result, also owing to lack of expertise, users may accidentally or intentionally read questionable papers. The problem is further compounded when users start disseminating possibly flawed, inconclusive, or misinterpreted results and treat them as a gold standard. A case in point is the explosion of e-prints in the early months of the COVID-19 pandemic [17, 18, 34, 37, 47], a phenomenon referred to as pandemic publishing by [34]. In fact, some dubious claims and predictions about COVID-19 and possible treatments have been found on bioRxiv and medRxiv according to their founder [34].
This prompts the need to explore the implications of openly accessible (and possibly non peer-reviewed) e-prints for society and science in general. In particular, we set out to better understand how e-prints are positioned within two distinct Web communities: Reddit and 4chan's /pol/. We focus on these two communities because they provide a reasonable balance between known purveyors of false information and more mainstream discussion. Overall, we focus on three research questions:
RQ1: What is the general presence of e-prints on 4chan's Politically Incorrect board and Reddit, and what communities are linking to/discussing them?
RQ2: What kind of e-prints are linked to?
RQ3: How are e-prints discussed, in particular on nonscientific Web communities?
Methods. We collect and analyze over 14 years (from November 2, 2005 to March 16, 2021) worth of Reddit data and over 4 years (July 1, 2016 to March 16, 2021) of /pol/'s. We examine all the posts (113.5K and 11K on Reddit and /pol/, respectively) that contain links to one of eight e-print servers: arXiv, bioRxiv, medRxiv, ChemRxiv, viXra, PsyArXiv, EarthArxiv, and SocArxiv. We measure the frequency of posts containing links to these servers over time to uncover changes in the prevalence of links and shifts in their relative popularity. Next, we choose the three servers with the largest numbers of e-prints linked to from the posts in our dataset and extract the topics they are written about using Top2Vec [6, 19]. We compare topic similarity across linked e-prints to shed light on how diverse the consumption of e-prints is in our Web communities. To understand the positioning of e-prints in discussion, we measure the degree of similarity between the e-prints and the discussions themselves. Finally, to get an even more detailed understanding of this positioning, we perform a qualitative analysis of top discussions from non-science focused subreddits.
Main Findings. Overall, our analysis yields several findings.
1. We shed light on the widespread presence of e-prints on 4chan's Politically Incorrect board and Reddit, finding that arXiv e-prints dominated 4chan's Politically Incorrect board until 2020 (coinciding with the COVID-19 outbreak), when a surge of bioRxiv and medRxiv e-prints overtook the platform and continued to dominate through the end of our data collection. On Reddit, however, arXiv e-prints have always dominated, and remain the dominant e-print source.
2. Our similarity measurement shows low similarity (0.5 at the 99th percentile) across linked e-prints, indicating that they cover a wide variety of topics, except for bioRxiv e-prints on Reddit, where we found a high degree of similarity (0.94 at the 99th percentile).
3. We show that discussions on both datasets are more similar to the abstracts than to the full text of the papers, even though the degree of similarity is relatively low overall.
4. Our qualitative analysis uncovers evidence of e-prints being twisted to fit the agenda of a community or individual.
Preprint. Preprint is a term used for an unpublished document, e.g., an advance copy of a scientific article intended for publication in a journal or conference proceeding [2]. Preprints in science have been used as an informal medium for information exchange to short-circuit some barriers in scientific output, e.g., publication delays, increasing visibility compared to the growth of publications, and the competitive nature of academia placing pressure on being "the first" to present findings. The field of particle/high-energy physics was an early pioneer in using preprints, distributing them via physical mail before preprint servers existed [2, 20, 49]. In 1991, Paul Ginsparg at Los Alamos National Laboratory in New Mexico launched an electronic bulletin board to distribute preprints in theoretical High-Energy Physics (hep-th), with a notification system sent using the email address hep-th@xxx.lanl.gov [20].
Later on, the electronic bulletin board evolved into a full-blown website hosted at arXiv.org. Over time, the preprints on arXiv grew beyond Physics to broader fields like Mathematics, Computer Science, and Statistics (for a complete listing of arXiv categories, see [7]).
e-Print. Nowadays, the more appropriate term to describe the broad preprint ecosystem is e-print. The term acknowledges the fact that papers might have been published (or be about to be published) in peer-reviewed venues, while the authors also make them openly accessible. Other fields have also adopted e-print culture, with most deriving the name for their servers from the original arXiv server. Over 40 e-print servers were created between 2000 and 2010 [49]. Some of these are general-purpose, e.g., PeerJ PrePrints, Preprints.org, and viXra. There are also servers for specific disciplines (e.g., LawArXiv, SportsArxiv, ChemRxiv, PsyArXiv, SocArXiv, and EarthArXiv), and regional ones (e.g., AfricArXiv, Arabixiv, IndiaRxiv, and Frenxiv). In recent years, the creation of focused e-print servers has not diminished; bioRxiv, focusing on biology, was launched in 2013 [10, 26], while medRxiv, focusing on medical sciences, was launched in 2019 [26, 38].
Benefits. As a medium for informal information exchange, the first advantage of e-prints is the rapid distribution of the article to potential readers [2], which allows authors to be the first to publicize/timestamp their findings and allows potential readers to read and give feedback to the authors as soon as possible [26]. As feedback is considered and revisions are made, the e-print can be updated with a newer version on the server. Another advantage is that e-print servers are accessible to the public, unlike typical publishing platforms, which hinder public access via paywalls. That is, e-prints have an inherently broader audience than articles published in many well-established, peer-reviewed venues.
Overall, e-prints allow for early recognition, fast and broad dissemination, and open access, and even facilitate collaboration [26, 44, 48].
Drawbacks. The other side of the coin is that e-prints do not go through the peer-review process that accompanies publication in traditional venues [26]. Thus, there is no real control (except for post-review) over an e-print's quality, yet e-prints are still regularly cited by other scientific work. Although reducing academic gatekeeping is a laudable goal, the peer-review process provides, at minimum, some level of assurance that a paper has been looked at by scientists other than the authors. e-prints also contribute to information overload. Because their number is essentially unbounded, scientists must consider a substantially larger body of literature when exploring their problem domain. Again, while this is not necessarily bad, it has consequences; the larger the haystack, the more difficult it is to find the needle hidden within it.
4chan is an image-sharing bulletin board where anyone can share images and post comments anonymously, without having to create an account. 4chan consists of many boards, each with a particular focus of interest, and each board has many threads discussing themes related to that of the board. A new thread is created by an "original poster" (OP) by writing a post that attaches a single image. Users can reply to a post with or without attaching images and add references to previous posts, quote text, etc. 4chan is also known for the ephemerality of its threads (i.e., threads are continuously deleted). One particular board we focus on is the "Politically Incorrect" board, or /pol/; this board is quite unique as it basically performs no moderation at all: "everything goes." Many of the discussions are close to far-right and alt-right movements and exhibit xenophobia, social conservatism, racism, and hate [24].
Reddit is a social news aggregation and discussion website where a post created by a user can be up-voted or down-voted by other users. Comments on a post can be replied to and can also be up-voted and down-voted. Reddit has sub-communities called subreddits, each associated with a particular area of interest (e.g., politics, movies, science). The top submission in a subreddit is shown at the top of the page, and by default the comment shown at the top is the best one, but users can choose different settings (i.e., Top, New, Controversial, Old, and Q&A).
e-Prints. Xie et al. [49] reported on the exponential growth of e-prints over 30 years, finding that arXiv is by far the largest server. One study by [1] found that two thirds of bioRxiv e-prints posted before 2017 ended up in peer-reviewed journals, while [4] showed that 30% of bioRxiv e-prints remain unpublished in peer-reviewed venues, and half of the published ones ended up in Elsevier, Nature, PLOS, and Oxford University journals. An analysis of Computer Science e-prints posted on arXiv between 2008 and 2018 showed that peer-reviewed e-prints differ in several ways from their published version, with, e.g., changes to titles, authorship, abstract and introduction, the addition of more authoritative references, and the availability of source code [36]. By analyzing the main driving factors in accelerating the dissemination of scientific knowledge, [48] showed that the early-view and open-access effects of e-prints contributed to measurable citations and readership, as well as visibility. The recent COVID-19 outbreak has driven a surge of papers for which e-print servers play an influential role. A recent study by [47] found that the surge of e-prints and scientific communication on social media accelerated during the pandemic, and argued that the scientific community should embrace e-prints and open, critical discussions instead of depending solely on the peer-review process as a quality control measure.
Using altmetrics and content analysis, a study of COVID-19 related e-print usage by media outlets revealed that outlets ranging from those focused on the medical community to traditional news outlets, as well as news aggregators, heavily covered e-prints during the early months of the outbreak, when our understanding of the pandemic was in its infancy [17]. COVID-19 e-prints also receive increased scientific and public engagement, with shorter review times and widespread use by both journalists and policymakers [18].
Scientific communication. Several studies have investigated scientific communication on Reddit. For instance, [31] revealed that the r/science subreddit provided substantial information exchange, and that the comments produced a unique form of science communication that guides engagement with scientific research. Also, [8] showed that frequent contributors to r/science used specialized language to discuss research findings, while transient contributors and contributors that eventually leave r/science did so less; hence, the technical language served as a gatekeeper against contributors whose language is not aligned with that of frequent contributors on r/science. Sengupta et al. [45] found that posts on r/academia concern the challenging aspects of academia, like plagiarism, working in academia, and mental health, while r/gradschool posts were more about graduate school life. Other studies focused specifically on COVID-19. Kousha et al. [33] discovered that the rapidly increasing research volume was more accessible through the Dimensions database, but less so through Scopus, Web of Science, and PubMed, and that COVID-19-related papers were already highly cited, with particular attention from the news and social media. Chen et al. [14] investigated discussions on Weibo where users seek help through posts or provide support through comments, and showed that users' commenting behavior was related to geographical proximity and to the user's level of expertise on the topic.
Social media studies. Hine et al. [24] showed that many links posted on /pol/ were to "right-wing" news sources and shed light on "raiding" behavior, where /pol/ users would go to YouTube to post hate in video comments. Zannettou et al. [51] showed that alt-right communities on /pol/ and Reddit had a significant influence on Twitter in propagating "alternative" news to mainstream social networks. Another study by Zannettou et al. [50] focused on weaponized memes by analyzing the propagation, evolution, and influence of Internet memes on Twitter, Reddit, /pol/, and Gab. Grover et al. [21] uncovered alerting behavior of individual extremists in an online environment through behavioral text pattern analysis of a radical right-wing community on Reddit (i.e., r/altright). By studying the characteristics of user actions in the threads of the r/politics and r/worldnews subreddits, [22] classified different patterns of controversies into disputes, disruptions, and discrepancies. LaViolette et al. [35] looked at the r/MensRights and r/MensLib subreddits and shed light on the ideological differences between them using text classifiers, keyword frequencies, and qualitative approaches. Aldous et al. [3] developed a model to predict whether an article will be shared on another social media platform by evaluating how topics affect audiences across five social media platforms (Facebook, Instagram, Twitter, YouTube, and Reddit) at four levels of engagement, achieving 80% precision. Motivated by the emergence of "direct-to-consumer" genetic testing, [39] analyzed the highly toxic language used in genetic testing discussions on Reddit and /pol/. Rajadesingan et al. [43] showed that pre-entry learning of norms contributed the most to maintaining stable "toxic" norms on political subreddits, in which newcomers' comments tend to differ from the behavior of the same people on other subreddits; that is, behavior adjustments are community-specific and not broadly transformative.
Building word embeddings of Reddit discussions, [16] uncovered gender bias, religious bias, and ethnic bias. Guimaraes et al. [23] designed a feature space and implemented a classifier to predict a controversial post-event given a prefix of a path in a Reddit discussion thread (i.e., US politics, World Politics, Relationships, and Soccer). Rajadesingan et al. [42] discovered abundant political talk in non-political subreddits with less toxic comments. On the lighter side, [46] used neural embeddings to measure the social and cultural context of large-scale online music sharing on Reddit and showed that a large amount of online music sharing was driven by extra-musical factors, e.g., whether the artist is associated with meme culture. Orii et al. [40] studied the sentiment of truckers on r/Truckers towards the impact of autonomous trucks on the trucking industry using qualitative methods and found that only 0.98% of the comments had positive views on automation.
            /pol/                        Reddit
Server      #Posts  #Links  #e-Prints    #Subs   #Comms  #Links  #e-Prints
arXiv       2,422   525     427          42,749  66,610  51,108  50,904
bioRxiv     3,793   898     676          785     490     874     781
medRxiv     4,161   885     719          43      170     70      60
ChemRxiv    452     11      5            112     383     88      69
viXra       189     44      23           1,161   964     1,084   975
PsyArXiv    40      19      17           0       0       0       0
EarthArxiv  0       0       0            23      32      25
In this section, we present the construction of our dataset and some details on how it was collected.
/pol/. We acquire an updated dataset from the authors of [41], which contains /pol/ threads from June 30, 2016, through March 16, 2021. The next step is to find threads that contain links to e-print servers. For the most part, this involved a simple regex match; however, we also had to filter out links that pointed to the e-print servers' homepage, search interface, author page, etc.
Reddit. We acquire our Reddit dataset from PushShift [9], which contains submissions data from June 1, 2005, to March 16, 2021, and comments data from December 1, 2005, to March 16, 2021.
Next, we find the submissions and comments that contain links to e-print servers. As with the /pol/ data collection, this involves a simple regex match and filtering out links that point to the e-print servers' homepage, search interface, author page, etc.
e-Print Servers. To find discussions that include e-prints, we search through our dataset for links pointing to e-print servers. Specifically, we search for the following eight e-print servers: arXiv, bioRxiv, medRxiv, ChemRxiv, PsyArXiv, viXra, EarthArxiv, and SocArxiv.
We present the results of our study organized according to the research questions proposed in Section 1.
We begin by taking a high-level view of e-prints on social media. Our goal is to get a broad understanding of not just the prevalence of e-prints on social media but also what types of communities within /pol/ and Reddit are engaging with e-prints. On the one hand, researchers and scientists certainly use social media, and thus we expect to see at least some activity involving e-prints. Further, one of the primary goals of e-prints in particular (and science in general) is the large-scale dissemination of knowledge, and that includes science enthusiasts and, at some level, laypersons as well. On the other hand, the primary audience of e-prints is trained scientists, and the path from scientific writing to a general audience often involves many intermediaries (e.g., journalists or social media science influencers) that distill the scientific paper into chunks that are more palatable and understandable to a general audience. This would indicate that, for the most part, "typical" social media users are unlikely to discover e-prints organically.
It is further unlikely that a typical social media user is equipped with the expertise to properly understand and interpret the "raw" scientific product, considering how even within the same discipline, different scientific communities do not communicate in the same language, let alone make use of the same underlying conceptual frameworks and techniques. We begin by providing some high-level statistics in Table 1 and plot the frequency per week of e-print links posted on /pol/ and in Reddit comments and submissions in Figure 1. From this analysis, several concrete findings are immediately apparent. First, there are meaningful differences in the rate of e-prints linked to across all three slices of our dataset. On /pol/ (Figure 1a), there were no links to EarthArxiv or SocArxiv, and until early 2020 (coinciding with the COVID-19 pandemic), arXiv e-prints dominated. In early 2020, however, we see a surge in bioRxiv and medRxiv, and they have continued to dominate since then. On Reddit, arXiv has dominated comments and submissions, and continues to do so, although this trend is much clearer for comments than submissions (Figures 1b and 1c, respectively). Further, there are no links to SocArxiv on either Reddit or /pol/; on Reddit, there are links to EarthArxiv but none to PsyArXiv e-prints. As noted earlier, Reddit is not a singular community; instead, it is composed of subreddits, each with its own focus and policies. By leveraging this, we can better understand the type of communities e-prints are being posted in. First, we find that about 58% of subreddits only have a single submission posted, and 50% only have a single comment linking to an e-print (see Figure 2). Although a deeper discussion is omitted due to space limitations, this is a strong indication that while e-prints are definitely present on Reddit, they are concentrated in relatively few subreddits.
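The link-matching and weekly-frequency steps described above can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the host list, the `PAPER_PATH_RE` path prefixes used to separate paper links from homepage/search/author pages, and the helper names `extract_eprint_links` and `weekly_counts` are all assumptions made for the example, since the paper does not give its actual patterns.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed subset of e-print hosts (the study covers eight servers).
SERVERS = {
    "arxiv.org": "arXiv",
    "biorxiv.org": "bioRxiv",
    "medrxiv.org": "medRxiv",
}

# Capture the host (without a leading "www.") and an optional path.
LINK_RE = re.compile(r"https?://(?:www\.)?([a-z0-9.-]+)(/[^\s)\]>]*)?", re.IGNORECASE)

# Assumed rule: only these path prefixes point at an actual paper, so links to
# the homepage, search interface, author pages, etc. are filtered out.
PAPER_PATH_RE = re.compile(r"^/(abs|pdf|content)/.+", re.IGNORECASE)


def extract_eprint_links(text):
    """Return (server, normalized_link) pairs for paper links found in text."""
    links = []
    for match in LINK_RE.finditer(text):
        host = match.group(1).lower()
        # Drop trailing sentence punctuation that got glued to the URL.
        path = (match.group(2) or "/").rstrip(".,;")
        server = SERVERS.get(host)
        if server and PAPER_PATH_RE.match(path):
            links.append((server, host + path))
    return links


def weekly_counts(posts):
    """Count e-print links per (ISO year, ISO week, server).

    posts is an iterable of (unix_timestamp, text) pairs.
    """
    counts = Counter()
    for timestamp, text in posts:
        moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        year, week, _ = moment.isocalendar()
        for server, _link in extract_eprint_links(text):
            counts[(year, week, server)] += 1
    return counts
```

Grouping by ISO week is one possible choice for a "frequency per week" series; any fixed seven-day bucketing would serve the same purpose.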
To get a better feel for this, Table 2 shows the top 10 subreddits in terms of submissions and comments linking to e-prints. It is immediately apparent that machine learning communities are the primary disseminators of e-prints on Reddit, along with, more generally, subreddits focused on the disciplines that arXiv targets (e.g., Physics). While looking at the subreddits that post the most e-print links is insightful, it does not take engagement into account. That is, are e-prints being posted as part of large, active discussions, or are they more often posted in smaller, more intimate conversations? To answer this, we measure engagement on a Reddit submission as a function of the number of comments the submission receives. In Figure 3, we plot the top 10 submissions (some subreddit names in the labels are shortened due to limited space) that linked to an e-print in terms of our engagement metric. In Figure 4, we plot the top 10 submissions that had a comment that linked to an e-print paper in terms of our engagement metric. While we do see submissions from science-oriented subreddits, we also see more general audience subreddits as well as more questionable subreddits like r/The_Donald (r/T_D), r/conspiracy (r/consp), and r/LockdownSkepticism (r/LckdwnSkep).
Main Takeaways. On /pol/, arXiv e-prints dominated the community until early 2020 (coinciding with the COVID-19 pandemic), at which point bioRxiv and medRxiv became the most dominant. On the other hand, there are no links to EarthArxiv and SocArxiv on /pol/. On Reddit, arXiv has dominated and continues to dominate comments and submissions, and there are no links to PsyArXiv and SocArxiv. Moreover, we found a strong indication that e-prints are concentrated in relatively few subreddits (i.e., machine learning communities and physics-related subreddits).
When looking at the number of comments that e-print linking submissions received and the number of comments linking to e-print papers, we see more general audience subreddits and more controversial ones (i.e., r/The_Donald, r/conspiracy, and r/LockdownSkepticism) in addition to science-oriented subreddits.
[Figure 3. Axis: #Comments. Submissions shown: r/tech-1vtt8b, r/T_D-cljuik, r/T_D-ecm9ja, r/explkm5-573im6, r/science-4tev0n, r/space-f1p6ye, r/programming-6tp3f0, r/science-28k9le, r/explkm5-1t6972, r/math-9e34xh.]
[Figure 4. Axis: #Comments w/pre-print links. Submissions shown: r/consp-t3_ju6qwt, r/consp-t3_je4er8, r/consp-t3_hmjfgy, r/ChnFlu-t3_i3ruqo, r/pics-t3_hqksw8, r/Conserv-t3_j0w8u3, r/Conserv-t3_i08dqt, r/Coronavr-t3_i1g8xq, r/JoRog-t3_hpwbji, r/NJ-t3_gacktm.]
Although we can get a coarse-grained understanding of what kind of e-prints are linked to on social media based on the server they are published on, and we can even leverage the various categories that the e-print servers define, ideally, we would like to get a feel for what individual papers are about. To answer this, we leverage Top2Vec to build topic embeddings for our corpora. Topic embeddings aim to discover a high-level summary of the information in a particular document [6] (i.e., a topic) in a larger corpus in a way that preserves essential statistical relationships [11]. There are a variety of topic modeling methods in the literature, and we use Top2Vec. Top2Vec uses Doc2Vec to create the document embedding of the corpus, which is used to calculate the distance between document vectors and word vectors. A dense area of documents in this space indicates a topic common to those documents [6]. We construct one Top2Vec model for /pol/ and one for Reddit and include only the top 3 e-print servers for each (see Table 3). Although we omit details due to space limitations, we performed a manual validation of the generated topics by examining both the top 10 words in generated topics and the top 10 most similar e-prints to each topic.
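To make the "top words of a topic" idea above more concrete, here is a toy sketch of the final step as we understand it from the Top2Vec description: a topic vector is taken as the centroid of the document vectors in one dense cluster, and the topic's words are the word vectors closest to that centroid. The sketch assumes the embedding and the clustering have already been done; the two-dimensional vectors, the vocabulary, and the function names are invented for illustration and do not come from the paper or from the Top2Vec library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def topic_words(cluster_doc_vectors, word_vectors, top_n=10):
    """Rank words by similarity to the centroid of one document cluster."""
    topic_vector = centroid(cluster_doc_vectors)
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical toy data: three documents from one dense cluster and a
# three-word vocabulary embedded in the same two-dimensional space.
docs = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]
words = {"policy": [1.0, 0.1], "reward": [0.7, 0.3], "virus": [0.0, 1.0]}
print(topic_words(docs, words, top_n=2))
```

The same ranking, applied to document vectors instead of word vectors, would give the "most similar e-prints to each topic" used in the manual validation.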
As might be intuited from the dominance of machine-learning-oriented subreddits, the largest topic we discovered for arXiv e-prints on Reddit was about reinforcement learning. Interestingly, the largest topic for arXiv papers linked to on /pol/ was about deep learning in a broader sense, e.g., convolutional neural networks and adversarial learning. The biggest topics for e-prints coming from medRxiv and bioRxiv are related to COVID-19, ranging from understanding the epidemic spread of COVID-19 to the mechanism of infection (e.g., airborne or not), as well as potential treatments. Finally, we use our Top2Vec embeddings to understand the overall diversity of e-prints in our social media datasets. To do this, we compute the pairwise similarity between all e-prints in our corpora. On /pol/, we find a relatively low degree of similarity: the 99th percentile for similarity across all three e-print servers is well under 0.5 (0.38 for arXiv, 0.44 for medRxiv, and 0.43 for bioRxiv). On Reddit, however, this is not the case for bioRxiv in particular, where the 99th percentile has a similarity of 0.94, indicating that most bioRxiv papers on Reddit cover the same topic (throughout our dataset, the topic is COVID-19).
Main Takeaways. There are some takeaways from the analysis of the topic embeddings of the e-prints. The largest topic for arXiv e-prints on Reddit was about reinforcement learning, and, interestingly, a broader topic about deep learning was discovered on /pol/. The biggest topic for the bioRxiv and medRxiv e-prints is COVID-19, ranging from an understanding of the COVID-19 epidemic spread to the mechanism of infection, as well as potential treatments. The overall e-print diversity analysis shows a low degree of similarity across the three e-print servers on /pol/ (i.e., under 0.5 at the 99th percentile).
On Reddit, however, the bioRxiv e-prints show a high degree of similarity (i.e., 0.94 at the 99th percentile), indicating that most bioRxiv e-prints cover the same topic.
Previously, we presented a high-level view of e-prints on /pol/ and Reddit (i.e., their prevalence on /pol/ and Reddit and the type of communities that engaged with e-prints), and the topics and diversity of the e-prints. Now, we would like to understand how e-prints are positioned in the discussions; ideally, we would scrutinize each thread for this. To answer this, we first examine the high-level similarity between the threads and e-prints and then qualitatively examine the top threads with respect to the positioning of e-prints in the discussions. Once again, we leverage Top2Vec to build topic embeddings, not just for the e-prints but also for the threads. We calculate their similarity in two ways: (1) between threads and abstracts, and (2) between threads and full texts. Altogether, we discover that threads and abstracts have a higher degree of similarity than threads and full texts. One possible explanation is that the abstract is more than enough to gain a general understanding of the paper because it presents a summary of the scientific research [28], including the purpose of the study, main findings, and overall conclusion [5, 29]. On top of that, the vast majority of readers do not look beyond the paper's abstract, and the abstract is the only part the reader sees when they search through electronic databases (i.e., e-print servers) [5]. However, the degree of similarity is still low to moderate. On /pol/, the 99th percentile for the similarity between threads and abstracts on bioRxiv and medRxiv is under 0.5 (0.38 for bioRxiv and 0.35 for medRxiv) and barely over 0.5 for arXiv (0.51). On Reddit, however, bioRxiv and viXra have a higher similarity, namely 0.61 at the 99th percentile of the data.
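The similarity figures quoted above are reported at the 99th percentile of a set of similarity scores. A minimal sketch of that kind of computation is shown below, under two assumptions that the paper does not confirm: that similarity is cosine similarity between embedding vectors, and that the percentile uses linear interpolation. The function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(vectors):
    """Similarity of every unordered pair of vectors (e.g., all e-print pairs)."""
    return [cosine(u, v) for u, v in combinations(vectors, 2)]


def cross_similarities(thread_vectors, eprint_vectors):
    """Similarity of each thread to the e-print (abstract or full text) it links.

    The two lists are assumed to be aligned: thread i links to e-print i.
    """
    return [cosine(t, e) for t, e in zip(thread_vectors, eprint_vectors)]


def percentile(values, q):
    """q-th percentile (0 <= q <= 100) using linear interpolation."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

For example, `percentile(pairwise_similarities(vectors), 99)` would yield a single summary value comparable in spirit to the per-server numbers reported in the text.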
Next, to qualitatively analyze the top threads and shed light on the e-prints' positioning in the discussion, we look at the top submissions in Figures 3 and 4. We intentionally choose submissions from subreddits with general audiences and questionable content, where the e-print's positioning is less evident than on science-oriented subreddits. Hence, we chose the r/The_Donald subreddit (i.e., r/T_D-cljuik), r/LockdownSkepticism (i.e., r/LckdwnSkep-g6esrx), and r/conspiracy (i.e., r/consp-t3_ju6qwt and r/consp-t3_je4er8). The following presents the results of the qualitative analysis.
r/The_Donald was known for its affinity for former U.S. President Trump and a well-documented set of content policy violations that led to its quarantining, eventual ban, and migration to a new platform [25]. Considering that r/The_Donald's focus was decidedly non-scientific, it is noteworthy that one of its submissions had the second-highest number of comments of all submissions that linked to an arXiv e-print (see r/T_D-cljuik in Figure 3a). The submission itself discusses the content moderation decisions about r/The_Donald itself; specifically, a denied appeal to lift the "quarantine" that Reddit admins had put in place. From the submission itself:
As part of this response to the most recently denied appeal, r/The_Donald moderators noted their efforts to reduce policy-violating content, e.g., racist content. As part of this justification, they claimed that an e-print showed that the rest of Reddit was actually 50% more racist than r/The_Donald:
Maybe we really are the least racist place on reddit, but don't just take our word for it. A 2018 meme study [50] trickily tried to say T_D produced more racist memes than anywhere else on Reddit. We won't even challenge their lefty academic definition of racist. But when calculating so-called "racist" memes as a percent of total memes on T_D (0.4%), the rest of Reddit is 50% MORE racist (0.6%).
#RedditSoRacist

NB: The link to the e-print has been replaced with a citation in the quote for readability purposes. The linked e-print is a study of the characterization, evolution, and influence of different Web communities with respect to memes. One of the key findings of this paper is that racist memes are pervasive in fringe communities like r/The_Donald and /pol/. Further, the authors found that r/The_Donald, in particular, was the most efficient community at propagating memes to other communities. In other words, this e-print was being twisted to fit the agenda of a community that the e-print specifically called out, an agenda that eventually resulted in r/The_Donald being banned from Reddit completely.

r/LockdownSkepticism on voluntary exposure to SARS-CoV-2

r/LockdownSkepticism is a subreddit that discusses lockdown policies related to COVID-19 in a "critical" fashion. One submission (see r/LckdwnSkep-g6esrx in Figure 3b) raised the topic of voluntary exposure to COVID-19 as a way to reach herd immunity faster, as discussed in a medRxiv e-print [32]. From the submission's title:

Controlled Avalanche: A Regulated Voluntary Exposure Approach. "Individuals whose probability of developing serious health conditions is low... will be offered the option to be voluntarily exposed to the virus under controlled supervision... an ethically acceptable practice... with low mortality."

Some of the comments on the submission were in favor of the voluntary exposure concept:

Now this is interesting! Not sure if I would want a "certificate". But I would totally do it if they will put me up in a hotel until I am virus-clear!

Others disagreed, bringing up conspiracy theories related to government monitoring as well as to Bill Gates:

and will then be issued 'immunity certificates' you can shove your GatesOfHellChip up your cold, dead azz

r/conspiracy on face mask usage

r/conspiracy is, as the name suggests, a subreddit focused on conspiracy theories in general.
One submission on November 14, 2020 (see r/consp-t3_ju6qwt in Figure 4) questioned the supposedly contradictory statements that flu deaths are down because people are wearing masks, yet COVID-19 is spiking because people are not wearing masks:

"Flu deaths are down because everyone is wearing masks" "Covid is spiking because no one is wearing masks" The 2 arguments are being made simulataneously

One commenter noted that they had observed a lack of legitimate scientific studies showing evidence that masks were not necessary/working and provided a list of several e-prints that demonstrated the efficacy of masks:

... All I have seen is nonsensical psuedo science on so called "studies" that are impossible to reproduce. I shall show you more than 20 that says otherwise and from what I have seen at best if everyone masks and uses them correctly, it may be possible to reduce transmission by 6-15%. ...

The list of links is deliberately omitted due to limited space in this paper, but it includes two medRxiv e-prints, namely [12] and [30]. The former is a review paper on the link between mask usage and respiratory illness in community settings, and the latter is about the efficacy of eye protection, face masks, and personal distancing in mitigating the spread of respiratory viruses. One response quickly brushed off the list of e-prints, saying:

They just spam out 10 or more links and say their dumb opinion because they know 99% of people ain't gonna bother reading all their garbage.

This response indicates that, when no summary or context is given for the listed e-prints, at least some users consider link dumps pointless and are not interested in reading the e-prints themselves.

One comment elaborated on the positive results of using HydroxyChloroquine to treat COVID-19 by linking several studies, three of which were medRxiv e-prints:

Three new studies showing the effectiveness of HydroxyChloroquine. Huang et al.
[27]
HCQ provides protection against COVID [15]
Significantly faster clinical recovery and shorter time to RNA negative when HCQ is used [13]

An immediate response to the above rejected the effectiveness of HydroxyChloroquine by stating:

Trump claimed he was taking HCQ as a preventative against Covid. He still caught it. When he was in the hospital, they gave him Remdesivir and Regeneron. Sounds like HCQ is bullshit.

Although the comment summarized the finding of each e-print it listed, one unsuccessful case involving former President Trump was enough for the respondent to question HydroxyChloroquine as an actual treatment. In other words, e-print findings were dismissed once a high-profile counterexample was seen.

Main Takeaways. First, threads on /pol/ and Reddit are more similar to e-prints' abstracts than to their full-texts. Second, our qualitative analysis shows how an e-print was twisted to fit the agenda of a community in an r/The_Donald submission, and how an e-print was used to keep misinformation alive, as shown in a response to an r/LockdownSkepticism submission. Other examples (i.e., from r/conspiracy) show that citing e-prints is ineffective when a comment only posts links without any context or summary, and that e-print findings are treated as unreliable once they are contradicted by a single unsuccessful case involving a high-profile figure.

This paper examined the prevalence, characteristics, and positioning of e-prints on social media; more precisely, on /pol/ and Reddit. Using over 14 years' worth of Reddit data and over 4 years of /pol/ data, we looked for posts containing e-prints from eight e-print servers (i.e., arXiv, bioRxiv, medRxiv, ChemRxiv, viXra, PsyArXiv, SocArXiv, and EarthArXiv). On /pol/, arXiv e-prints dominated until 2020 (coinciding with the COVID-19 pandemic), when a surge of e-prints from bioRxiv and medRxiv appeared; these continue to dominate. arXiv e-prints on /pol/ are predominantly about deep learning, while the large COVID-19-related topics come from bioRxiv and medRxiv.
On Reddit, arXiv e-prints have dominated and continue to dominate comments and submissions, with reinforcement learning being the most popular topic. We also examined the similarity between e-prints within each server and found a low degree of similarity, except for bioRxiv e-prints on Reddit. When looking at the similarity between e-prints and the social media posts that linked to them, we found that social media discussions are more similar to the abstracts of the e-prints than to their full-texts, suggesting that most readers do not read the main body of the paper. Finally, when looking at the most highly engaged Reddit submissions and comments linking to e-prints, we found evidence of clear misrepresentation of findings, as well as evidence that users outright dismiss e-prints when the poster provides no summary.

Limitations. Naturally, our work is not without limitations. First, we only studied the prevalence of e-prints from eight e-print servers on /pol/ and Reddit. Second, we made no distinction between peer-reviewed and non-peer-reviewed papers posted on the e-print servers. Third, we primarily used Top2Vec to extract topics from e-prints, as opposed to human annotators, who would likely find more nuanced relationships. Finally, we focused solely on how e-prints are used on /pol/ and Reddit, but future work could (and should) explore other social networks, as well as the relationship between the use of e-prints on social media and their use in other scientific papers (e.g., via scientometrics and bibliometrics).

Implications. Our findings show that e-prints are being disseminated in non-scientific communities like /pol/ and Reddit, where people with different levels of expertise may interpret their contents differently. Consequently, wrongful interpretations or low-quality papers may become dangerous "gold standards" to some users.
And, while the authors of this paper are very much in favor of the e-print model, we are also mindful of its drawbacks and potentially dangerous implications for the scientific community at large, including but not limited to those highlighted by our work. For better or worse, research is going to be used by laypeople, and we have little to no control over how it will be used. This is exacerbated by the open-access, yet potentially non-peer-reviewed, nature of e-prints. In short, the traditional model of scientific publishing is a form of gatekeeping, but one that does serve a purpose. At the very least, peer review provides some guarantee that qualified scientists have read and at least not totally rejected the work in the paper. While the traditional publication model gatekeeps the content of a paper itself, it also serves a more cynical form of gatekeeping: readers must pay for access. Together, these ensure that the scientific community is self-policing and mitigate potential misinterpretation by non-scientists (since, arguably, very few people are interested in paying several hundred dollars for an article).

Our results also suggest that social media users are almost certainly not reading the full contents of a paper, but only the title or abstract. While we are fully confident that the same happens to some extent in the scientific community itself (the point of a title/abstract is in some part to attract attention), in many cases, the title and abstract are the only parts of a paper that a layperson could be expected to understand. This leads to some hard questions we need to ask ourselves, with potentially "game-changing" answers. For example, perhaps it is time for the scientific community to fully embrace the fact that the audience for scientific work is far greater than the scientific community itself.
If so, one solution might be to be more directly involved in "simple English" explanations of our research, similar to how there is a simple English version of Wikipedia. In fact, there are already outlets, e.g., The Conversation, that provide a direct conduit for scientists to speak to laypeople about their work. We encourage researchers to take this to heart and, in their proposals to funding agencies, to consider budgeting not just for publication fees and travel to conferences, but also for direct dissemination to the public. While this would most definitely stretch our collective resources even thinner than they already are, it is also likely to help mitigate misinterpretation, with the added benefit of increasing positive public engagement with science.

We leveraged Top2Vec to generate topic models for the top three servers (i.e., arXiv, bioRxiv, and medRxiv) on /pol/. For each server, we looked at the generated topics and manually named each topic based on its top 10 words. Moreover, we derived the scientific field and discipline of each topic from the category of the e-print with the highest document score in that topic. Tables 4, 5, and 6 show the topics generated from arXiv, bioRxiv, and medRxiv e-prints, respectively. Tables 7, 8, and 9 list the e-print with the highest score for each generated topic from arXiv, bioRxiv, and medRxiv e-prints, respectively.

Topics generated using Top2Vec from arXiv, viXra, and bioRxiv are shown in Tables 10, 11, and 12, respectively. Tables 13, 14, and 15 list the highest-scored e-print in each topic for arXiv, viXra, and bioRxiv, respectively.

Tracking the Popularity and Outcomes of All bioRxiv Preprints. eLife
Preprints in Particles and Fields
View, Like, Comment, Post: Analyzing User Engagement by Topic at 4 Levels across 5 Social Media Platforms for 53 News Organizations
bioRxiv: Trends and Analysis of Five Years of Preprints
How to Write a Good Abstract for a Scientific Paper or Conference Presentation
Top2Vec: Distributed Representations of Topics
Explain like I am a Scientist: The Linguistic Barriers of Entry to r/science
Latent Dirichlet Allocation
Facemasks and Similar Barriers to Prevent Respiratory Illness such as COVID-19: A Rapid Systematic Review. medRxiv
Efficacy and Safety of Chloroquine or Hydroxychloroquine in Moderate Type of COVID-19: a Prospective Open-Label Randomized Controlled Study. medRxiv
Exploring Commenting Behavior in the COVID-19 Super-Topic on Weibo
Chronic Treatment with Hydroxychloroquine and SARS-CoV-2 Infection. medRxiv
Discovering and Categorising Language Biases in Reddit
Communicating Scientific Uncertainty in an Age of COVID-19: An Investigation into the Use of Preprints by Digital Media Outlets
The Evolving Role of Preprints in the Dissemination of COVID-19 Research and Their Impact on the Science Communication Landscape
Investigating COVID-19 News Across Four Nations: A Topic Modeling and Sentiment Analysis Approach
ArXiv at 20
Detecting Potential Warning Behaviors of Ideological Radicalization in an Alt-Right Subreddit
Analyzing the Traits and Anomalies of Political Discussions on Reddit
X-Posts Explained: Analyzing and Predicting Controversial Contributions in Thematically Diverse Reddit Forums
Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan's Politically Incorrect Forum and Its Effects on the Web
Do Platform Migrations Compromise Content Moderation? Evidence from r/The_Donald and r/Incels
Rise of the Rxivs: How Preprint Servers are Changing the Publishing Process
Preliminary Evidence from a Multicenter Prospective Observational Study of the Safety and Efficacy of Chloroquine for the Treatment of COVID-19. medRxiv
Are Scientific Abstracts Written in Poetic Verse an Effective Representation of the Underlying Research?
Successful Scientific Writing and Publishing: A Step-by-Step Approach
Physical Interventions to Interrupt or Reduce the Spread of Respiratory Viruses. Part 1 - Face Masks, Eye Protection and Person Distancing: Systematic Review and Meta-Analysis. medRxiv
r/science: Challenges and Opportunities in Online Science Communication
Controlled Avalanche - a Regulated Voluntary Exposure Approach for Addressing Covid-19. medRxiv
COVID-19 Publications: Database Coverage, Citations, Readers, Tweets, News, Facebook Walls, Reddit Posts. Quantitative Science Studies
How Swamped Preprint Servers are Blocking Bad Coronavirus Research
Using Platform Signals for Distinguishing Discourses: The Case of Men's Rights and Men's Liberation on Reddit
How Many Preprints Have Actually Been Printed And Why: A Case Study of Computer
Early in the Epidemic: Impact of Preprints on Global Discourse about COVID-19 Transmissibility. The Lancet Global Health
"And We Will Fight for Our Race!" A Measurement Study of Genetic Testing Conversations on Reddit and 4chan
Perceptions on the Future of Automation in r/Truckers
Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board. ICWSM, 14
Political Discussion is Abundant in Non-political Subreddits (and Less Toxic). ICWSM, 15
Community-Specific Learning: How Distinctive Toxicity Norms are Maintained in Political Subreddits
On the Value of Preprints: An Early Career Researcher Perspective
What are Academic Subreddits Talking About?: A Comparative Analysis of r/academia and r/gradschool
Imagine All the People: Characterizing Social Music Sharing on Reddit
Proliferation of Papers and Preprints During the Coronavirus Disease 2019 Pandemic: Progress or Problems With Peer Review?
Preprints as Accelerator of Scholarly Communication: An Empirical Analysis in Mathematics
Is Preprint the Future of Science? A Thirty Year Journey of
On the Origins of Memes by Means of Fringe Web Communities
The Web Centipede: Understanding How Web Communities Influence Each Other Through the Lens of Mainstream and Alternative News Sources

S.B.Y. is a Ph.D. student awarded the Foreign Fulbright Indonesia DIKTI Higher Ed. Ph.D. scholarship.