key: cord-0934072-yov56pcl authors: Sevryugina, Yulia V.; Dicks, Andrew J. title: Publication practices during the COVID-19 pandemic: Biomedical preprints and peer-reviewed literature date: 2021-01-21 journal: bioRxiv DOI: 10.1101/2021.01.21.427563 sha: 22ea7c8f3852d0e9fb16b249e563aef715b72c51 doc_id: 934072 cord_uid: yov56pcl

The coronavirus pandemic introduced many changes to our society and deeply affected established publication practices in the biomedical sciences. In this article, we present a comprehensive study of the changes in the scholarly publication landscape for biomedical sciences during the COVID-19 pandemic, with special emphasis on preprints posted on the bioRxiv and medRxiv servers. We observe the emergence of a new category of preprint authors working in the fields of immunology, microbiology, infectious diseases, and epidemiology, who extensively used preprint platforms during the pandemic to share their immediate findings. The majority of these findings were works in progress unfit for prompt acceptance by refereed journals. The COVID-19 preprints that became peer-reviewed journal articles were often submitted to journals concurrently with their posting on a preprint server, and the entire publication cycle, from preprint to online journal article, took 63 days on average. This included an expedited peer-review process of 43 days and a journal production stage of 15 days; however, publication delays varied widely between journals. Only one third of COVID-19 preprints posted during the first nine months of the pandemic appeared as peer-reviewed journal articles. These journal articles display high Altmetric Attention Scores, further emphasizing the significance of COVID-19 research during 2020. This article will be relevant to editors, publishers, open science enthusiasts, and anyone interested in the changes that the 2020 crisis brought to publication practices and the culture of preprints in the life sciences.
The lifecycle of any research starts and ends with scholarly communication. Despite a variety of avenues to communicate research findings, the foundation of modern publication practices is publication in a peer-reviewed journal. The peer-review system is, at present, deeply ingrained in scientific minds as the gold standard for research quality. Certainly, the peer-review process improves the drafted manuscript, but previous studies showed that its positive effect on the overall quality of the final report is minor [1]. Besides, the traditional peer-review system is notorious for reviewer bias, lack of agreement between reviewers, harsh criticism concealed by anonymity, multiple cycles of reviews and rejections by different journals, and associated delays and expenses [2].

Object Notation (JSON) format. Data analysis and visualization were done in Python (pandas, numpy, requests, matplotlib, bokeh, and seaborn) using Jupyter Notebook.

To search PubMed, we used Entrez Programming Utilities (E-utilities) [43], an application programming interface (API) that allows searching 38 databases from the National Center for Biotechnology Information (NCBI). For E-utilities, data were downloaded as CSV and converted to Microsoft Excel for further analysis and visualization.

Rxivist [41] is a Python-based web crawler that parses the bioRxiv website, detects newly posted preprints, and stores metadata about each item in a PostgreSQL database. The metadata we extracted contained the title, authors, submission date, category, preprint DOI and, if published, the new DOI and the journal of publication.

Crossref [42] is an official DOI registration agency of the International DOI Foundation that establishes a cross-publisher citation linking system for academic content that includes journals, conference proceedings, books, data sets, etc.
It works with thousands of publishers to provide authorized access to their metadata, including DOI, publication date and other basic information.

Allen Institute for AI (AI2) in collaboration with many partners and released on March 16, 2020. We used its 2020.09.02 release, downloaded on 2020.09.30 from CADRE [46], for metadata associated with refereed journal articles.

Data challenges and study limitations.

Analysis of published preprints. When a preprint is published in a peer-reviewed journal, a reference to the new DOI of the journal article appears next to its title, and the DOIs of the preprint and the published article are permanently linked in indexing platforms and tools, which pull from various APIs. Rxivist [41] proved to be an excellent tool for extracting published DOIs for preprints that eventually appeared as peer-reviewed journal articles, but only when bioRxiv records linked preprints to their external publications. Rxivist also had a two-week delay in updating its metrics, and this delay may explain why some peer-reviewed preprint analogues were missing from Rxivist. Additionally, at the time of our study, Rxivist did not include medRxiv preprints in its database, which changed after Nov 27, 2020. We found that the most reliable method of extracting metadata about each individual preprint was accessing the BioRxiv API [40]. Using the Python library requests, we were able to extract information about each preprint based on its DOI, which gave us a column called 'published.' Within this column, if the preprint was also published in a journal, the metadata provided the DOI of the published version of the paper. To ensure we found all published preprints, we also accessed data from the Crossref, Dimensions, and CORD-19 APIs. To establish the linkage between the preprints and the corresponding peer-reviewed journal articles, we performed both DOI and title matching.
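The DOI-then-title linkage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout, the helper names `normalize_title` and `link_preprints`, the example DOIs and titles, and the convention that an unlinked record carries "NA" in its 'published' field are all hypothetical.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase a title, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def link_preprints(preprints, articles):
    """Map preprint DOI -> journal DOI, using the 'published' DOI first, then the title."""
    by_title = {normalize_title(a["title"]): a["doi"].lower() for a in articles}
    links = {}
    for p in preprints:
        published = p.get("published")
        if published and published != "NA":  # "NA" is an assumed "no link" marker
            links[p["doi"]] = published.lower()
            continue
        match = by_title.get(normalize_title(p["title"]))
        if match is not None:
            links[p["doi"]] = match
    return links


# Hypothetical records; real metadata would come from the data channels.
preprints = [
    {"doi": "10.1101/example.0001", "title": "A Hypothetical Study of SARS-CoV-2",
     "published": "10.1000/EXAMPLE.A"},
    {"doi": "10.1101/example.0002", "title": "Modelling COVID-19 spread: a sketch",
     "published": "NA"},
    {"doi": "10.1101/example.0003", "title": "An unmatched preprint",
     "published": "NA"},
]
articles = [
    {"doi": "10.1000/example.b", "title": "Modelling COVID-19 Spread. A Sketch"},
]

links = link_preprints(preprints, articles)
```

Because the result is keyed by preprint DOI, a preprint found through several channels collapses to a single entry. Exact matching on normalized titles misses titles that were reworded before publication, so a manual check of the remaining "unpublished" records is still needed.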
All channels were then combined and duplicates were dropped. For a detailed demonstration of the data obtained from every data channel, see Published Collections in SI.

To validate whether we found all peer-reviewed preprint versions based on a combination of Rxivist, Crossref, CORD-19, Dimensions, and the BioRxiv API, we randomly selected a sample of 100 preprints that our data returned as "unpublished" from both bioRxiv and medRxiv, and searched Google Scholar by title. Our analysis of "unpublished" preprints returned 10% of bioRxiv and 4% of medRxiv preprints as being published in refereed journals. All journal publications found had slight modifications in the article title or author list, and the original "unpublished" preprints were not linked on preprint servers to the corresponding published versions. In comparison, this false-negative rate is lower than the 37.5% reported by Blekhman et al. [51]. All manually found journal article versions of "unpublished" preprints were manually added to the data discussed in this article.

Double DOI. When we looked for published preprints based on title matching, we encountered a few instances when two published DOIs existed for a peer-reviewed preprint version. In one case, it was an erratum for the paper, and in the other case it was a publication on another preprint server. In both cases, we used only the DOI for the article in the peer-reviewed journal, and the publication on another preprint server was removed from further analysis. We also encountered a few cases when preprints with different DOIs were linked to the same DOI of the published version. On inspection, preprints with different DOIs were somewhat similar in titles and author lists but not identical. For our analysis, we kept only the DOI of the preprint that was published earlier.

PubMed.
As mentioned in the Introduction, the NIH Preprint Pilot started in June 2020 and, at this stage, it primarily focuses on NIH-supported and COVID-19 related preprints from various servers. By Sept 26, PubMed indexed 1,048 preprints from medRxiv, bioRxiv, ChemRxiv, arXiv, Research Square, and SSRN, of which 1,043 were on COVID-19, and this constituted only 11.5% of the 9,072 medRxiv and bioRxiv COVID-19 related preprints from the BioRxiv API. For these reasons, we did not use PubMed as a data source for preprints. We used PubMed (through E-utilities) to obtain metadata on peer-reviewed articles of "Journal Article" and "Review" article types.

In analyzing PubMed dates, we found that articles with a missing day-of-publication were coded as being published on January 1st; a similar issue was reported earlier for Crossref dates [38]. Based on the low number of preprints in January, we decided to avoid discussing January data for PubMed (this month is omitted in Fig 2).

Categories. In general, we used a single category for a preprint as indicated in metadata from the BioRxiv API. However, as of September 25, we found six out of a total of 1,956 COVID-19 related bioRxiv preprints (0.3%) that displayed two categories. Since this contradicted the servers' statement that "Only one subject area can be selected for an article", we omitted the additional category in our analysis. The journals' scope categories were extracted from Crossref [54].

ArticleDate@DateType="Electronic". When ArticleDate@DateType="Electronic" from PubMed was not available, we substituted it with the "created-date" from Crossref.

Before deciding on which dates to use in our studies, we carefully analyzed those used in previous studies and noted some inconsistency between different authors (Table 4).

ArticleDate@DateType="Electronic" in PubMed and/or "date-created" from Crossref.
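The date-selection rule above (prefer the PubMed electronic date, fall back to the Crossref date, and treat January 1 dates with caution) can be expressed as a short sketch. The function names `pick_online_date` and `possibly_imputed` and the sample records are hypothetical, and the dates are assumed to be already parsed into `datetime.date` objects; this is an illustration, not the code used in the study.

```python
from datetime import date


def pick_online_date(pubmed_electronic, crossref_created):
    """Prefer PubMed's electronic ArticleDate; otherwise use the Crossref date."""
    return pubmed_electronic if pubmed_electronic is not None else crossref_created


def possibly_imputed(d):
    """Flag January 1 dates, which may stand in for a missing day of publication."""
    return d is not None and d.month == 1 and d.day == 1


# Hypothetical records: (PubMed electronic date, Crossref created date).
records = [
    (date(2020, 5, 14), date(2020, 5, 15)),
    (None, date(2020, 6, 2)),
    (date(2020, 1, 1), date(2020, 3, 9)),
]

online_dates = [pick_online_date(pubmed, crossref) for pubmed, crossref in records]
flagged = [possibly_imputed(d) for d in online_dates]
```

Flagging instead of silently dropping January 1 dates keeps the decision about how to treat them, such as omitting January from a monthly plot, explicit in the analysis.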
To assess the preprint pre-submission time, we subtracted the preprint deposition date from the date the journal article was "received". To assess the review time, we subtracted the date the journal article was "received" from the date it was "accepted". To assess the production stage time, we subtracted the date the journal article was "accepted" from the date it was posted online.

Consistently throughout the pandemic, medRxiv experienced a significantly higher influx of COVID-19 preprints as compared to bioRxiv (Table 1 and Scholarly Output in SI). On average, medRxiv preprints on COVID-19 constituted 78% (SD = 2%) of combined bioRxiv and medRxiv preprints in any single month, except January, when the number of COVID-19 related medRxiv preprints was only 27% of COVID-19 related bioRxiv preprints. May was the most productive month for authors of medRxiv preprints. In June, the number of medRxiv COVID-19 preprints declined by 31%, while the number of bioRxiv preprints increased by 6%. After June, we noted a

study is defined by bioRxiv and medRxiv preprints in relation to "Journal Article" and "Review" article types in PubMed. Based on our analysis, in February, the number of COVID-19 preprints from medRxiv and bioRxiv constituted only 2% of biomedical articles on all topics, but this fraction increased to 15% in May (Fig 3). The number of peer-reviewed articles on COVID-19 has been growing since the start of the pandemic, reaching a peak in July. In contrast, the amount of peer-reviewed literature unrelated to coronavirus has been slowly declining. As a result, the fraction of COVID-19 journal articles with respect to all articles indexed in PubMed has been increasing since the start of the pandemic and reached 71% in October.
At that time, the number of COVID-19 bioRxiv and medRxiv preprints was at 9% with respect to COVID-19 peer-reviewed literature in PubMed, but this fraction was as high as 57% in February 2020. Thus, early in the pandemic, there were over half as many preprints as there were peer-reviewed articles about the newly emerged coronavirus.

We also analyzed categories for bioRxiv preprints unrelated to COVID-19 (see Categories Analysis in SI) deposited into the server during Jan 1 - Sept 30, 2020. We found that the majority

76% of all published medRxiv preprints (Fig 6). The publication rates vary across the preprint categories (Fig 7). Thus, COVID-19 preprints in the bioRxiv categories of microbiology and biochemistry display the highest publication rates of 22%.

2020, and we found them at 34% and 29%, for bioRxiv and medRxiv preprints, respectively. Despite being higher than the publication rate of 18% derived from our data in October, the reanalyzed publication rates are still low.

$$T = t_S + t_R + t_P$$

where pre-submission time ($t_S$) is the interval between preprint posting on a preprint server and its submission to the peer-reviewed journal; peer-review time ($t_R$) is the duration of the peer-review process; and production stage time ($t_P$) is the interval between the article's official acceptance statement and its publication online.

The descriptive statistics for these publication delays are summarized in Table 2 and Fig 8 and will be discussed in detail below. It is worth noting that none of the publication delays display a standard Gaussian distribution (Fig 9); thus, we discuss both their medians and means.

time between medRxiv and bioRxiv preprints is statistically significant (

COVID-19 medRxiv and bioRxiv preprints (Jan-April, 2020). Dates retrieval method is not specified.
34.24 < 0.001 1.03

We also explored whether $T$ can explain the different publication rates for preprint

preprint on a preprint server shortly after or even prior to its submission to the peer-reviewed journal. In our quest to explain the expedited publication times, we analyzed the review time ($t_R$) and the production stage period ($t_P$) (Fig 10).

We found that the mean review time ($t_R$) for COVID-19 related bioRxiv and medRxiv preprints is 43.4 days (

[67]. This discrepancy in early data is likely due to a severe skew in the frequency distribution for $t_R$. The major advance in speeding up the peer-review process was observed for PLOS ONE.

The mean production stage time ($t_P$) for COVID-19 related bioRxiv and medRxiv preprints is 14.6 days, about one third of the average $t_R$ found above for the same set of articles (Table 2). The difference in $t_P$ for medRxiv and bioRxiv preprints on COVID-19 is not significant (Table 3). As compared to the $t_P$ of 147 days reported by Björk

Fig 10).

We found that the average pre-submission time ($t_S$) for COVID-19 related preprints is 5.6 days (Table 2), a positive value implying that, on average, authors posted their manuscript to the preprint server before advancing their preprints to journal publishers (Fig 11). Authors of bioRxiv COVID-19 preprints waited longer than authors of medRxiv COVID-19 preprints, this difference being statistically significant (Table 3). The distribution of $t_S$ frequencies indicates a median of 0 days (Table 2, Fig 9). A more detailed analysis showed that 44% of the COVID-19 preprints were deposited to bioRxiv or medRxiv servers after being submitted to the journal (negative $t_S$) and only 28% of preprints were posted more than 10 days before they were submitted to the journal where they were published.
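As a worked illustration of how such delays and shares can be derived from four dates per article, consider the sketch below. It assumes the labels `t_S`, `t_R`, `t_P` for the pre-submission, review, and production intervals and `T` for their sum (days from preprint posting to online publication). The function name and the three records are hypothetical and do not reproduce the study's data.

```python
from datetime import date
from statistics import median


def publication_delays(posted, received, accepted, online):
    """Return the three delays in days and their sum for one article."""
    t_s = (received - posted).days    # negative if the journal received it first
    t_r = (accepted - received).days  # peer-review time
    t_p = (online - accepted).days    # production stage time
    return {"t_S": t_s, "t_R": t_r, "t_P": t_p, "T": t_s + t_r + t_p}


# Hypothetical records: (preprint posted, received, accepted, online).
records = [
    (date(2020, 4, 10), date(2020, 4, 5), date(2020, 5, 15), date(2020, 5, 30)),
    (date(2020, 3, 1), date(2020, 3, 1), date(2020, 4, 10), date(2020, 4, 20)),
    (date(2020, 5, 1), date(2020, 5, 21), date(2020, 7, 10), date(2020, 7, 30)),
]

delays = [publication_delays(*r) for r in records]
t_s_values = [d["t_S"] for d in delays]

share_negative = sum(v < 0 for v in t_s_values) / len(t_s_values)
share_over_10 = sum(v > 10 for v in t_s_values) / len(t_s_values)
median_t_s = median(t_s_values)
```

In this toy sample the median of `t_S` is 0 even though its mean is positive, the same pattern the skewed distributions above produce, which is why both medians and means are worth reporting.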
Our results mirror earlier findings by Anderson [69], who reported those values as 57% and 29%, respectively, for papers that had preprint analogues and were

The pre-submission time ($t_S$), in days, is plotted for bioRxiv (red) and medRxiv (blue) preprints deposited during Jan 1 - Sept 30, 2020. The 0 date is the date the preprint was submitted to the peer-reviewed journal, and a positive $t_S$ indicates that the preprint was deposited before being submitted to the journal.

We reasoned that variations in the set of journals publishing the majority of preprints could be explained by differences in elapsed times for each journal (Fig 12). Indeed, in post hoc

(Table 5).

In the previous section, we discussed journals that published the majority of bioRxiv and medRxiv preprints based on the number of preprints.

preprints and their article analogues (Fig 13). We found that the majority of COVID-19 preprints in both medRxiv and bioRxiv were published in journals whose scope is general biochemistry, genetics, and molecular biology. Additionally, microbiology preprints from bioRxiv were published in journals specialized in microbiology, infectious diseases, and virology. The latter category is currently absent from both the bioRxiv and medRxiv platforms but is listed among Scopus categories. The majority of medRxiv preprints were published in journals whose scope is general medicine. Preprints in infectious diseases and epidemiology were published in journals whose scope is infectious diseases and microbiology (medical). For both medRxiv and bioRxiv, the

To assess the visibility of COVID-19 preprints, we compared the Altmetric Attention Scores of COVID-19 related articles that had associated preprints to those that did not, and to articles unrelated to COVID-19 that were published between Jan 1 and Nov 19, 2020 (Fig 14, Table 6).
We also stratified our results by journal to eliminate a potential effect of a journal's impact factor (IF) or other journal-specific variables. For the top ten journals that published the majority of COVID-19 preprints, we found that Altmetric Attention Scores for articles that had associated preprints were slightly higher on average but not significantly different from those of articles that did not have associated preprints (Table 6)

In this paper, we explored how publication practices in biomedical sciences reacted to an emergency such as the COVID-19 pandemic. Our first focus was analyzing the usage of two major biomedical preprint servers, bioRxiv and medRxiv. Following the deposition of the first preprint on a "novel coronavirus" in mid-January 2020 [56], preprint submissions to these two platforms increased rapidly. Submissions of new coronavirus related preprints reached 10 to 20 per day by February and increased to about 150 per day by May (Fig 1). In addition to this incredible flow of

The answer to this question lies within the trends in the most active fields in each preprint server. Our analysis revealed that the majority of COVID-19 related preprints in bioRxiv were deposited in those fields that are most relevant to coronavirus research, such as microbiology, bioinformatics, and immunology (Fig 4A)

preprints. Our analysis of publication delays yielded a median pre-submission time ($t_S$) of 0 days for published COVID-19 preprints, implying that preprints were submitted to preprint servers and to journals simultaneously (Fig 9 and Fig 11). A more detailed analysis showed that only 28% of preprints were deposited into servers more than 10 days prior to journal submission.

Journal champions varied at various moments throughout the pandemic, which was found to be related to variations in publication times among the journals (Fig 12). For example, it took the
For example, it took the an earlier observation that a publication peak for COVID-19 preprints in May transfers to the 834 summit in July for journal article publications (Fig 2) . Complementary efforts of preprint servers and scholarly journals to disseminate knowledge 836 promptly, while differentiating reliable and important findings from those that may be misleading 837 attest to the upmost relevance of COVID-19 topic during 2020, as evident from Altmetric Attention Scores for COVID-19 research articles (Fig 14) . In summary, our analysis showed that early in pandemic, preprints were prevailing in 852 disseminating findings on the topic of the public health emergency. Preprint authors deposited 853 them into fields previously underrepresented on bioRxiv or medRxiv servers but those that were consultations. We also thank Dr. Oscar Tutusaus for his assistance with manuscript editing. Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical 882 literature The limitations to our understanding of peer review 887 bioRxiv: the preprint server for biology New preprint server allows earlier sharing of research methods and findings Arxiv.org/help/stats ArXiv usage statistics A 894 systematic examination of preprint platforms for use in the medical and biomedical sciences 895 setting Preprints as a Hub for Early-Stage Research Outputs Technical and social issues influencing the adoption of preprints in the 901 life sciences The Internet and unfrefereed scholarly publishing Preprints could promote confusion and distortion Peer 911 review and preprint policies are unclear at most major journals. PLOS ONE Scientific Community. 
Preprints for the life sciences The Preprint Dilemma Ten simple rules to consider regarding preprint submission On the value of preprints: An early career researcher perspective Slavov Lab Why I love preprints Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles? Releasing a Preprint is Associated with More Attention and Citations for the Peer-Reviewed Article. Elife Rise of the Rxivs: How Preprint Servers are Changing the Publishing Process When and how to comply Jisc scholarly communications [Internet]. Perspectives on the open access discovery landscape 27906975; (b) Barrett SCH. Proceedings B 2017: the year in review Debunking Bad COVID-19 Research Proliferation of Papers and Preprints During the Pandemic: Progress or Problems With Peer Review Coronavirus disease 2019: the harms of exaggerated information and non-evidence-based measures Advancing scientific knowledge in times of pandemics New from eLife: Invitation to submit to Preprint Review Publishing in the time of ProMED International Society for Infectious Diseases Undiagnosed Pneumonia - PubMed PMID: 32108094; (b) Callaway E. The COVID-19 crisis could permanently change scientific publishing Preprints can fill a void in times of Timeline of WHO's response to COVID-19 Research methodology and characteristics of journal articles with original data, preprint articles and registered clinical trial protocols about COVID-19 Tracking the popularity and outcomes of all bioRxiv preprints Committee on Publication Ethics Database of COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv Sorting biology preprints using social media and readership metrics Entrez Programming Utilities Help National Center for Biotechnology Information (US) Dimensions API CORD-19 - COVID-19 Open Research Dataset Cloud-Based Solution for Big Bibliographic Data Research in Academic Libraries.
Frontiers in Big Data Repositories and preprint servers tracked by Altmetric Real-world query used to extract publications related to COVID-19 Tracking the popularity and outcomes of all bioRxiv preprints Time to Acceptance of 3 Days for Papers about COVID-19 These subject categories only appear in the REST API of Crossref ASJC codes and are applied at the journal-level, by association with the journal's Potently neutralizing and protective human antibodies against SARS-CoV-2 Using text mining to track outbreak trends in global surveillance of emerging diseases: ProMED-mail Biology preprints over time Publication Types How many preprints have actually been printed and why: a case study of computer science preprints on arXiv Preprints: An underutilized mechanism to accelerate outbreak science Horbach SPGM. Pandemic Publishing: Medical journals drastically speed up their publication process for COVID-19 Preprinting the COVID-19 pandemic Available from: Wikimedia Commons; (b) Designed by Freepik from Flaticon. Coronavirus, CC-BY Available from: Adioma.com; (d) Kyle Scott. Glasses, CC-BY (f) Designed by srip from Flaticon. Read, CC-BY The relationship between bioRxiv preprints, citations and altmetrics Does it take too long to publish research? The publishing delay in scholarly peer-reviewed journals Pandemic publishing poses a new COVID-19 challenge. Nat Hum Behav Analysis for "the history of publishing delays Trends and analysis of five years of preprints. Learned Publishing arXiv E-Prints and the Journal of Record: An Analysis of Roles and Relationships Altmetric Scores, Citations, and Publication of Studies Posted as Sources of Attention Sharing data during Zika and other global health emergencies Editorial Evaluation and Peer Review During a Pandemic: How Journals Maintain Standards Supporting Information