title: Characteristics of Imperial College London's COVID-19 research outputs
authors: Price, Robyn; Ozkan, Yusuf
date: 2021-01-12
journal: Learn Publ
DOI: 10.1002/leap.1358

We identified 651 research outputs on the topic of COVID-19 in the form of preprints, reports, journal articles, datasets, and software/code published by Imperial College London authors between January and September 2020. We sought to understand the distribution of outputs over time by output type, peer review status, publisher, and open access status. Searches of Scopus, the institutional repositories, GitHub, and other databases identified relevant research outputs, which were then combined with Unpaywall open access data and manually verified associations between preprints and journal articles. Reports were the earliest output type to emerge [median: 103 days, interquartile range (IQR): 57.5–129], but journal articles were the most commonly occurring output type over the entire period (60.8%, 396/651). Thirty preprints were identified as connected to a journal article within the set (15.8%, 30/189). A total of 52 publishers were identified, of which 4 publishers account for 59.6% of outputs (388/651). The majority of outputs were available open access through the gold, hybrid, or green route (66.1%, 430/651). The presence of exclusively non-peer-reviewed material from January to March suggests that demand could not be met by journals in this period, and the sector supported this with enhanced preprint services for authors. Connections between preprints and published articles suggest that some authors chose to use both dissemination methods and that, as some publishers also serve across both models, traditional distinctions between output types might be changing. The bronze open access cohort brings widespread 'free' access but does not ensure true open access.

The novel coronavirus (SARS-CoV-2), the disease it causes (COVID-19), and its implications for society have been described as the fastest-moving production of knowledge in our time (Kupferschmidt, 2020) and are estimated to have resulted in tens of thousands of papers produced in a 6-month period (Teixeira da Silva, Tsigaris, & Erfanmanesh, 2020). Researchers at Imperial College London, a large, research-intensive science, technology and medicine university with substantial biomedical and public health expertise, began sharing research on the topic in January 2020 (World Health Organization, 2020b). The forecasts of one report (Ferguson et al., 2020) were widely cited as having changed multiple national governments' responses to the pandemic (Bruce-Lockhart, Burn-Murdoch, & Barker, 2020; Landler & Castle, 2020; Boseley, 2020). This output received phenomenal media and online attention (https://www.altmetric.com/details/77704842). Many other researchers and groups at Imperial have produced COVID-19 research in a variety of formats and open access models. We sought to understand the quantity and characteristics of all of Imperial's contributions to COVID-19 research, both to provide data for the institution to understand its outputs and to provide an institutional cohort perspective to complement the global, output-level analyses in other studies of COVID-19 research (Di Girolamo & Meursinge Reynders, 2020; Fraser et al., 2020; Helliwell et al., 2020; Shuja, Alanazi, Alasmary, & Alashaikh, 2020; Teixeira da Silva et al., 2020).
The institution's commitment to 'consider the value and impact of all research outputs (including datasets and software) in addition to research publications' (SF DORA, 2012) as a signatory of the San Francisco Declaration on Research Assessment led us to adopt the widest interpretation of research outputs that was still feasible to collect using bibliographic and data search methods, resulting in journal articles, preprints, reports, datasets, and software/code forming the dataset. We sought to understand the volume and characteristics of research from Imperial College London on the novel coronavirus in the publication period 1st January to 30th September 2020. The following research aims were identified:
• Identify the volume of publications and the distribution over the time period by different research output types.
• Determine what proportion of preprints went on to be published as journal articles and the average time for this.
• Identify open access trends.
• Demonstrate the distribution of outputs between publishers.

This was a cross-sectional study of Imperial College London-authored research outputs related to COVID-19. The data were extracted in October 2020. The search strategy is described in Supplementary data file 1 'Search Strategy'. For all steps, the search terms used were '2019-nCoV', 'COVID-19', 'SARS-CoV-2', or 'coronavirus'. Where no formal publication date existed, the earliest date found in the repository referring to the release of, or any documented action on, the output was taken as a proxy publication date. Anonymous authorship practices in software communities introduce uncertainty about authorship and institutional affiliation, so outputs identified from non-institutionally managed repositories were manually verified to have Imperial authors before inclusion. Multiple versions of the same software/code published in the same repository were considered as one entity, dated to their earliest found version. Multiple versions of the same preprint that shared a common DOI were counted as a single output, but versions with different DOIs or hosted on different servers or repositories were counted as individual outputs. We could not find a systematic way to identify preprints that also existed as journal articles, so we identified these connections manually by similarity of title and author composition (see the illustrative sketch below). For preprints, we chose to move the contents of the Unpaywall 'publisher' field into the 'journal name' field and to enter the owner of the server manually into the 'publisher' field; for example, 'journal name' becomes 'medRxiv' and 'publisher' becomes 'Cold Spring Harbor Laboratory'.

A total of 651 outputs were identified from the search. These included journal articles, preprints, software/code, reports, and datasets; see Table 1 for full details. Month-on-month change in the volume of publication was observed across the period, with some instances of no change (Fig. 1). Assuming the first instance of a publication (a report) to be Day 1, reports were the earliest output type to emerge [median: 103 days, IQR: 57.5–129] (Fig. 2). Classification of outputs as peer reviewed (PR) and non-peer reviewed (NPR) revealed that outputs from January to March were exclusively NPR, but across the entire time period the majority of outputs were PR (60.8%, 396/651) (Fig. 3):
• January (NPR 100%, 5/5),
• February (NPR 100%, 6/6),
• March (NPR 100%).
Thirty preprints were identified as later resulting in a journal article publication (15.8%, 30/189).
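The connections between preprints and journal articles described above were identified manually by comparing titles and author composition. The following is a minimal sketch of how such a title-and-author heuristic could be used to shortlist candidate pairs for that manual check; the record structure, function names, and similarity thresholds are illustrative assumptions rather than the authors' actual procedure.

```python
# Illustrative sketch only: hypothetical record structure and thresholds,
# not the procedure used in the study (which was manual).
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def surname_overlap(authors_a, authors_b):
    """Jaccard overlap of author surnames between two outputs."""
    a = {s.lower() for s in authors_a}
    b = {s.lower() for s in authors_b}
    return len(a & b) / max(len(a | b), 1)


def likely_same_work(preprint, article, title_threshold=0.9, author_threshold=0.5):
    """Flag a preprint/article pair as a candidate match when both the titles
    and the author compositions are highly similar."""
    title_similarity = SequenceMatcher(
        None, normalise(preprint["title"]), normalise(article["title"])
    ).ratio()
    return (
        title_similarity >= title_threshold
        and surname_overlap(preprint["authors"], article["authors"]) >= author_threshold
    )


# Toy example records (not real outputs from the dataset):
preprint = {"title": "Estimating transmissibility of SARS-CoV-2", "authors": ["Price", "Ozkan"]}
article = {"title": "Estimating the transmissibility of SARS-CoV-2", "authors": ["Price", "Ozkan"]}
print(likely_same_work(preprint, article))  # True: shortlist for manual confirmation
```

Any pair flagged in this way would still be confirmed by hand, as was done in the study.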
The median time between the preprint publication and the journal article publication was 60 days (IQR: 25–82.25 days) (Fig. 4). The distribution of open access status by output type is shown in Fig. 7; notably, all outputs available as bronze open access were journal articles (100%, 199/199). Creative Commons licences were observed across journal articles (38%, 152/396), preprints (65%, 122/189), software/code (17%, 5/29), reports (100%, 29/29), and datasets (75%, 6/8). Across all output types, the most popular variation of the Creative Commons licence was CC BY, the least restrictive Creative Commons licence, used by 151 outputs overall (23%, 151/651).

Although the majority of outputs over the entire time period were journal articles, the exclusive presence of NPR output types (reports and preprints) between January and March, which were not surpassed by PR content until May, suggests that authors needed a faster form of dissemination than journals could offer in the early months of the coronavirus pandemic (Kupferschmidt, 2020), much like those working in other global health emergencies (Zhang, Zhao, Sun, Huang, & Glänzel, 2020). As authors chose to disseminate research in preprint form, the sector responded. PubMed Central adapted to include coronavirus preprints (www.ncbi.nlm.nih.gov/pmc/about/nihpreprints/), and other existing preprint servers adapted to prioritize this research or were established solely for the crisis (Lu Wang et al., 2020). Journal publishers also responded to the crisis: a decrease in the number of days between submission and publication has been observed among some medical journals publishing on the topic (Horbach, 2020), and publishers have announced reductions in peer review times (Redhead, 2020). However, whether the likely contradictory demands of reducing peer review and editorial time whilst retaining quality (Kwon, 2020) are sustainable or achievable is yet to be evaluated in the long term. There is some indication that this pressure is changing journal publisher attitudes to preprints, as seen in the explicit encouragement of preprints on the topic at The New England Journal of Medicine (Rubin, Baden, Morrissey, & Campion, 2020), the reference to the pandemic as a reason for The Lancet's decision to make its 'Preprints with the Lancet' SSRN platform permanent in September 2020 (Kleinert & Horton, 2020), and the introduction of a default preprint policy for COVID-19 submissions at eLife (Eisen, Akhmanova, Behrens, & Weigel, 2020). Publication platforms such as Wellcome Open Research and F1000 further disrupt traditional distinctions in the journal and peer review process.

As preprints shift closer to the centre of established scholarly communications, either the infrastructure and data standards supporting them need to develop, or bibliographic tools need to adapt to accommodate them. The complicated method of preprint data collection in this study (searching the institution's CRIS records, a search function only available to administrators at the institution, and then supplementing this with a second search on Dimensions) was used because, although some databases index preprints (Europe PMC, Dimensions), the contributor affiliation data associated with preprints are not of sufficient quality, or sufficiently widespread, to enable a comprehensive search with verified affiliation. This is not a fault of the databases but rather reflects a dependency on structured and parsable metadata from preprint servers that is not always available.
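To illustrate what an affiliation-filtered preprint search can look like in practice, the sketch below queries the Europe PMC REST search API using its AFF (affiliation) and SRC (source) query fields together with the study's search terms. The endpoint, query fields, and response structure reflect our reading of the public Europe PMC documentation rather than anything stated in the article, and, for the reasons just described, any hits would still need manual verification of Imperial authorship.

```python
# Sketch of an affiliation-filtered preprint search against Europe PMC.
# Endpoint and query fields are assumptions based on the public Europe PMC
# REST API documentation, not part of the study's documented method.
import requests

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

query = (
    '(AFF:"Imperial College London") AND '
    '("COVID-19" OR "SARS-CoV-2" OR "2019-nCoV" OR "coronavirus") AND '
    "(SRC:PPR)"  # PPR restricts results to preprint records
)

response = requests.get(
    EUROPE_PMC,
    params={"query": query, "format": "json", "pageSize": 100},
    timeout=30,
)
response.raise_for_status()

results = response.json().get("resultList", {}).get("result", [])
for record in results:
    # Each candidate still needs manual verification of Imperial authorship,
    # because preprint affiliation metadata is often sparse or unstructured.
    print(record.get("firstPublicationDate"), record.get("title"))
```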
A further obstacle is the lack of accessible methods through which to search for connected preprints and published journal articles, perhaps again due to missing associated metadata identifiers; this prevents large-scale or automated data collection and requires associations to be identified manually, as was done in this study.

The presence of 52 publishers indicates that authors are served with competitive options from which to choose their preferred outlet for dissemination and are safeguarded against 'lock-in' to any one provider. Whilst the majority of publishers predominantly serve one output type (e.g. journal publishers serving journal articles), some are represented across more than one type; for example, the institutional repository publishing as 'Imperial College London' is represented amongst datasets (1), preprints (1), and reports (29). This could be a positive indicator that artificial distinctions in the research life cycle are being replaced with more holistic solutions that offer dissemination for all outputs of research. However, others have raised concern that the representation of commercial publishers across output types poses a threat to equity and value in the research production cycle (Posada & Chen, 2018). The acquisition of preprint servers by commercial publishers (Elsevier's acquisition of SSRN in 2016 and Wiley's acquisition of Authorea Inc. in 2018) contributed to their combined preprint and journal article shares in our set (Elsevier 19% and Wiley 12%).

That 100% of papers were published open access in the first 3 months of the pandemic suggests an author preference for this model in this period. However, considering that all of these outputs were non-peer-reviewed (NPR) types (preprints, reports, and software/code), it is difficult to argue robustly that these outputs were open access as a conscious choice rather than as a consequence of the NPR output type. There are examples across the entire time period of NPR outputs published closed access (the SSRN preprints, considered closed due to their membership log-in wall, and one item of software set to internal (private) view in the Imperial GitHub repository), but their presence is small (1.4%, 9/651).

FIGURE 7 Distribution of OA status by output type.
FIGURE 8 Open access licence breakdown by output type. Note that due to licensing data irregularities, licence does not correspond directly to OA status. Bronze and closed-access outputs excluded.

Publisher intervention to convert content to bronze open access is positive but has limitations: the access is not ensured in perpetuity and could be revoked in the future (Elsevier, 2020), and the conditions of rights are not consistently clarified. Areas of particular need in this crisis that free access alone does not ensure are machine access for text and data-mining purposes, needed to apply artificial intelligence and machine-learning techniques to COVID-19 research (Shuja et al., 2020), and translation rights for dissemination in a global public health event.
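The open access statuses and licences reported above derive from Unpaywall data. As a minimal sketch of how a single record can be inspected, assuming the public Unpaywall v2 REST API (the DOI used below is this article's own and the contact email is a placeholder), the snippet retrieves the overall oa_status, the licence of the location Unpaywall designates as 'best', and the host type of every known location.

```python
# Minimal sketch using the public Unpaywall v2 REST API; the email address is a
# placeholder and field names follow the published Unpaywall data format.
import requests


def unpaywall_record(doi, email="you@example.org"):
    """Fetch the Unpaywall record for a single DOI."""
    response = requests.get(
        f"https://api.unpaywall.org/v2/{doi}", params={"email": email}, timeout=30
    )
    response.raise_for_status()
    return response.json()


record = unpaywall_record("10.1002/leap.1358")

# Headline open access status (gold, hybrid, green, bronze, or closed) and the
# licence attached to the location Unpaywall selects as 'best'.
print(record["oa_status"])
best = record.get("best_oa_location") or {}
print(best.get("license"))

# All known locations: a repository-hosted copy may exist even when the 'best'
# location is a publisher-hosted bronze copy without an explicit licence.
for location in record.get("oa_locations", []):
    print(location.get("host_type"), location.get("license"), location.get("url"))
```

Inspecting every oa_location rather than only the best one is relevant to the green versus bronze distinction discussed below, since a self-archived repository copy can sit behind a publisher-hosted bronze classification.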
This study of a single institution's outputs was undertaken with an awareness that Imperial is not the largest contributor by publication volume to COVID-19 research (Hook & Porter, 2020) and is clearly not the only institution to have produced impactful results. Despite suggestions that the pressures of adapting research practices to accommodate lab closures and the demand for rapid results led to smaller teams and fewer international collaborative partners in the early months of the pandemic (Fry, Cai, Zhang, & Wagner, 2020), we understand that coronavirus research demands collaboration at every level (Apuzzo & Kirkpatrick, 2020) and that any institutional-level analysis should be interpreted in relation to organisation size, mission, and resources. We recognize the limitations of comparing output types without adjusting for their characteristics or context. For example, comparing the publication times of journal articles and preprints is not truly fair, given the vastly different time investments each type requires; nor is comparing the open access models of output types that are mandated to be open access (e.g. articles) with those that are not (preprints, datasets, reports, software/code). The green open access share of the data may underrepresent the true number of self-archived articles, an action mandated by the institution's open access policy. This is because outputs are only classified as green when there is no publisher-hosted option available (Piwowar et al., 2018), so it is possible that some of the bronze open access items also exist as repository-archived green open access copies, but the Unpaywall hierarchy gives authority to the publisher-hosted bronze version in classification.

Authors were served with options to publish rapidly in non-peer-reviewed form and under open access models throughout the entire period, and from January to March these options were used exclusively. Across the entire period, however, the most commonly observed output type was the journal article. The association of some preprints with journal articles suggests that the status of peer review versus non-peer review is, for some outputs, not binary. This increasing connectedness between the two can also be seen in the presence of publishers serving across both types. That the majority of outputs were published under some form of open access is positive; however, whether the bronze OA cohort is truly compliant with the long-term needs of this global challenge (World Health Organization, 2020a; Wellcome, 2020) is not clear. The inclusion of reports, preprints, datasets, and software/code as output types permits a richer and more accurate description of the institution's activities and talents than considering journal articles alone. There is a need for bibliographic methods to adapt to better identify and classify these valuable non-journal output types.

Covid-19 changed how the world does science, together. The New York Times.
New data, new policy: Why UK's coronavirus strategy changed. The Guardian.
The shocking coronavirus study that rocked the UK and US. The Financial Times.
Bronze, free, or fourrée: An open access commentary.
Characteristics of scientific articles on COVID-19 published during the initial 3 months of the pandemic.
Peer review: Publishing in the time of COVID-19. eLife, 9.
Elsevier gives full access to its content on its COVID-19 Information Center for PubMed Central and other public health databases to accelerate fight against coronavirus.
Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand.
Preprinting the COVID-19 pandemic.
Consolidation in a crisis: Patterns of international collaboration in early COVID-19 research.
Global academic response to COVID-19: Cross-sectional study.
How COVID-19 is changing research culture.
Pandemic publishing: Medical journals drastically speed up their publication process for Covid-19.
Preprints with the Lancet are here to stay.
Coronavirus outbreak changes how scientists communicate.
How swamped preprint servers are blocking bad coronavirus research.
Behind the virus report that jarred the U.S. and the U.K. to action. The New York Times.
CORD-19: The Covid-19 open research dataset.
The state of OA: A large-scale analysis of the prevalence and impact of open access articles.
Inequality in knowledge production: The integration of academic infrastructure by big publishers.
What do the types of oa_status (green, gold, hybrid, and bronze) mean? Unpaywall.
Scholarly publishers are working together to maximize efficiency during COVID-19 pandemic.
Medical journals and the 2019-nCoV outbreak.
San Francisco Declaration on Research Assessment.
COVID-19 open source data sets: A comprehensive survey. Applied Intelligence.
Publishing volumes in major databases related to Covid-19.
Sharing research data and findings relevant to the novel coronavirus (COVID-19) outbreak. Wellcome.
Solidarity Call to Action: Making the response to COVID-19 a common public good.
WHO director-general's statement on IHR emergency committee on novel coronavirus (2019-nCoV).
How scientific research reacts to international public health emergencies: A global analysis of response patterns.

The authors acknowledge advice from a colleague and Alonso Alvarez, both of Imperial College London, on including datasets and software/code in the analysis. Both authors declare that they are employees of Imperial College London, UK. Data collected for this study are available at https://doi.org/10.5281/zenodo.4269922. Additional supporting information may be found online in the Supporting Information section at the end of the article: File S1. Supplementary Information.