key: cord-171089-z4oya6kz authors: Liu, Meijun; Bu, Yi; Chen, Chongyan; Xu, Jian; Li, Daifeng; Leng, Yan; Freeman, Richard Barry; Meyer, Eric; Yoon, Wonjin; Sung, Mujeen; Jeong, Minbyul; Lee, Jinhyuk; Kang, Jaewoo; Song, Min; Zhai, Yujia; Ding, Ying title: Can pandemics transform scientific novelty? Evidence from COVID-19 date: 2020-09-26 journal: nan DOI: nan sha: doc_id: 171089 cord_uid: z4oya6kz Scientific novelty is important during the pandemic due to its critical role in generating new vaccines. Parachuting collaboration and international collaboration are two crucial channels to expand teams' search activities for a broader scope of resources required to address the global challenge. Our analysis of 58,728 coronavirus papers suggests that scientific novelty measured by the BioBERT model that is pre-trained on 29 million PubMed articles, and parachuting collaboration dramatically increased after the outbreak of COVID-19, while international collaboration witnessed a sudden decrease. During the COVID-19, papers with more parachuting collaboration and internationally collaborative papers are predicted to be more novel. The findings suggest the necessity of reaching out for distant resources, and the importance of maintaining a collaborative scientific community beyond established networks and nationalism during a pandemic. Newton developed the basis for his groundbreaking work during the Great Plague, having far-reaching impacts on classical physics and many other domains. The experience of Newton has been raised repeatedly in the 2020 context of the global COVID-19 pandemic. Will scientists be more novel during the pandemic like Newton? Truly breakthrough ideas rarely occur overnight, while novelty is sometimes sparked by extreme time pressure and urgent needs. These two points of views provide contrasting possibilities regarding the association between scientific novelty and the disruption of a pandemic. 
As the seed of innovation (1), scientific novelty could be considered the recombination of prior knowledge components in an unfamiliar or atypical fashion (2-4). Novel research is more likely to advance the frontier of scientific discoveries, and becomes more important than ever during the pandemic because of the urgent need for new vaccines for public health. Scientific teams' search activities that are important for scientific novelty might be reshaped during the pandemic, while it is unknown whether parachuting collaboration and international collaboration, which are closely associated with teams' "search space", are accelerated or reversed. During the pandemic, the lack of access to resources that might be only available in special localities motivates scientific teams to overcome the constraints of localized search by collaborating outside their established networks and across national borders (5). In addition, the novel global challenge and the urgent need for effective vaccines might encourage the adjustment of team assembly towards effective teamwork that produces new ideas by including newcomers beyond team members' pre-existing relationships and reaching international networks (6-8). Given the importance of international collaboration and parachuting collaboration (measured by the fraction of team members without prior collaboration in a team) in reaching out for distant resources and producing novel knowledge, these two types of collaboration patterns might increase during the pandemic. However, the urgent pandemic situation might lead to a reduction in search and outreach (5) and increased costs of communication and coordination, which would cause a decline in these two collaboration patterns. A disaster of global scale, as of 13 September 2020, COVID-19 has infected at least 28.6 million people, proving deadly to 917,000 individuals.
We use this unexpected outbreak as a natural experiment and find that (1) coronavirus research became far more novel during the pandemic; (2) scientific teams involved more parachuting collaboration, defined as collaboration between two authors without prior collaboration, while international collaboration suddenly decreased; and (3) during the pandemic, papers with a higher parachuting ratio and internationally collaborative papers are predicted to be more novel. Building on the "knowledge recombination" theory (4) and the combinatorial perspective of novelty (3, 9), we assess papers' scientific novelty by quantifying how extraordinary a combination of bio-entities is in a coronavirus-related paper using BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) (10), a language model that is pre-trained on 29 million PubMed articles. In medicine, knowledge, represented by bio-entities (e.g., drug, disease, and gene) in publications (11), is combined to form new ideas; uncommon combinations of bio-entities form especially novel ideas (12). In this study, when the distance between two bio-entities is in the upper 10th percentile of the distribution of the distances of all entity pairs extracted from 58,728 coronavirus research articles, the combination of the two bio-entities is considered novel. A paper's novelty score is measured by the fraction of novel entity combinations extracted from the paper, ranging between 0 and 1. The higher a paper's novelty score, the more novel entity combinations the paper contains (see footnote 2). The framework and an illustrative example of calculating papers' novelty scores are shown in Fig. S9 in the Supplementary Material. The details of quantifying papers' novelty regarding entity combinations are shown in Measuring Scientific Novelty of Papers in the Supplementary Material.
We treat the outbreak of COVID-19 as a natural experiment to explore how scientific novelty, parachuting collaboration and international collaboration evolve during such a disaster. We use a difference in differences (DID) approach based on 11,678 coronavirus articles published from January 2018 to April 2020 by the top 50 prolific countries ranked by the number of coronavirus papers published during the study period. The details of the data source and sample selection are shown in the section of Data and Sample in the Supplementary Material. We examine the association between the monthly change in scientific novelty, parachuting collaboration ratio and international collaboration of coronavirus papers by the 50 sampled countries and their status as a confirmed COVID-19 infection site, by month, from January 2018 to April 2020. Variables and the DID strategy are reported in the sections of Variables and Method: a Difference in Differences Strategy, respectively, in the Supplementary Material.

Our findings suggest that coronavirus research has become more novel since the outbreak of COVID-19. After 2019, the year of the COVID-19 outbreak, there is a dramatic increase in the average novelty score of global coronavirus research relative to the earlier years (see Fig. 1(A)). Since the first global COVID-19 case was officially confirmed in December 2019, the average novelty score of global coronavirus papers has risen sharply (see Fig. 1(B)). The results of the DID regression show that "treated" countries (i.e., countries with an infection) have a 0.038 (p < 0.01) higher novelty score than "untreated" countries (i.e., countries without infection), an increase of 28.51% of a standard deviation (see column 1 in Table S8 in the Supplementary Material). The estimated dynamic impact of a COVID-19 outbreak in a country on the country's scientific novelty score of coronavirus literature is shown in Fig. 2(A), which illustrates a jump in countries' average novelty scores in the first month (i.e., t+1, where t refers to the month the first COVID-19 case was confirmed in a country) after the first occurrence of a COVID-19 case in a country, while there is no significant difference between treated and untreated countries before the first COVID-19 case in the country. The regression results show that more COVID-19 cases and deaths in a month predict higher scientific novelty (see Table S9 in the Supplementary Material), suggesting that the increased scientific novelty might be associated with the severity of the local outbreak.

Footnote 2: Creative ideas could be reflected by two dimensions, namely usefulness and novelty. Novelty is the key distinguishing feature of creativity beyond ideas that are well conceived. The notion that considers novelty as a process of recombination has been shown to be valuable, while some have criticized that many unusual configurations might be worthless. In this study, we only consider the aspect of novel recombination of knowledge because, for recently published papers, too little time has passed to receive citations from subsequent research. We conduct analyses of the relationship between coronavirus papers' novelty and citations, and do not find that papers' novelty is significantly negatively related to the citations papers gained, which suggests that, at least, novel papers according to our definition are not useless papers. The details of the relationship between papers' novelty scores and the citations papers received are illustrated in the section of The Relationship between Papers' Novelty Scores/Teams' Characteristics and Citations in the Supplementary Material.

After the first global COVID-19 case, Fig. 1(B) presents a sudden decrease in global coronavirus papers' international collaboration ratio.
DID estimates suggest that countries' parachuting collaboration ratio increased by 3.1 percentage points (coefficient: 0.031, p < 0.01 in column 3 in Table S8 in the Supplementary Material) after the report of the first COVID-19 case in a country. This suggests that after the first case is confirmed in a country, more parachuting collaboration is found in that country's coronavirus research. We further find that a country's proportion of internationally collaborative papers in coronavirus research shrank by 9.2 percentage points (coefficient: -0.092, p < 0.01 in column 5 in Table S8 in the Supplementary Material) after the occurrence of the first COVID-19 case in the country. The dynamic impact of the first COVID-19 case in the country on its average parachuting collaboration and international collaboration ratio is estimated in columns 4 and 6 in Table S8 in the Supplementary Material, respectively, and is illustrated in Fig. 2. We also observe a sudden change in scientific novelty and in the international and parachuting collaboration ratios around the year of the outbreak of SARS, in the same direction as we find during COVID-19 (see Fig. 1(A)).

Fig. S10(A) in the Supplementary Material illustrates papers' novelty scores estimated by a regression model including interaction terms between papers' parachuting collaboration ratio or international collaboration and the occurrence of the first global COVID-19 case. It suggests that before COVID-19, papers' parachuting collaboration ratio is significantly negatively related to papers' novelty scores. However, this relationship turns significantly positive for papers published during COVID-19. This pattern holds for the association between papers' international collaboration and novelty scores (Fig. S10(B) in the Supplementary Material). The subsample analyses also confirm these findings (see columns 1 and 2 in Table S10 in the Supplementary Material).
The methods used to conduct the sub-sample analyses and the regression analyses including the interaction terms between papers' parachuting collaboration ratio/international collaboration and the occurrence of the first global COVID-19 case are shown in Sub-sample Analyses and Regression including Interaction Terms in the Supplementary Material.

Our results show that in the initial period following a coronavirus outbreak, scientific novelty dramatically increased, which suggests scientists' efforts to try novel recombinations of existing knowledge to combat this global pandemic. The fraction of parachuting collaboration, i.e., collaboration between team members without prior collaboration, in the scientific teams of coronavirus research grew, while the proportion of internationally collaborative papers sharply decreased. In the pre-COVID-19 period, parachuting collaboration is significantly negatively associated with a paper's novelty score, while this relationship turns significantly positive during the pandemic. Teams' characteristics are important determinants of team efficiency and the production of novel knowledge. Parachuting collaboration, the foil of repeat collaboration, entails both advantages and costs that influence teams' novelty, making its contribution to scientific novelty not straightforward. Unlike repeat collaboration, parachuting collaboration involves more search, coordination and innovation costs, and less risk-sharing, trust and reciprocity, which might dampen scientific novelty (5, 6, 13). However, parachuting collaboration allows pooling together a broader scope of information, data and resources outside the pre-existing relationships and conflicts that might improve scientific novelty (7, 14).
During the pandemic, papers produced by teams with a larger proportion of parachuting collaboration are more novel, which suggests the greater importance of a broader scope of search activities and quick access to non-local information that is only available outside teams' pre-existing networks in tackling global challenges in a timely manner. We find that internationally collaborative papers are more novel than their counterparts during the pandemic. International collaboration tends to produce more conventional knowledge combinations, since transaction costs and communication barriers to international collaboration might hinder novelty (8). This is consistent with what we find in the normal science period. However, during COVID-19, producing novel knowledge might require collaborative efforts across national borders that pool global resources more than ever. The best example is the discovery of the causative agent of SARS, a result of close international collaboration among 13 laboratories from 10 countries.

Most science of science studies assume that the research system operates with institutional stability, in the framework of "normal science" (15). With rapidly developing globalization and the increasing complexity of economic, societal, political and environmental issues, the traditional perception of normal science is no longer sufficient to address issues or problems in the scientific community. Local and even global research systems could be immediately influenced by exogenous and unexpected events. This study provides evidence on how science progresses differently during a pandemic than in a normal science period.

The left vertical axis in each sub-figure indicates the novelty score of papers and the right one refers to the parachuting/international collaboration ratio. In sub-figure B, the study period is from January 2018 to April 2020, with a total of 28 months.
The first global COVID-19 case was officially reported in the 24th month, December 2019; the number on the X-axis indicates the nth month since the start of the study period (Jan 2018). The solid and dashed lines indicate the actual value and the predicted value of variables based on the trend of variables before December 2019, respectively. The purple and blue dashed lines indicate the time series prediction of the parachuting and international collaboration ratios, respectively. The orange dashed line refers to the predicted values for novelty score after a linear regression where a country's monthly novelty score is the dependent variable, and explanatory variables include the country's parachuting/international collaboration ratio, team size and productivity, with all values of explanatory variables in and after December 2019 replaced by their time-series predicted values. In this way, we construct a counterfactual-like framework showing how the novelty score would evolve if COVID-19 had not occurred, in other words, if all explanatory variables followed their pre-COVID-19 trends after December 2019. The shaded areas represent upper and lower bounds of 95% CIs.

Fig. 2. The DID estimates of the relationship between the occurrence of the first case of COVID-19 in the country and countries' average novelty scores, parachuting collaboration ratio and international collaboration ratio in a month. t-n indicates n month(s) before the month (t0) when the first COVID-19 case was confirmed in the country, and t+n indicates n month(s) after t0. ***, ** and * represent significance at the 1%, 5%, and 10% level. The shaded areas represent upper and lower bounds of 95% CIs.

Figs. S1-S10, Tables S1-S10, References

Two major datasets are used in this study: one includes publication data on coronavirus research and is used to measure an individual paper's scientific novelty and capture authors' country information, and the other includes country-by-country patient data about COVID-19 and is used to identify the timing when the first COVID-19 case is confirmed in a country. Publication data on coronavirus research is collected from the COVID-19 Open Research Dataset (hereafter CORD-19), which covers 58,728 research articles about COVID-19 and related historical coronaviruses, such as SARS and MERS, that were published between 1951 and April 2020. This dataset includes title, abstract, author name, DOI, PubMed ID, and publication date. It is constructed by the Allen Institute for AI and other leading research groups to help researchers discover relevant information more quickly from the literature. CORD-19 papers are sourced from PubMed Central, bioRxiv and medRxiv, with titles, abstracts or full text including the following keywords: "COVID-19" OR "Coronavirus" OR "Corona virus" OR "2019-nCoV" OR "SARS CoV" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome". The CORD-19 dataset has recently been used in analyzing coronavirus literature (1-4), and is viewed as a reliable data source to map coronavirus-related research. The distribution of papers per year in CORD-19 is illustrated in Fig. S1(A), which indicates a sudden growth of papers in the years of significant pandemics. We identify authors' country information based on authors' address information provided by the CORD-19 dataset and the 29 million PubMed dataset that covers 1800-2020 with author names disambiguated (5).
Based on the DOI and PubMed ID provided in the CORD-19 dataset, 51,726 CORD-19 papers are linked to their versions in the PubMed dataset, where all author names have been disambiguated, and thus the following information on CORD-19 papers was obtained: authors' unique identifiers and authors' address information. Authors' unique identifiers allow us to know whether authors of a paper have collaborated in the past according to their publication records in the PubMed database, which enables us to identify parachuting collaboration. Authors' affiliation information helps us to identify country names from authors' address information in each article by manually merging variations of country names (e.g., ISO two-letter or three-letter country codes, alternative country names, country names in other languages, and country names with typos) into the same country. Finally, standard country names corresponding to authors' locations in 45,470 CORD-19 papers are found. 17 papers with more than 100 authors have been removed, as the inclusion of papers with hyper-authorship might bias the calculation of team variables because of the outliers. Additionally, we use the patient data on COVID-19 derived from the website of Our World in Data, which covers 211 countries from December 2019 to May 2020, to capture the timing when the first COVID-19 case is officially confirmed, and the daily number of new COVID-19 cases and deaths in each sampled country during the December 2019-April 2020 period. The 50 sampled countries account for 92.85% of cases and 97.24% of deaths related to COVID-19 from December 2019 to April 2020. The distribution of COVID-19 cases and deaths in each month is illustrated in Fig. S2. The final dataset used for the regression analysis includes 11,678 research articles published from January 2018 to April 2020 by the top 50 prolific countries that are ranked by the number of coronavirus-related papers published during the study period.
The research goal of this study is to compare monthly changes in countries' scientific novelty before and during COVID-19. The period from January 2018 to November 2019 is considered a sufficient time window to present scientific novelty in the pre-COVID-19 period (1). To measure a country's productivity, we use a full counting method (6) based on the authors' address information. For example, for a paper authored by two scientists with Chinese affiliations, one scientist with a US affiliation and three scientists with UK affiliations, China, the US and the UK get two, one and three papers, respectively. Hence, overall six publications are allocated to these three countries. More than 108 countries published coronavirus research during the study period, among which the top 50 prolific countries are selected as the sampled countries. The variable that measures the monthly average novelty score would be missing for countries that published no or very few coronavirus papers. Observations that have a missing value for any one of the variables used in the regression model would be dropped by Stata. This is why we limit the regression analysis to the top 50 most prolific countries, which account for more than 98.8% of the total coronavirus-related research articles over the study period. The productivity of the 50 sampled countries/regions is shown in Table S1. The distribution of CORD-19 papers by month and country from January 2018 to April 2020 is indicated in Fig. S3. Entities extracted from CORD-19 papers' titles and abstracts are the basic elements used for calculating the novelty scores of entity combinations in each CORD-19 paper, and thus allow capturing the changes in the sampled countries' novelty scores of coronavirus research by month.
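The full counting rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `full_counting` and the list-of-country-strings input are hypothetical, and the sample paper simply restates the China/US/UK example from the text.

```python
from collections import Counter


def full_counting(author_countries):
    """Credit a country once for every author affiliated with it on one paper."""
    return Counter(author_countries)


# Hypothetical paper from the example: two Chinese, one US and three UK affiliations.
paper = ["China", "China", "US", "UK", "UK", "UK"]
credits = full_counting(paper)
total_credits = sum(credits.values())
print(dict(credits), total_credits)
```

Summing such counters over all papers in a month would give each country's monthly productivity under this rule.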
We extract bio-entities from the titles and abstracts of CORD-19 papers from January 1951 to April 2020 using PubTator Central (PTC), a web-based application that automatically tags the input text with standardized biological entities (7). As PTC has annotated entities for all the papers indexed by PubMed, for those CORD-19 papers that have PubMed IDs, we programmatically retrieved their annotations by submitting PubMed IDs to PTC in batches of 100. For those without a PubMed ID, we first submitted a request with their abstracts and titles, in which entities are annotated by the PTC server, and then retrieved the annotated files by submitting the session ID returned by the previously submitted request. 39,882 unique bio-entities from 38,787 CORD-19 papers were identified by PTC and were automatically categorized into four types: species, disease, gene and chemical. The major reason why PTC fails to extract any entities from the titles and abstracts of 33% of CORD-19 papers is that the terms in those papers' titles and abstracts do not include any standardized bio-concepts detected by PTC. The distribution of entities extracted from CORD-19 papers by type is illustrated in Fig. S4.

Measuring Papers' Novelty Score using BioBERT

Building on the knowledge recombination theory (8) and the perspective of combinatorial novelty (9, 10), an indicator that measures the degree of novelty with which knowledge entities are combined in a paper has been proposed. In medicine, these knowledge entities can be represented by biological entities (e.g., drug, disease, and gene) in publications. Innovative discoveries and creative ideas usually stem from the recombination of more distant and diverse sources (9, 11-13). Novelty is a recombination of pre-existing knowledge components in an atypical way (9, 13).
The key idea of the entity-based approach proposed in this study is that the more distant two entities are in the pre-existing knowledge base, i.e., 29 million PubMed articles, the more novel their combination is perceived to be. To measure the novelty of entity combinations in CORD-19 papers, we use BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) to capture the distance between the two bio-entities in each entity pair extracted from CORD-19 papers. BioBERT is a language model pre-trained on biomedical literature, namely PubMed articles (14). Following the structure of BERT by Devlin, Chang, Lee and Toutanova (15), BioBERT consists of multi-layer bi-directional transformers and is pre-trained by masked language model and next sentence prediction tasks. BioBERT can generate a contextual representation for biomedical corpora, which allows different embeddings for the same word in different contexts, instead of producing context-independent word embeddings as Word2Vec (16) or GloVe (17) do. BioBERT has 13 hidden layers. The first layer, also known as the input embedding or sub-word embedding layer, is the sum of the token embeddings obtained from the WordPiece tokenizer (18), the segmentation embeddings and the positional embeddings. The last layer is the contextual (sub-word) representation (token-level). BioBERT has five versions (14), among which we use BioBERT-Base v1.1, which is pre-trained on 29 million PubMed articles with titles for one million steps, based on BERT-base-Cased with the same vocabulary. The batch size for this model is set to 32 and the maximum sequence length is set to 32 tokens. In the former stage, we recognized bio-entities using PubTator and generated entity pairs from CORD-19 papers. We generate a sub-word representation for each bio-entity extracted from CORD-19 papers. The pipeline is shown in Fig. S5.
Then we calculate the distance between the two sub-word representations for each bio-entity pair generated from CORD-19 papers. In Fig. S5, entity n is segmented by the WordPiece tokenizer (18) into a sequence of sub-words to mitigate the out-of-vocabulary (OOV) problem. (CLS) is added at the beginning and (SEP) at the end of the sequence, giving the sub-words $S = \{\mathrm{CLS}, s_1, s_2, \ldots, s_i, \mathrm{SEP}\}$. For example, the entity "coronaviruses" is transformed into "(CLS) Co ##rona ##virus ##es (SEP)". The segmented entity S is then fed into BioBERT. We use the last layer of BioBERT as the contextual representation, which generates a sequence of sub-word representations $V \in \mathbb{R}^{d \times (i+2)}$, where d denotes the 768 dimensions of a hidden layer in BioBERT and i indicates the number of tokens, excluding CLS and SEP, for entity n. For instance, for the entity "coronaviruses", the model generates a representation with 6 × 768 dimensions (four sub-words plus CLS and SEP). The sub-word representations of CLS and SEP are ignored and the rest of the sequence of sub-word representations V is fed into average pooling, which calculates the average over the tokens with a kernel size of (number of tokens, 1), to get the final vector $v_n \in \mathbb{R}^{d}$. Based on the embeddings of bio-entities extracted from CORD-19 papers, we calculate the cosine distance defined in Equation 1 between the two resulting vectors corresponding to the entities in each entity pair extracted from the CORD-19 dataset:

$$\mathrm{distance}(A, B) = 1 - \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} \qquad (1)$$

where $A$ and $B$ indicate the two entities in an entity pair; $A \cdot B$ refers to the dot product of $A$ and $B$; and $\|A\|_2 \|B\|_2$ means the product of $A$'s Euclidean norm and $B$'s Euclidean norm. We extract 39,882 unique bio-entities using PubTator Central from the titles and abstracts of CORD-19 papers published from January 1951 to April 2020 and pair them up (i.e., $e_1$-$e_2$, $e_1$-$e_3$, ...). The cosine distance between the two entities in each of the 596,495 entity pairs detected in CORD-19 papers is computed from the resulting embeddings using BioBERT, which is pre-trained on 29 million PubMed articles.
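The pooling and distance steps above can be illustrated with a small, dependency-free sketch. It assumes the token vectors have already been produced by a language model (no BioBERT call is made here), that the first and last rows correspond to CLS and SEP, and that cosine distance is taken as one minus cosine similarity; the function names `mean_pool` and `cosine_distance` and the two-dimensional toy vectors are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average the sub-word vectors, skipping the first (CLS) and last (SEP) rows."""
    inner = token_vectors[1:-1]
    dim = len(inner[0])
    return [sum(vec[k] for vec in inner) / len(inner) for k in range(dim)]


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for model output: rows are CLS, two sub-words, SEP (2 dimensions, not 768).
entity_a_tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
entity_b_tokens = [[9.0, 9.0], [4.0, 0.0], [2.0, 0.0], [9.0, 9.0]]

vec_a = mean_pool(entity_a_tokens)
vec_b = mean_pool(entity_b_tokens)
print(vec_a, vec_b, cosine_distance(vec_a, vec_b))
```

In a real pipeline the toy rows would be replaced by the 768-dimensional last-layer outputs for each entity's sub-words.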
The distribution of the distance between two entities in entity pairs extracted from CORD-19 papers is shown in Fig. S6(A). We consider an entity pair in which the distance between the two entities is in the upper 10th percentile of this distribution as a novel entity combination. The novelty score for each paper is measured by the proportion of novel entity pairs, according to our definition of a novel entity combination, to the possible number of entity pairs in the paper. The formula used to calculate the novelty score for a paper is shown in Equation 2:

$$\mathrm{Novelty}_p = \frac{m_p}{\binom{n_p}{2}} \qquad (2)$$

where $p$ denotes paper $p$; $n_p$ indicates the number of bio-entities extracted from paper $p$; $\binom{n_p}{2}$ refers to the number of combinations of two that can be drawn from the set of $n_p$ bio-entities extracted from paper $p$, i.e., the number of entity pairs generated by $n_p$ bio-entities; and $m_p$ denotes the number of entity pairs in which the two entities' distance is in the upper 10th percentile of the distribution of the distance between two entities in all entity pairs generated from CORD-19 papers. For example, for a paper that contains three bio-entities (i.e., entities a, b and c), the number of entity pairs for this paper is three. If only the distance between a and b is in the upper 10th percentile of the distribution shown in Fig. S6(A), the novelty score for this paper is 1/3. The higher the novelty score, the more novel entity combinations in a paper. The distribution of CORD-19 papers' novelty scores is indicated in Fig. S6(B), suggesting that most CORD-19 papers include no novel entity combination.

This study investigates the relationship between the country-level monthly change in scientific novelty regarding entity combinations of coronavirus papers and the occurrence of the first COVID-19 case in the country in a given month from January 2018 to April 2020. The major independent variable is whether the first case of COVID-19 (COVID19) has been confirmed in the country by the month.
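The novelty score in Equation 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary of pairwise distances keyed by `frozenset`, the use of `statistics.quantiles` to obtain the 90th-percentile cut-off, the inclusive comparison (`>=`), and the choice to return 0 for papers with fewer than two entities are all assumptions.

```python
from itertools import combinations
from statistics import quantiles


def upper_decile_threshold(all_distances):
    """Cut-off above which a distance falls in the top 10% of the distribution."""
    return quantiles(all_distances, n=10)[-1]


def novelty_score(entities, distance, threshold):
    """Fraction of a paper's entity pairs whose distance reaches the threshold."""
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:  # assumption: papers with fewer than two entities score 0
        return 0.0
    novel = sum(1 for pair in pairs if distance[frozenset(pair)] >= threshold)
    return novel / len(pairs)


# Toy distances for the three-entity example in the text (values are made up).
toy_distance = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.2,
}
score = novelty_score(["a", "b", "c"], toy_distance, threshold=0.8)
print(score)
```

With only the a-b pair above the cut-off, one of the three pairs is novel, which matches the 1/3 example in the text.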
We identify the month when the first case of COVID-19 is officially confirmed in each of the 50 sampled countries according to the patient data from the website of Our World in Data. For example, China is the country where the first case of COVID-19 in the world was found in December 2019, followed by the US (Jan 2020), the UK (Jan 2020) and so forth. Once the first COVID-19 case has been confirmed in the country, the country is treated in that month and the succeeding months. The distribution of treated countries (i.e., the countries where the first COVID-19 case has been confirmed) and untreated countries (i.e., the countries where the first COVID-19 case has not been confirmed) by month is indicated in Fig. S7, suggesting that for most of the sampled countries, the first case was detected in either Jan 2020 or Feb 2020. All sampled countries have been exposed to COVID-19 by the end of the study period.

Paper-level variables: The way to generate the novelty score of each paper is explained in the section of Measuring Papers' Novelty Score using BioBERT and the formula to calculate a paper's novelty score is shown in Equation 2. Besides papers' novelty scores, we are also interested in the change of papers' parachuting collaboration ratio and international collaboration before and during the pandemic. Parachuting collaboration is defined as a co-authorship in which the two authors never collaborated in the past. The parachuting collaboration ratio for a paper is the fraction of author pairs in which the two authors did not collaborate in the past out of the total number of author pairs in the paper, measuring the degree to which parachuting collaboration is involved in the team. It is defined in Equation 3:

$$\mathrm{Parachuting}_p = \frac{q_p}{\binom{a_p}{2}} \qquad (3)$$

where $p$ denotes paper $p$; $\binom{a_p}{2}$ refers to the number of combinations of two that can be drawn from the set of $a_p$ authors listed in paper $p$; and $q_p$ indicates the number of author pairs in which the two authors have no prior collaboration.
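Equation 3 can be illustrated with a short sketch. The function name `parachuting_ratio`, the representation of past co-authorships as a set of `frozenset` pairs, the author identifiers, and the choice to return 0 for single-author papers are assumptions made for illustration only.

```python
from itertools import combinations


def parachuting_ratio(authors, prior_pairs):
    """Share of author pairs on a paper that have no earlier co-authored publication."""
    pairs = list(combinations(sorted(set(authors)), 2))
    if not pairs:  # assumption: single-author papers get a ratio of 0
        return 0.0
    new_pairs = sum(1 for pair in pairs if frozenset(pair) not in prior_pairs)
    return new_pairs / len(pairs)


# Hypothetical three-author paper in which only A1 and A2 have published together before.
prior = {frozenset(("A1", "A2"))}
ratio = parachuting_ratio(["A1", "A2", "A3"], prior)
print(ratio)
```

Here two of the three author pairs (A1-A3 and A2-A3) are first-time collaborations, so the ratio is 2/3.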
The higher the parachuting collaboration ratio for a paper, the more parachuting collaboration is involved in the team of the paper. International collaboration for a paper is a binary variable that is determined by whether the authors listed in a paper are from at least two countries. It is one if at least two authors are from different countries, and zero otherwise. We also calculate team size for each paper, defined as the number of authors listed in a paper, as a control for papers' novelty score, since team size is considered an influential factor of scientific novelty in prior literature (10, 19-21).

Country-level variables: All paper-level variables need to be aggregated to the country level since this study examines the relationship between countries' scientific novelty and the occurrence of the first COVID-19 case in the country. Based on a full counting method (1, 6), we use an example to demonstrate how paper-level variables are aggregated into country-level variables. As shown in Fig. S8, there are two papers, P1 by five authors from three countries (i.e., C1, C2 and C3), and P2 by three authors from two countries (C2 and C3), respectively. A1, who is an author of P1, and A2, who is in the author lists for P1 and P2, both belong to the country C1. We generate a vector, $\{x_{11}, x_{12}, \ldots, x_{1k}\}$, for P1, and a vector, $\{x_{21}, x_{22}, \ldots, x_{2k}\}$, for P2. Each element in the vector represents a variable for a paper, such as the paper's novelty score ($x_1$), parachuting collaboration ratio ($x_2$), whether or not the paper is internationally collaborative ($x_3$) and team size ($x_4$). For example, $x_{11}$ and $x_{21}$ indicate the novelty score for P1 and P2, respectively. C1's average novelty score is the sum of $x_{11}$, $x_{11}$ and $x_{21}$ divided by the number of unique author-paper pairs (i.e., A1-P1, A2-P1 and A2-P2), three. Similarly, C3's average novelty score is equal to the sum of $x_{11}$, $x_{11}$ and $x_{21}$ divided by the number of unique author-paper pairs (i.e., A4-P1, A5-P1 and A6-P2), three.
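The country-level aggregation in this example can be sketched as an average over unique author-paper pairs. The function name `country_average`, the tuple layout, and the novelty values 0.3 and 0.6 are hypothetical; only the author-paper-country assignments for C1 and C3 follow the example above.

```python
from collections import defaultdict


def country_average(author_paper_country, paper_value):
    """Average a paper-level variable over a country's unique author-paper pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for author, paper, country in sorted(set(author_paper_country)):
        totals[country] += paper_value[paper]
        counts[country] += 1
    return {country: totals[country] / counts[country] for country in totals}


# Author-paper pairs for C1 and C3 as listed in the example; novelty values are made up.
pairs = [
    ("A1", "P1", "C1"), ("A2", "P1", "C1"), ("A2", "P2", "C1"),
    ("A4", "P1", "C3"), ("A5", "P1", "C3"), ("A6", "P2", "C3"),
]
novelty = {"P1": 0.3, "P2": 0.6}
averages = country_average(pairs, novelty)
print(averages)
```

The same function could be applied to the other paper-level variables (parachuting ratio, international collaboration, team size) by swapping the `paper_value` mapping.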
The country-level versions of the remaining three paper-level variables are shown in the table in Fig. S8.

The dependent variable is a country's average novelty score (novelty score) of entity combinations for papers by this country published in a given month, which quantifies the monthly average extent to which entities are rarely combined in the country's knowledge production. The higher the novelty score, the more novel the country's knowledge production in that month. We are also interested in the association between new COVID-19 cases and deaths and countries' scientific novelty. Therefore, the daily numbers of new COVID-19 cases (COVID19 case) and deaths (COVID19 death) confirmed in each sampled country are aggregated to the month level and used as two explanatory variables. Various characteristics of scientific teams might be related to novelty, such as team size (19, 20), international collaboration (10) and collaboration between two authors who have not worked with each other before (22). To control for these factors, the following control variables are introduced into the model. The country's monthly average number of authors of CORD-19 papers measures the average team size (team size) of coronavirus papers in a country. The proportion of internationally collaborative papers in a country in a given month reflects the degree to which the country's papers are internationally collaborative (international collaboration ratio). The country's average parachuting collaboration ratio (parachuting collaboration ratio) measures the extent to which parachuting collaboration is involved in the teams of CORD-19 papers published in the month. Summary statistics of the variables and the correlation matrix across variables are shown in Tables S2 and S3, respectively.
We use the unexpected outbreak of COVID-19 as a natural experiment to explore how scientific novelty evolves before and during the pandemic, applying a difference-in-differences (DID) approach to data on 50 sampled countries over 28 months from January 2018 to April 2020. Our major goal is to estimate the association between countries' monthly average novelty scores of entity combinations for all papers published in a given month and whether the first COVID-19 case in the country has been confirmed by that month. To estimate the potential impact of the outbreak of COVID-19 in the country on the scientific novelty of entity combinations, we regress the dependent variable, i.e., novelty score, on whether the first case of COVID-19 in the country has been confirmed by the month (COVID19) and on other covariates that might influence scientific novelty, as shown in Equation 4. We apply an OLS linear model that contains fixed effects for country, $\mu_c$, and for month, $\lambda_t$, to control for time-invariant and country-invariant factors. The coefficient on COVID19 is a before-after estimate of the impact of the pandemic on scientific novelty.

$$\text{novelty score}_{c,t} = \alpha + \beta\,\text{COVID19}_{c,t} + \gamma X_{c,t} + \mu_c + \lambda_t + \varepsilon_{c,t} \quad (4)$$

where $c$ denotes a country, $t$ denotes a month and $X_{c,t}$ is the vector of covariates.

To investigate the dynamic effect of the outbreak of COVID-19 in the country on countries' novelty, we introduce a set of dummy variables that reflect the timing of the first cases. If the outbreak of COVID-19 in month t truly affects countries' novelty, we expect the effect to appear only after the first cases are confirmed, with similar patterns of change for the treatment and control groups before the first cases are detected.
Following (23), we test this by replacing COVID19 in Equation 4 with a set of dummy variables that relate countries' novelty score to the outbreak of COVID-19 in the prior, current and succeeding months:

$$\text{novelty score}_{c,t} = \alpha + \sum_{n \geq 1} \beta_{-n} T_{-n} + \beta_{0} T_{0} + \sum_{n \geq 1} \beta_{+n} T_{+n} + \gamma X_{c,t} + \mu_c + \lambda_t + \varepsilon_{c,t} \quad (5)$$

where T0 refers to the month when the first case is confirmed; T-n indicates whether the observation occurs n month(s) before the month of the first case; T+n indicates whether the observation occurs n month(s) after the month of the first case; and $\mu_c$ and $\lambda_t$ are the country and month fixed effects. An outbreak of COVID-19 can have an effect in T0 or in later months, but it cannot have an impact before the month of the outbreak. The controls $X_{c,t}$ include variables that might be related to scientific novelty: countries' monthly average parachuting collaboration ratio, the fraction of internationally collaborative papers by the country in the month, the average team size of papers by the country in the month and countries' monthly productivity in coronavirus research.

Similarly, using the DID strategy, we investigate the association between the outbreak of COVID-19 in the country and two outcomes: countries' parachuting collaboration ratio in the month and the fraction of internationally collaborative papers by the country in the month. Fixed effects of country and month are included.

To explore the relationship between the severity of COVID-19 in the country and the country's novelty score, we regress the country's average novelty score in the month on the monthly numbers of new COVID-19 cases and deaths. The control variables are the same as those in Equation 4. Fixed effects of country and month are included.
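The two-way fixed-effects DID specification in Equation 4 can be sketched with plain least squares on dummy variables. The snippet below is a minimal illustration on synthetic, noise-free data, not the authors' estimation code: the function name, the country labels, the first-case months and the true effect of 0.05 are all hypothetical, and the control variables and country-clustered standard errors used in the paper are omitted.

```python
import numpy as np


def twfe_coefficient(y, treated, country, month):
    """OLS of y on a treatment dummy plus country and month dummies.

    Returns the coefficient on the treatment dummy.
    """
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=float)
    c_ids = np.unique(country, return_inverse=True)[1]
    m_ids = np.unique(month, return_inverse=True)[1]
    country_dummies = np.eye(c_ids.max() + 1)[c_ids]
    # Drop the first month dummy: the country dummies already span the intercept.
    month_dummies = np.eye(m_ids.max() + 1)[m_ids][:, 1:]
    design = np.column_stack([treated, country_dummies, month_dummies])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0]


# Synthetic panel: four hypothetical countries observed over six months,
# with staggered months of the first confirmed case.
countries = ["C1", "C2", "C3", "C4"]
first_case_month = {"C1": 2, "C2": 3, "C3": 4, "C4": 4}
true_effect = 0.05

rows = [(c, t) for c in countries for t in range(6)]
country = [c for c, _ in rows]
month = [t for _, t in rows]
# A country is treated in the month of its first case and in all later months.
treated = [1.0 if t >= first_case_month[c] else 0.0 for c, t in rows]
# Outcome = country effect + month effect + treatment effect (no noise).
y = [0.1 * countries.index(c) + 0.01 * t + true_effect * d
     for (c, t), d in zip(rows, treated)]

estimate = twfe_coefficient(y, treated, country, month)
print(round(estimate, 6))
```

Because the first-case months are staggered, the treatment dummy is not a linear combination of the country and month dummies, which is what allows its coefficient to be identified alongside both sets of fixed effects.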
By conducting sub-sample analyses and regression analyses that include interaction terms, we investigate, at the paper level, the association between the occurrence of the first global COVID-19 case and papers' novelty score, as well as the relationship between papers' novelty score and two collaboration patterns (i.e., papers' parachuting collaboration ratio and whether the paper is internationally collaborative) in the normal science period and during COVID-19. The association between papers' novelty scores and the occurrence of the first global COVID-19 case is estimated by Equation 6:

$$\text{novelty score}_i = \alpha + \beta_1\,\text{COVID19}_i + \beta_2\,\text{parachuting ratio}_i + \beta_3\,\text{international collaboration}_i + \beta_4\,\text{team size}_i + \delta_{y_i} + \varepsilon_i \quad (6)$$

where $i$ denotes a paper; novelty score indicates the proportion of highly distant entity pairs among all possible entity pairs in a paper; COVID19 is a binary variable that is one if the paper is published in or after December 2019, and zero otherwise; parachuting ratio indicates the proportion of author pairs without prior collaboration among all possible author pairs in a paper; international collaboration is a binary variable that is one if the team includes authors from at least two countries, and zero otherwise; team size indicates the number of authors listed in a paper; and $\delta_{y_i}$ denotes fixed effects for the paper's publication year. To explore the relationship between parachuting/international collaboration and papers' novelty score before and during COVID-19, the interaction terms between these two collaboration variables and the occurrence of the outbreak of COVID-19, i.e., parachuting ratio × COVID19 and international collaboration × COVID19, are introduced into the model.

Sub-sample analyses are conducted to confirm the relationship between papers' novelty score and the two collaboration patterns before and during the pandemic. We separate all coronavirus papers into two groups: papers published before the occurrence of the first global COVID-19 case, i.e., December 2019, and papers published after that month.
Then, we estimate the relationship between papers' novelty and the two collaboration patterns for these two groups of papers separately.

We explore the association between papers' novelty scores and the citations a paper received in a two-year (citation_2), five-year (citation_5) and ten-year (citation_10) window, and whether or not the paper is among the top 1% most cited papers (top1% citation) published in the same year. CORD-19 articles are linked to their records in the Microsoft Academic Graph (MAG) dataset according to papers' DOI and PubMed ID. MAG includes more than 220 million publication records and their metadata (e.g., DOI, title, journal/conference, keywords, fields/disciplines, abstract, authors and their affiliations), as well as more than 1.41 billion citation relationships among them. We found that 32,316 CORD-19 papers received at least one citation. Equation 7 is used to estimate how papers' novelty scores and teams' characteristics are related to papers' citations.

$$\text{citation}_i = \alpha + \beta_1\,\text{novelty score}_i + \beta_2\,\text{parachuting collaboration ratio}_i + \beta_3\,\text{international collaboration}_i + \beta_4\,\text{team size}_i + \delta_{y_i} + \varepsilon_i \quad (7)$$

where $i$ denotes a paper; citation indicates the citation count a paper received in a two-year, five-year or ten-year citation window; novelty score indicates the proportion of highly distant entity pairs among all possible entity pairs in a paper; parachuting collaboration ratio indicates the proportion of author pairs in which the two authors never collaborated in the past among all possible author pairs in a paper; international collaboration is a binary variable that is one if the team includes authors from at least two countries, and zero otherwise; team size indicates the number of authors listed in a paper; and $\delta_{y_i}$ denotes fixed effects for the paper's publication year. The equation is estimated by ordinary least squares (OLS) regression. The results are shown in Table S4.
Papers' novelty scores are negatively but insignificantly related to papers' citations in the different time windows, and to whether the paper is among the top 1% most cited papers published in the same year. This result suggests that papers with novel entity combinations are not "useless" papers, i.e., papers that are less cited. Besides, papers' parachuting collaboration ratio and whether the paper is internationally collaborative are significantly negatively related to the citations papers received, irrespective of the length of the citation window. Papers' team size is significantly positively correlated with papers' citations.

We use multiple strategies to confirm the major findings of this study. First, we change the threshold on the distance between two entities in a novel entity pair to the 95th percentile, discard bio-entities that appear fewer than five times in CORD-19 papers, and re-run all the analyses. 72.07% of bio-entities appear only once in CORD-19 papers, and including bio-entities with a small frequency might make the distance between entities unreliable. After we discard bio-entities with a total frequency lower than five, 121,615 entity pairs remain. We still use the BioBERT model pre-trained on 29 million PubMed articles to calculate the distance between the two entities in each remaining pair. In general, we obtain consistent results, shown in Tables S5 to S7.

Most studies on the science of science assume that the system operates under conditions of institutional stability. Current studies of the scientific community are restricted to the framework of "normal science" (24), which is analogous to the gradually evolving ecological system proposed by Charles Darwin.
But how would the scientific community react if stable social and institutional conditions were punctuated by unexpected and exogenous events, such as natural or human-made disasters? The ongoing COVID-19 pandemic has significantly disrupted every aspect of the economy and society, yet little is known about whether and how extreme events or shocks, such as pandemics, reshape the scientific community and scientific production, especially collaboration and innovation, or about the nature and magnitude of this impact. Most studies that evaluate disasters focus on the impacts on the economy (25, 26), politics, the public health system (27, 28), psychology (29, 30), human life, social infrastructure, the environment (31) and so forth. As an important component of society, science should also be affected, but how science responds to disasters remains an open question. The outbreak of the COVID-19 pandemic has stimulated emerging studies on this topic, but neither survey-based research nor studies that focus on the short-term effect of a particular event (32, 33) capture the overall landscape of the long-run effect of pandemics on science.

There is a lack of understanding of how disasters influence collaboration and innovation, with only a few studies providing initial evidence of both disruptive and positive changes in research productivity after disasters. On the one hand, disasters increase knowledge production linked to the disaster (34, 35) and lead to changes in research topics. On the other hand, evidence shows a negative impact of disasters on research outside the related topics. A recent study shows the expansion of knowledge related to the disaster after the Fukushima Daiichi accident (35). An analysis of terrorism studies from 1991 to 2011 presents a positive relationship between the occurrence of terrorism events and productivity in the domain, with a declining trend in this productivity (35).
This study also indicates that, after the 9/11 attacks, the terrorism-related academic literature grew substantially in the US. However, using data on 107 journals in materials science, Magnone (32) finds that the number of submitted papers and the number of contributing authors in the areas affected by the disaster decreased immediately after Japan's triple disaster. An analysis of the evolution of research topics pertaining to the Chernobyl accident suggests that disasters can generate new scientific trends by motivating scientists to identify important research problems caused by the disaster that require solutions (33). Specifically, in the early years following the disaster, publications tend to address research questions in biochemistry, genetics and molecular biology, while the topics shift to humanity- and environment-related topics in later years (33). A recent survey of 4,500 PIs in the US and European countries shows how the scientific workforce is affected by the outbreak of COVID-19, as well as how research output will be influenced in the near future (36). This survey finds a dramatic decline in the average time spent on research after the onset of the pandemic, especially on laboratory-based research, with significant heterogeneity across fields, genders and individual characteristics.

Another strand of the literature shows that the structure of scientific collaboration is affected by disasters in both directions. With several exceptions, most of the literature describes collaboration patterns after disasters without comparing pre- and post-disaster periods (33, 35, 37). One study found that scientific teams become more collaborative and more productive following natural disasters, with new collaborative relationships developed (38). However, a recent study reveals that teams working on coronavirus-related research during the pandemic are smaller than those before the outbreak of COVID-19 (1).
This study also finds that scientific teams involve fewer nations, despite an increasing level of collaboration between China and the US.

Existing Approaches that Measure Scientific Novelty

As one aspect that reflects the core value of science, creativity or novelty is of great importance for scientific progress (39) (40) (41). Innovation is highly valued in the research system and is often among the critical criteria on which decisions about funding allocation, hiring, promotion and scientific awards are based (42). Because of this importance, extensive efforts have been made to measure the degree to which a scientific discovery provides unique knowledge that is unavailable from prior studies, and to explore the factors that influence the creation of innovation. In the early years, novelty was often evaluated through peer reviews or surveys (43) (44) (45), which is practical only on a small scale. The growth of computing power and enriched bibliometric data have encouraged advances in measuring various aspects of scientific discoveries, including novelty. The first approach considers novelty as the degree to which a scientific discovery is reused by subsequent literature, regardless of the intrinsic quality of the study (12). It is therefore argued that citations could be a measure of usefulness and thus a proxy for creativity (46). Integrating the quality aspect of research articles, the second approach focuses on either the newness or the diversity of knowledge: newness relates to the introduction of a new concept or objective in a study, and diversity to a broader range of knowledge embedded in the research. For example, the novelty of a life sciences study is measured by the age of the keywords assigned to the article, which captures the extent to which a scientist's work is novel relative to the world's research frontier (47).
In the field of biochemistry, another strategy to measure novelty is based on the introduction of a new chemical entity in research (48). The second strategy that focuses on the quality of scientific discovery is to measure the diversity of technological domains a patent cites, using a Herfindahl-type index of the patent classes cited by the focal patent (49, 50).

The combinatorial perspective is often applied in measuring scientific novelty or originality. Novelty in science, technology and artistic creation is often conceptualized as the recombination of antecedent knowledge elements in an atypical way (9, 13, 51, 52, 53, 54), which has become standard in the study of innovation. For example, according to Schumpeter (55), "innovation combines components in a new way, or that it consists in carrying out new combinations." According to Nelson and Winter (51), the creation of novelty in fields ranging from art and science to practical life is a result of the recombination of pre-existing conceptual and physical materials. An invention is considered either a new combination of components or a new relationship between previously combined components (56). Building upon the perspective of combinatorial novelty, some researchers view novelty as a new or unusual combination of pre-existing knowledge components that can be operationalized by patent classes (13), keywords (57, 58), referenced articles (59, 60), referenced journals (9, 11) and chemical entities (48).

Fig. S8. An example demonstrating the procedure to aggregate four paper-level variables to the country level. Each element of the vector represents a variable for paper $i$, such as the paper's novelty score ($x_{i1}$), parachuting collaboration ratio ($x_{i2}$), whether or not the paper is internationally collaborative ($x_{i3}$) and team size ($x_{i4}$).

Fig. S9. An example of calculating a paper's novelty score. P and E indicate papers and entities, respectively.
The formula used to calculate the novelty score for a paper is shown in the following equation:

$$\text{novelty score}_i = \frac{k_i}{C_{n_i}^{2}} \quad (2)$$

where $i$ denotes a paper; $n_i$ indicates the number of bio-entities extracted from paper $i$; $C_{n_i}^{2}$ refers to the number of combinations of two that can be drawn from the set of $n_i$ bio-entities extracted from paper $i$, i.e., the number of entity pairs generated by the $n_i$ bio-entities; and $k_i$ denotes the number of entity pairs in which the two entities' distance is in the upper 10th percentile of the distribution of the distance between two entities over all entity pairs generated from CORD-19 papers. For example, for a paper that contains three bio-entities (i.e., $E_1$, $E_2$ and $E_3$), the number of entity pairs is three. If only the distance between $E_1$ and $E_3$ is in the upper 10th percentile of the distribution of the distance between two entities over all entity pairs generated from the 58,728 coronavirus-related research articles, the combination of $E_1$ and $E_3$ is considered novel and thus the novelty score for this paper is 1/3. The way to generate the novelty score of a paper is explained in the section Measuring Papers' Novelty Score using BioBERT.

Notes: fixed effects regarding publication year are included; robust standard errors are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Table S5. The DID estimates of the relationship between the occurrence of the first COVID-19 case in the country and countries' novelty scores in the month when using the 95th percentile as the threshold of the location of the distance between two entities in a novel entity pair.
Notes: The independent variable in column 1 is whether the first COVID-19 case in the country has been confirmed by the month; fixed effects regarding month and country are included; robust standard errors clustered by countries are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Table S6. The OLS estimates of the relationship between the logged number of new COVID-19 cases and deaths in the country and countries' novelty score in a month when using the 95th percentile as the threshold of the location of the distance between two entities in a novel entity pair.
Notes: The independent variables in columns 1 and 2 are the monthly log-transformed number of new COVID-19 cases and that of deaths, respectively; fixed effects of country are included; robust standard errors clustered by countries are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Table S7. The estimated relationship between papers' novelty scores and teams' characteristics when using the 95th percentile as the threshold of the location of the distance between two entities in a novel entity pair.
Notes: fixed effects regarding publication year are included; robust standard errors are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Table S8. The DID estimates of the relationship between the occurrence of the first COVID-19 case in a country and countries' novelty scores and two collaboration variables in the month.
Notes: The independent variable in columns 1, 3 and 5 is whether the first COVID-19 case in the country has been confirmed by the month; fixed effects regarding month and country are included; robust standard errors clustered by countries are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Notes: The independent variables in columns 1 and 2 are the monthly log-transformed number of new COVID-19 cases and that of deaths, respectively; fixed effects of country and month are included; robust standard errors clustered by countries are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level.

Table S10. The estimated relationship between papers' novelty scores and teams' characteristics.
Notes: fixed effects regarding publication year of papers are included; robust standard errors are in parentheses; ***, ** and * represent significance at the 1%, 5%, and 10% level. The coefficient of international collaboration on novelty score for papers published before COVID-19 (-0.001, p>0.1 in column 1 of Table S10) has the opposite sign to that for papers published during COVID-19 (0.036, p<0.01 in column 2 of Table S10). Furthermore, the coefficient of parachuting collaboration ratio on novelty score is significantly negative for papers published in the normal science period (-0.011, p<0.01 in column 1 of Table S10), whereas it becomes significantly positive for papers published during the pandemic (0.081, p<0.01 in column 2 of Table S10).

1. Creativity and innovation in organizations
2. Atypical combinations and scientific impact
3. Recombinant uncertainty in technological search
4. An evolutionary theory of economic change
5. Consolidation in a Crisis: Patterns of International Collaboration in COVID-19 Research
6. Team assembly mechanisms determine collaboration network structure and team performance
7. Human capital heterogeneity, collaborative relationships, and publication patterns in a multidisciplinary scientific alliance: a comparative case study of two scientific teams
8. International research collaboration: Novelty, conventionality, and atypicality in knowledge recombination
9. Knowledge of the firm, combinative capabilities, and the replication of technology
10. BioBERT: a pre-trained biomedical language representation model for biomedical text mining
11. Building a PubMed knowledge graph
12. Tradition and innovation in scientists' research strategies
13. Economic action and social structure: The problem of embeddedness
14. The effects of repeat collaboration on creative abrasion
15. The structure of scientific revolutions

1. Consolidation in a Crisis: Patterns of International Collaboration in COVID-19 Research
2. CORD-19: The Covid-19 Open Research Dataset
3. Comprehensive named entity recognition on cord-19 with distant or weak supervision
4. Rapidly deploying a neural search engine for the covid-19 open research dataset: Preliminary thoughts and lessons learned
5. Building a PubMed knowledge graph
6. A review of the literature on citation impact indicators
7. PubTator central: automated concept annotation for biomedical full text articles
8. An evolutionary theory of economic change
9. Atypical combinations and scientific impact
10. International research collaboration: Novelty, conventionality, and atypicality in knowledge recombination
11. Bias against novelty in science: A cautionary tale for users of bibliometric indicators
12. Measuring originality in science
13. Recombinant uncertainty in technological search
14. BioBERT: a pre-trained biomedical language representation model for biomedical text mining
15. Pre-training of deep bidirectional transformers for language understanding
16. Efficient estimation of word representations in vector space
17. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP
18. Google's neural machine translation system: Bridging the gap between human and machine translation
19. Creativity in scientific teams: Unpacking novelty and impact
20. Team-level predictors of innovation at work: a comprehensive meta-analysis spanning three decades of research
21. Large teams develop and small teams disrupt science and technology
22. The effects of repeat collaboration on creative abrasion
23. The impact of investor protection law on corporate policy and performance: Evidence from the blue sky laws
24. The structure of scientific revolutions
25. What will be the economic impact of covid-19 in the us? rough estimates of disease scenarios
26. Macroeconomic Implications of COVID-19: Can Negative Supply Shocks Cause Demand Shortages?
27. COVID-19 and Italy: what next? The Lancet
28. The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
29. Prevalence, characteristics, and long-term sequelae of natural disaster exposure in the general population
30. Mental health consequences of disasters. Annual review of public health
31. Did COVID-19 Improve Air Quality Near Hubei?
32. An analysis for estimating the short-term effects of Japan's triple disaster on progress in materials science
33. Quantifying the evolution of a scientific topic: reaction of the academic community to the Chornobyl disaster
34. The extreme case of terrorism: a scientometric analysis
35. Knowledge generation in the wake of the Fukushima Daiichi nuclear power plant disaster
36. Quantifying the Immediate Effects of the COVID-19 Pandemic on Scientists
37. Fifteen years after September 11: Where is the medical research heading? A scientometric analysis
38. When disasters strike environmental science: a case-control study of changes in scientific collaboration networks
39. The sociology of science: Theoretical and empirical investigations
40. Competition in science
41. Originality and competition in science: A study of the British high energy physics community
42. The economics of science
43. Peerless science: Peer review and US science policy
44. A measure of originality: The elements of science
45. What is Originality in the Humanities and the Social Sciences?
46. Knowledge creation in collaboration networks: Effects of tie configuration
47. Incentives and creativity: evidence from the academic life sciences
48. Tradition and innovation in scientists' research strategies
49. Using a distance measure to operationalise patent originality
50. University versus corporate patents: A window on the basicness of invention
51. An evolutionary theory of economic change
52. Hybridizing growth theory
53. Knowledge of the firm, combinative capabilities, and the replication of technology
54. Thematic fame, melodic originality, and musical zeitgeist: A biographical and transhistorical content analysis
55. Business cycles
56. Architectural innovation: The reconfiguration of existing product technologies and the failure of established firms. Administrative science quarterly
57. Looking across and looking beyond the knowledge frontier: Intellectual distance, novelty, and resource allocation in science
58. Breakthrough recognition: Bias against novelty and competition for attention
59. How novelty in knowledge earns recognition: The role of consistent identities
60. When is an invention really radical?: Defining and measuring technological radicalness. research policy