key: cord-0959762-2rrlt5z9
authors: Kostoff, Ronald N.; Morse, Stephen A.
title: Structure and infrastructure of infectious agent research literature: SARS
date: 2010-05-30
journal: Scientometrics
DOI: 10.1007/s11192-010-0240-6
sha: 343e5019aaa831f4e134ac30bc30b23d521c2938
doc_id: 959762
cord_uid: 2rrlt5z9

Text mining was used to extract technical intelligence from the open source global SARS research literature. A SARS-focused query was applied to the Science Citation Index (SCI) (SCI 2008) database for the period 1998–early 2008. The SARS research literature infrastructure (prolific authors, key journals/institutions/countries, most cited authors/journals/documents) was obtained using bibliometrics, and the SARS research literature technical structure (hierarchical taxonomy) was obtained using computational linguistics/document clustering.

The first pandemic of the twenty-first century was the outbreak of Severe Acute Respiratory Syndrome (SARS) caused by the SARS-CoV coronavirus. As far as is known, this outbreak was not due to the deliberate release of the SARS-CoV coronavirus, but rather was a naturally occurring event. The appearance of SARS seems to have involved: (1) a zoonotic origin for SARS-CoV [e.g., horseshoe bats and/or Chiroptera as one wildlife reservoir (Janies et al. 2008)]; (2) transmission to intermediate hosts [e.g., civet cats, raccoon dogs (Feng and Gao 2007)]; (3) human contact with these intermediate hosts in Southern China (Guangdong Province, November 2002) and subsequent cross-species transmission of the coronavirus to humans (Feng and Gao 2007); (4) transmission of the virus through both non-hospital personal contact and hospital staff contact (Zhang 2007); and (5) global transmission of the virus via travelers from affected regions in Asia to other countries. SARS was eventually controlled through increased hygienic measures (e.g., face mask protection, frequent hand washing, living quarter disinfection), travel restrictions, and quarantine.

The implications of the emergence and spread of SARS are ominous. A new virus emerged that had not been identified previously. Several factors combined to let SARS spread to about thirty countries before it was finally contained: the gestation period before the outbreak was identified and the causative virus discovered; bureaucratic delay by Chinese officials in reporting the extent of the pandemic; insufficient initial hygienic measures; inadequate hospital capacity for dealing with the full extent of treatment and confinement; and insufficient initial travel and communication restrictions. During this outbreak, more than 8000 cases were reported to the World Health Organization (WHO), with a case fatality rate of approximately 10% (Hao et al. 2006).

How would these numbers change if SARS-CoV had been deliberately released by a bio-terrorist? No advanced technology would have been required. Infected intermediate hosts could have been distributed to multiple locations to serve as multiple sources for the infection. By the time the intentional nature of the outbreak was recognized, it would have spread globally, causing an overwhelming strain on healthcare resources throughout the world.

One of the less tangible weapons in the arsenal to combat both bio-terrorism and natural pandemics is intelligence. This has myriad forms, including the direct use of humans to access information, use of sophisticated computer systems to track infrastructure and resource movements, use of literature-related discovery techniques to help predict the development of engineered biological threat agents (Swanson et al. 2001) and to predict the emergence of naturally occurring pathogens, and development of technologies to detect, neutralize, shield and vaccinate against bio-warfare and naturally occurring agents.

To demonstrate the use of text mining to gain a better understanding of SARS and SARS-CoV, and how this information could be used to impact preparedness and response activities, a three-part examination of the global SARS literature was undertaken. In the first part, the global SARS research literature was retrieved, and the infrastructure of this literature (key researchers, institutions, journals, etc.) and medical structure (medical themes, inter-relationships among the themes) were identified. In the second part, the seminal documents of this literature will be identified, and summarized as a historical literature review (e.g., Kostoff et al. 2007). In the third part, literature-related discovery (Kostoff 2008; Kostoff et al. 2008) will be used to identify potential preventive measures and treatments for SARS.

Text mining is the extraction of useful information from large volumes of text. Its component capabilities of computational linguistics and bibliometrics were the main analytical techniques used in the present study.

Computational linguistics identifies the main technical/medical themes of the database(s) being examined as well as the relationships among these themes. Science and technology (S&T) computational linguistics (Hearst 1999; Losiewicz et al. 2000) is a process that underlies the extraction of useful information from large volumes of technical text. It identifies pervasive technical themes in large databases from technical phrases that occur frequently. It also identifies relationships among these themes by grouping (clustering) the phrases (or their parent documents) on the basis of similarity.
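As a minimal sketch of the two ideas in the preceding paragraph (frequent phrases as theme indicators, and similarity as the basis for grouping), the following Python fragment counts two-word phrases across a few toy abstracts and computes a cosine similarity between documents. The function names, the tokenization rule, and the sample abstracts are illustrative assumptions, not the tools or data used in this study.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def phrase_counts(texts, n=2):
    """Count word n-grams over all texts; frequent n-grams hint at themes."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical abstracts, for illustration only.
abstracts = [
    "SARS coronavirus spike protein binds the receptor.",
    "The spike protein of SARS coronavirus is a vaccine target.",
    "Quarantine reduced SARS transmission in hospitals.",
]
counts = phrase_counts(abstracts)
```

Here `counts["spike protein"]` and `counts["sars coronavirus"]` are both 2, and the first two abstracts are more similar to each other than either is to the third, which is the kind of signal a clustering step would group on.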

Computational linguistics has been used for a number of purposes including: enhancing information retrieval and increasing awareness of the global technical literature (Kostoff et al. 1997; Greengrass 1997); discovery and innovation based on merging common linkages among very disparate literatures (Swanson 1986; Kostoff 2008; Kostoff et al. 2008); uncovering unexpected asymmetries in the technical literature (Goldman et al. 1999; Kostoff 2003); estimating global levels of effort in S&T sub-disciplines (Kostoff et al. 2004a); helping authors to increase their citation statistics by improving access to their published papers, which also may help journals increase their Impact Factors (Kostoff et al. 2004a, b); and tracking the impact of a specific research area across time and applications areas (Davidse and Van Raan 1997; Kostoff et al. 2001).

Evaluative bibliometrics (Narin 1976; Garfield 1985; Schubert et al. 1987) uses counts of publications, patents, citations and other potentially informative items to develop S&T performance indicators. Its validity is based on the premises that: (1) counts of patents and papers are a valid indicator of R&D activity in the subject area of those patents or papers; (2) the number of times those patents or papers are cited in subsequent patents or papers is a valid indicator of the importance or impact of the cited patent or paper; and (3) the citations from paper to paper, from patent to patent, and from patents to papers are an indicator of the intellectual linkages between the organizations that are producing the patents and papers, and knowledge linkage between their subject areas (Narin et al. 1994).

Evaluative bibliometrics can be used to: identify the infrastructure (authors, journals, institutions) of a technical domain; identify experts for innovation-enhancing technical workshops and review panels; develop site visitation strategies for assessment of prolific organizations globally; and identify impacts (literature citations) of individuals, research units, organizations, and countries.

A typical text mining study of the published literature involves the development of a query for comprehensive information retrieval, a search of the database using computational linguistics and bibliometrics, and an integration of the processed information.

The primary objective of this study was to identify the global research literature that was related directly to SARS. A secondary objective was to estimate the relative level of global effort in the sub-categories of SARS research, as reflected by the emphasis in the published literature. Because citation bibliometrics are an important tool, the SCI was selected as the database for most of the analyses. Additionally, only records classified as "articles" or "reviews" in the SCI were downloaded. Once the source database was selected (1998–early 2008), the iterative search approach of Simulated Nucleation (Kostoff et al. 1997) was used to generate the final search query.

The SARS research literature, as defined by the authors of this study, consists of published open-literature papers that focus on theoretical, laboratory, biological, clinical, and epidemiological aspects of SARS and its coronavirus, and emphasizes the original research literature accessed by the SCI. The SARS literature is defined operationally by the following query: [(SARS and (coronavirus or infect* or virus* or viral or epidemic* or epidemiology or antibodies or antibody or vaccine* or influenza or pandemic* or outbreak* or syndrome)) OR "sars patient*" or "sars transmission" OR "SARS-CoV" OR "severe acute respiratory syndrome"].
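To make the logic of the query above concrete, the following Python sketch applies the same boolean structure to free text using regular expressions. It is only an approximation under stated assumptions: the SCI search engine's own field handling and matching rules are not reproduced, the trailing `*` is treated as simple right truncation, and the helper names (`term_to_regex`, `matches_query`) are hypothetical.

```python
import re

# Terms copied from the query in the text; a trailing * marks right truncation.
CO_TERMS = ["coronavirus", "infect*", "virus*", "viral", "epidemic*",
            "epidemiology", "antibodies", "antibody", "vaccine*",
            "influenza", "pandemic*", "outbreak*", "syndrome"]
PHRASES = ["sars patient*", "sars transmission", "sars-cov",
           "severe acute respiratory syndrome"]

def term_to_regex(term):
    """Compile one query term into a case-insensitive, word-bounded pattern."""
    words = term.rstrip("*").split()
    body = r"\s+".join(re.escape(w) for w in words)
    tail = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + body + tail + r"\b", re.IGNORECASE)

SARS_RX = term_to_regex("sars")
CO_RX = [term_to_regex(t) for t in CO_TERMS]
PHRASE_RX = [term_to_regex(p) for p in PHRASES]

def matches_query(text):
    """Return True if the text satisfies the boolean SARS query."""
    # Branch 1: SARS together with at least one of the co-occurring terms.
    if SARS_RX.search(text) and any(rx.search(text) for rx in CO_RX):
        return True
    # Branch 2: any of the stand-alone phrases.
    return any(rx.search(text) for rx in PHRASE_RX)
```

For example, `matches_query("SARS outbreak in Guangdong")` is true through the first branch, while a text that mentions SARS without any of the co-occurring terms or phrases is rejected.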

Based on recent text mining results (Kostoff et al. 2004a, 2006), document clustering was the only theme identification method used, and was performed using the Abstracts text only. Document clustering combined documents into groups based on their text similarity. It also provided the number of documents in each cluster, a proxy metric for the level of emphasis in each taxonomy category. The approach used in this study was based on a partitional clustering algorithm (Karypis 2004; Zhao and Karypis 2004) contained within a software package named CLUTO.

Figure 1 shows the number of SARS SCI articles (retrieved with the above query) as a function of time. Between 2003 and 2008, a total of 2874 papers were retrieved. For the major country producers of SARS research articles, the temporal production is as follows (country/number of SCI articles):

This distribution reflects the countries hit hardest by SARS (China, Singapore, Taiwan, and Canada) and the role of the USA in leading the public health response. Furthermore, the temporal pattern observed reflects the emergence of a previously unrecognized disease, the discovery of its causative agent, and the control/elimination of the disease.

Publication metrics are counts of papers published by different entities, and citation metrics are counts of citations to documents published by different entities. The citations in all the retrieved SCI papers were aggregated; the authors, specific documents, years, journals, and countries cited most frequently were identified and are presented in order of decreasing frequency.
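The partitional document clustering mentioned in this section can be illustrated with a small Python sketch. It is not CLUTO and does not use CLUTO's criterion functions; it is a plain k-means-style partition with cosine similarity, written only to show how documents are assigned to groups and how cluster sizes then serve as a proxy for emphasis. The seed indices, the bag-of-words weighting, and the toy abstracts are all assumptions, and each text is assumed to contain at least one word.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Unit-length bag-of-words vector stored as a dict (term -> weight)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def centroid(vectors):
    """Normalized sum of the member vectors of one cluster."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}

def partition(texts, seeds, iterations=10):
    """Assign each text to its most similar centroid, then update centroids."""
    vectors = [vectorize(t) for t in texts]
    centroids = [vectors[i] for i in seeds]  # seed documents start the clusters
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels

# Hypothetical abstracts, for illustration only.
abstracts = [
    "spike protein receptor binding",
    "quarantine hospital infection control",
    "spike protein vaccine antibody",
    "hospital staff infection quarantine",
]
labels = partition(abstracts, seeds=(0, 1))
cluster_sizes = Counter(labels)  # documents per cluster, the emphasis proxy
```

With these toy inputs the two protein-related abstracts fall into one cluster and the two infection-control abstracts into the other, so each cluster has two documents.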

The 10 most prolific SARS research authors since the first paper was published (in 2003) and their publication frequency are shown in Table 1. Some of the authors listed collaborated on some of the papers, as shown in Fig. 2. The last column in Table 1 reflects the number of papers on which each author was listed, so there is some double-counting involved.
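The double-counting noted above follows directly from how author publication counts are formed. The short Python sketch below, using hypothetical author names and papers, counts each paper once for every listed author, so the sum over authors exceeds the number of distinct papers whenever papers are co-authored.

```python
from collections import Counter

def author_paper_counts(papers):
    """Count, for each author, the number of papers listing that author."""
    counts = Counter()
    for authors in papers:
        counts.update(set(authors))  # set() guards against duplicate listings
    return counts

# Hypothetical author lists, one per paper.
papers = [
    ["Author A", "Author B"],
    ["Author A"],
    ["Author B", "Author C"],
]
counts = author_paper_counts(papers)
listed_total = sum(counts.values())   # author listings, with double-counting
distinct_papers = len(papers)         # papers counted once each
```

In this toy case there are five author listings but only three distinct papers, which is the same effect that inflates the totals in an author productivity table.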

Five authors (Yuen, Peiris, K. H. Chan, P. K. S. Chan, Sung) account for 65% of the total number of articles (N = 497) published by the top 10 most prolific SARS research authors. Nine of the most prolific authors were from China (all from Hong Kong), and the tenth was from Japan. One of the authors was from a research institution, and the other nine were from universities.

To identify the authors most associated with highly cited SARS papers, the 100 most frequently cited SARS-related documents (as listed in the SCI) were retrieved, and the author frequency extracted. The top 10 central authors are shown in Table 2. Five (50%) of the authors do not appear in the list of most prolific SARS research authors (Table 1). As in Table 1, there is some double-counting due to co-authorship.

The citation data for authors and journals represent citations generated only by the specific records extracted from the SCI database for this study. The data do not represent all the citations received by the papers in those records as they could have received additional citations by papers in other technical disciplines.
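The restriction described above, that only citations made by the retrieved records are counted, can be sketched as follows. The record structure (a `cited_refs` list per record) and the reference strings are hypothetical; the point is only that the tally is built from the reference lists of the retrieved set, so citations from papers outside that set never enter the count.

```python
from collections import Counter

def in_set_citation_counts(records):
    """Count how many retrieved records cite each reference."""
    counts = Counter()
    for record in records:
        # set() so a reference listed twice in one record is counted once
        counts.update(set(record["cited_refs"]))
    return counts

# Hypothetical retrieved records with their reference lists.
records = [
    {"id": "rec1", "cited_refs": ["Ref A 2003", "Ref B 2003"]},
    {"id": "rec2", "cited_refs": ["Ref A 2003"]},
    {"id": "rec3", "cited_refs": ["Ref A 2003", "Ref C 2004", "Ref C 2004"]},
]
citation_counts = in_set_citation_counts(records)
ranking = citation_counts.most_common()  # decreasing citation frequency
```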

Seven of the authors are from China, two are from the Netherlands, and one from the USA. In absolute terms, this is the strongest representation of Chinese authors among highly cited papers in any study we have seen.

A visual way of displaying collaborating teams, as indicated by co-authorship, is through an auto-correlation map. An auto-correlation function describes the correlation between a random function and a copy of itself shifted by some 'lag' distance; an auto-correlation map of authors shows teams of people who publish together. The authors are represented as map nodes, and related authors are connected by lines of different thicknesses, thicker lines indicating stronger links. Figure 2 depicts an author auto-correlation map of the prolific authors, ten of whom are listed in Table 1. Six publishing groups are evident: the largest group (I) (Yuen, Guan, Zhang, Zhou, et al.) is from China and the USA. Other than the intra-connection within groups I to VI, there is not a great deal of inter-connection across groups evident from this diagram, based on the threshold values necessary to display linkages.
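The idea behind such a map (nodes for authors, link strength from shared papers, and a display threshold) can be sketched in Python as below. This is an assumption-laden simplification: raw co-authorship counts stand in for whatever correlation measure the mapping software computes, and the author labels, papers, and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coauthor_links(papers, threshold=1):
    """Return author pairs whose number of shared papers meets the threshold."""
    links = Counter()
    for authors in papers:
        # Sort so each pair is stored under one canonical (a, b) key.
        for a, b in combinations(sorted(set(authors)), 2):
            links[(a, b)] += 1
    return {pair: n for pair, n in links.items() if n >= threshold}

# Hypothetical author lists, one per paper.
papers = [
    ["A", "B", "C"],
    ["A", "B"],
    ["C", "D"],
]
all_links = coauthor_links(papers)
strong_links = coauthor_links(papers, threshold=2)
```

Raising the threshold removes weak links from the display, which is why groups that collaborate only occasionally can appear disconnected on a thresholded map.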

The nine journals containing the most SARS research papers are listed in Table 3 along with the number of SARS-related papers published between 2003 and 2008. These nine journals contained about of the total retrieval of 2874 articles. Two of the journals, Journal of Virology and Emerging Infectious Diseases, clearly constitute the first tier of SARS-publishing journals. Many of these journals are highly specialized, and appear quite applied. The biomedical emphases of these journals are medicine (mainly infectious diseases), virology, and microbiology. 

The top journals in which the 100 most cited SARS papers were published are listed in Table 4. Seventy-six of the 100 most highly cited SARS papers were published in the major medical and multi-disciplinary journals listed in Table 4. No other journal published more than two highly cited papers.

The 10 institutions producing the highest numbers of SARS research papers recovered in our search are listed in Table 5. The 1308 papers in Table 5 represent about 45% of the total 2874 papers. However, the 1308 includes some double-counting because of institutional co-authorship. Six of the top 10 institutions are universities, three are governmental, and one is a hospital. Four are from China, two from the USA, two from Singapore, one from Canada, and one from Taiwan. Five of the institutions stand out in terms of productivity: University of Hong Kong (incl. Queen Mary Hospital); Chinese University of Hong Kong (incl. Prince of Wales Hospital); Chinese Academy of Sciences (all institutes); CDC (a public health agency; includes all variants); and National Taiwan University (incl. NTU Hospital).

The institutions producing the most highly cited SARS papers are listed in Table 6. In Table 6, # HI-CIT is the number of highly cited SARS papers published by the institution, and # REC is the total number of SARS papers published by the institution. The two Hong Kong institutions performed well in terms of absolute numbers of highly cited papers, but when normalized by their total paper production, they are well behind the European institutions of Erasmus Medical Center and University of Frankfurt, and behind the American institutions of Harvard University and NIH as well.
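The normalization described above amounts to dividing # HI-CIT by # REC for each institution. The Python sketch below shows how a ranking by absolute count can reverse once that ratio is used. The institution labels and numbers are hypothetical placeholders, not values from Table 6.

```python
def highly_cited_share(hi_cit, rec):
    """Fraction of an institution's papers that are highly cited."""
    if rec <= 0:
        raise ValueError("total paper count must be positive")
    return hi_cit / rec

# Hypothetical (# HI-CIT, # REC) pairs, for illustration only.
institutions = {
    "Institution X": (20, 200),  # many highly cited papers, large output
    "Institution Y": (6, 30),    # fewer highly cited papers, small output
}

by_absolute = sorted(institutions, key=lambda n: institutions[n][0], reverse=True)
by_share = sorted(institutions,
                  key=lambda n: highly_cited_share(*institutions[n]),
                  reverse=True)
```

Institution X leads on the absolute count, while Institution Y leads once output size is taken into account, mirroring the contrast drawn in the text between absolute and normalized performance.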

The 10 countries producing the most SARS research papers between 2003 and 2008 are listed in Table 7.

These numbers reflect each country's appearance in a paper, and could include some double-counting due to country co-authorship. Eight of the top 10 countries (all except Taiwan and Australia) participated in the WHO multicenter SARS collaboration and had access to specimens and virus from SARS patients. The first five countries were affected by the SARS outbreak, although to different degrees.

The countries producing the most highly cited SARS papers are listed in Table 8.

The top eight countries were part of the WHO collaborating SARS network. All of these countries had immediate access to specimens and virus from SARS patients.

The ten most highly cited papers are listed in Table 9. Of interest is that all of the top 10 articles were published in 2003. This reflects the fact that SARS was not recognized before October 2002. The papers can be grouped into several broad categories. The first is the identification of the novel coronavirus associated with SARS and its properties (papers by Ksiazek, Drosten, Rota, Peiris, and Marra). The second group of papers describes the clinical features of SARS (papers by Lee, Peiris, Poutanen, and Tsang). The third group comprises one paper describing viral features that could be exploited for development of anti-viral drugs (paper by Li). It is likely that the high number of citations for the top five articles reflects the fact that they are cited in the introduction of most SARS papers.

Thirty-two individual clusters were chosen for the database (2003-2008; Articles and Reviews retrieved from the SCI), and they were agglomerated as shown in the next section. For the second-level taxonomy, each first-level category is divided into two subcategories. SARS Epidemiology/Clinical Medicine is divided into SARS Epidemiology (N = 820 records) and SARS Diagnosis and Clinical Treatment (N = 397 records), while SARS Biology is divided into SARS Coronavirus Structure (N = 1264 records) and SARS Drug Development (N = 240 records). The SARS Epidemiology category focuses on controlling the spread of infection both in health care facilities and in the larger local and global settings, while the SARS Diagnosis and Clinical Treatment category focuses on [...]. The second-level categories are further subdivided to form eight third-level categories. This will be the final level discussed. The third level is treated as a flat taxonomy. The category heading (in bolded italics) is followed by the category summary metrics (prolific Authors, Countries, Institutions), which are followed by the component cluster themes, bulleted.
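Because cluster record counts are used as a proxy for level of emphasis, the second-level counts quoted above can be rolled up to the first level with a few lines of Python. The counts are taken from the text; the dictionary layout and function name are illustrative assumptions, and the total below is simply the sum of the four quoted counts.

```python
# Second-level record counts as quoted in the text.
taxonomy = {
    "SARS Epidemiology/Clinical Medicine": {
        "SARS Epidemiology": 820,
        "SARS Diagnosis and Clinical Treatment": 397,
    },
    "SARS Biology": {
        "SARS Coronavirus Structure": 1264,
        "SARS Drug Development": 240,
    },
}

def roll_up(tree):
    """Sum the record counts of each first-level category's subcategories."""
    return {name: sum(children.values()) for name, children in tree.items()}

first_level = roll_up(taxonomy)
clustered_total = sum(first_level.values())
shares = {name: n / clustered_total for name, n in first_level.items()}
```

On these figures the epidemiology/clinical branch holds 1217 records and the biology branch 1504, so biology accounts for a little over half of the 2721 records covered by the four quoted counts.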

Categories 1 and 2 are epidemiology-related. Category 1 focuses on the broader epidemiological issues, such as classes of emerging infectious diseases, surveillance, monitoring, and quarantine during outbreaks, and modeling epidemiologic dynamics. Overall, China produced about 35% more papers than the USA; however, in this category the USA outproduced China by about 60%. Canada, which overall produced 25% as many papers as China, produced about half as many as China in this category. Similarly, the Chinese University of Hong Kong overall had almost as many papers as the University of Hong Kong, but in this category it had about half as many.

Category 2 focuses on attempts to control the spread of infection within the health care setting, and to identify the effects of SARS on the physical and mental health of the care givers. In this category, unlike Category 1, China had more than twice as many papers as the USA, perhaps reflecting China's greater urgency and much more extensive operational experience with SARS. Also, in this case, the Chinese University of Hong Kong had almost 50% more papers than the University of Hong Kong.

Categories 3 and 4 reflect the clinical aspects of SARS. Category 3 addresses clinical treatments, emphasizing outcomes and clinical features. In this category, China outproduced the USA by a factor of four, again reflecting the much more extensive experience in treating patients with SARS. Category 4 addresses clinical diagnosis, emphasizing chest radiographs and pulmonary pathology. China outproduced the USA by a factor of six, reflecting again its greater experience. While the numbers of records in Category 4 are relatively small, the Chinese University of Hong Kong outproduced the University of Hong Kong by almost a factor of two.

Categories 5 and 6 reflect various aspects of the SARS coronavirus and its pathogenesis. Category 5 is relatively broad, addressing the genetic components of the coronavirus as well as the structural components. The striking feature of this category is that University of Hong Kong now has double the paper output of Chinese University of Hong Kong.

Category 6 reflects the protein components of SARS coronavirus and their role in pathogenesis, host immune response, and as potential vaccine candidates. Here, the Chinese Academy of Sciences has 50% more papers than the University of Hong Kong, and the Chinese University of Hong Kong is not even among the top five producers.

Categories 7 and 8 reflect issues related to drug development. Category 7 addresses protease/proteinase inhibitors. The Hong Kong universities do not appear among the first five, the leaders being the Chinese Academy of Sciences, followed by Academia Sinica. Category 8 addresses virus-inhibiting drugs. In this category, Belgium, which had slightly over 2% of China's papers overall, has one-third of China's paper production, with the main Belgian institution being the Catholic University of Louvain.

To summarize, the Chinese University of Hong Kong played the strongest role in the patient care and epidemiology issues, while the University of Hong Kong and Chinese Academy of Sciences played a leading role in the biological research output. Among USA institutions, CDC played a strong role in the epidemiology of SARS and in the characterization of the SARS coronavirus. NIAID and Harvard played an important role in elucidating the role of viral proteins in the immune response and as potential vaccine candidates.

SARS was the first major pandemic of the twenty-first century. The nations involved were essentially caught unprepared, since there are few efforts devoted to predicting new pathogenic forms such as biowarfare agents or emerging infectious diseases (Swanson et al. 2001). Nevertheless, multilateral collaboration was rapidly established, as exemplified by the WHO-organized multicenter collaborative network. As the literature bibliometrics results show, PRC played a major role in all aspects of diagnosing, treating, and researching the disease. In fact, PRC had a higher fraction of most highly cited papers in this study than in any we have seen before. Within PRC, the researchers at the Hong Kong institutions played a major role in significant advances across the SARS spectrum.

According to recent surveys, none of the treatments used during the pandemic were effective. More than 8000 people were reported to have contracted SARS, and about 10% of those died. Given the large numbers of contacts, why didn't far more people contract SARS? And, even though 10% died, how did the other 90% survive, given the lack of effective treatment?

We believe there is a message in the fact that these numbers were not substantially larger than they were. We will be using these results in the third part of the study to identify potential discoveries (treatments and preventive measures) related to SARS.

Out of particles: impact of CERN, DESY, and SLAC research to fields other than physics

Towards our understanding of SARS-CoV, an emerging and devastating but quickly conquered virus. Comparative Immunology, Microbiology and Infectious Diseases

History of citation indexes for chemistry-a brief review

Term domain distribution analysis: A data mining tool for text databases

Information retrieval: An overview. National Security Agency

Bioinformatics research on the SARS coronavirus (SARS_CoV) in China

Untangling text data mining

Evolution of genomes, host shifts and the geographic spread of SARS-CoV and related coronaviruses

CLUTO-A clustering toolkit

Bilateral asymmetry prediction

Literature-related discovery: Introduction and background

Fullerene roadmaps using bibliometrics and Database Tomography

The structure and infrastructure of Chinese science and technology. DTIC Technical report number ADA443315. Fort Belvoir, VA: Defense Technical Information Center

Literature-related discovery: Methodology

Citation mining: Integrating text mining and bibliometrics for research user profiling

Database Tomography for information retrieval

The seminal literature of anthrax research

Fractals roadmaps using bibliometrics and Database Tomography

Nonlinear dynamics roadmaps using bibliometrics and Database Tomography

Textual data mining to support science and technology management

Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity (monograph). NSF C-637. National Science Foundation. 1976. Contract NSF C-627

Bibliometrics theory, practice and problems

Subject field characteristic citation scores and scales for assessing research performance

Fish oil, Raynaud's syndrome, and undiscovered public knowledge

Information discovery from complementary literatures: Categorizing viruses as potential weapons

The outbreak pattern of SARS cases in China as revealed by a mathematical model

Empirical and theoretical comparisons of selected criterion functions for document clustering
