key: cord-0919129-3r8jbhhq authors: Zhang, Chengxin; Zheng, Wei; Huang, Xiaoqiang; Bell, Eric W.; Zhou, Xiaogen; Zhang, Yang title: Protein Structure and Sequence Reanalysis of 2019-nCoV Genome Refutes Snakes as Its Intermediate Host and the Unique Similarity between Its Spike Protein Insertions and HIV-1 date: 2020-03-22 journal: J Proteome Res DOI: 10.1021/acs.jproteome.0c00129 sha: e8b00ac916a001809f009174d88afaafde56c2f6 doc_id: 919129 cord_uid: 3r8jbhhq [Image: see text] As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, the careful analysis of its transmission and cellular mechanisms is sorely needed. In this Communication, we first analyzed two recent studies that concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions share a unique similarity to HIV-1. However, the reimplementation of the analyses, built on larger scale data sets using state-of-the-art bioinformatics methods and databases, presents clear evidence that rebuts these conclusions. Next, using metagenomic samples from Manis javanica, we assembled a draft genome of the 2019-nCoV-like coronavirus, which shows 73% coverage and 91% sequence identity to the 2019-nCoV genome. In particular, the alignments of the spike surface glycoprotein receptor binding domain revealed four times more variations in the bat coronavirus RaTG13 than in the Manis coronavirus compared with 2019-nCoV, suggesting the pangolin as a missing link in the transmission of 2019-nCoV from bats to human. The 2019 novel coronavirus (2019-nCoV), also known as SARS-CoV-2 1 and HCoV-19, 2 is the pathogen behind COVID-19, a new type of pneumonia that initially caused an outbreak in Wuhan, China and has since spread to most countries in the world. The rapid transmission across country borders and the large number of confirmed cases prompted the World Health Organization (WHO) to declare COVID-19 as a global pandemic on March 11, 2020. As of March 23, there are at least 332 930 and 14 510 patients who have been diagnosed with and have died of COVID-19 worldwide, respectively. Among the affected countries, China has the largest population of confirmed cases (81 610) and the second highest death toll (3276). Meanwhile, Europe and North America have also been hit hard: 59 138 and 31 573 cases were confirmed in Italy and the United States, which are the nations with the highest number of 2019-nCoV infected patients in their respective continents, with the number of deaths in Italy (5476) surpassing that of China. Understanding the viral infection mechanisms and animal hosts is of high urgency for the control and treatment of the 2019-nCoV virus. Whereas it is now commonly recognized that bats such as Rhinolophus aff inis may serve as the natural reservoir of 2019-nCoV, 3 it is still unclear which animal served as the intermediate host that brought the bat coronavirus to human hosts. Whereas multiple studies suggest the Malayan pangolin (Manis javanica) as another host, 4−6 some studies have proposed that the pangolin may be a natural host rather than an intermediate host. 7, 8 During 2019-nCoV's infection of host cells, a critical virion component is the spike surface glycoprotein, also known as the S protein. Spike proteins constitute the outermost component in a coronavirus virion particle and are responsible for the recognition of angiotensin-converting enzyme 2 (ACE2), a transmembrane receptor on mammalian hosts that is utilized by the coronavirus to enter the host cells. 3, 9 Therefore, the spike protein largely determines the host specificity and infectivity of a coronavirus. In this Communication, we first analyzed the results of two recent studies, 10, 11 which have spurred numerous interests and discussions in the community and society regarding the sequence and structure of the spike protein in 2019-nCoV and the identification of its intermediate hosts. In particular, the study by Pradhan et al. reported the identification of four unique insertions that were shared only with HIV-1 and were "unlikely to be fortuitous in nature". 10 Although the work has been questioned by the scientific community, rumors and conspiracy theories based on these studies still widely circulate among the general public. 12 We therefore believe that there is an urgent need to systematically examine the bases and conclusions of these studies in serious scientific reports. To further examine the animal hosts of the 2019-nCoV spread, we next assembled the draft genome of a highly related coronavirus using metagenomic samples from Manis javanica. The alignment results of the assembled genome sequences, in particular, on the spike proteins, suggest the importance of pangolins in the evolution of 2019-nCoV and its transmission from bats to humans. Global protein sequence alignment of the full-length coronavirus spike proteins was performed by MUSCLE 13 and visualized by SeaView. 14 We used C-I-TASSER 15 to create structural models of the fulllength spike protein. Here C-I-TASSER is an extended pipeline of I-TASSER 16 and utilizes the deep convolutional neuralnetwork-based contact maps 17 to guide the Monte Carlo fragment assembly simulations. Because the RBD domain of the spike exhibits different conformations relative to the remaining portion of the protein, the DEMO pipeline 18 was then used to reassemble the domains and to construct a complex structure consisting of the spike trimer and the extracellular domain of human ACE2 using the ACE2-bound conformation 2 of the SARS-CoV spike glycoprotein (PDB ID: 6ACJ) as a template. Our complex modeling did not use the template originally used in the Pradhan et al. study (PDB ID: 6ACD) because it did not include the ACE2 receptor. As per the previous study, 11 the relative synonymous codon usage (RSCU) for codon j in a species is calculated as where k j is the number of codons synonymous to codon j (including j itself) and p j is the probability of the respective amino acid being encoded by codon j among all k j synonymous codons in the protein coding sequences (CDSs) of the whole genome. The difference in codon usage in two different species (a virus versus a vertebrate in our case) is defined by the squared Euclidean distance of RSCU, that is Here N = 61 is the number of codons that encodes amino acids, thereby excluding the three stop codons. X j and X j ′ are the RSCUs for codon j in the virus and in the vertebrate, respectively. In our report, the codon usages of all vertebrates are taken from the CoCoPUTS 19 database, which was last updated in January 2020. This database was therefore much more recent than the Codon Usage Database, 20 which was last We noted that these fragments are not bona f ide "insertions"; in fact, at least three out of the four fragments are also shared with bat coronavirus RaTG13 spike glycoprotein (NCBI accession: QHR63300.1), as shown in Table 1 . Nevertheless, we still refer to these fragments as "insertions" in this Communication for consistency with the original report. The receptor binding domain of the spike is marked by the solid box, which corresponds to residue positions 323−545 in the above alignment. A pair of arrows immediately following IS4 indicates the protease cleavage site by which spike proteins are cut into S1 and S2 isoforms. updated in 2007, that was used in the previous research. 11 To obtain the codon usage of coronaviruses, we imported the GenBank annotations of the three coronavirus genomes to SnapGene (GSL Biotech) to export the codon usage In a recent manuscript entitled "Uncanny Similarity of Unique Inserts in the 2019-nCoV Spike Protein to HIV-1 gp120 and Gag", 10 Pradhan et al. presented a discovery of four novel insertions unique to 2019-nCoV spike protein ( Figure 1 ). They further concluded that these four insertions are part of the receptor binding site of 2019-nCoV and that these insertions shared "uncanny similarity" to human immunodeficiency virus 1 (HIV-1) proteins but not to other coronaviruses. These claims resulted in considerable public panic and controversy in the community, 12 even after the manuscript was withdrawn. To investigate whether the conclusions by Pradhan et al. are scientifically precise, we reanalyzed the structural location and sequence homology of the four spike protein insertions discussed therein. Because the full-length structure of the spike protein in 2019-nCoV was not available at the beginning of this study, we used C-I-TASSER 15 to model its tertiary structure as part of our If there are multiple redundant hits for the same gene from different strains of the same species removed, then only one hit is shown. The sequence identity is calculated as the number of identical residues divided by the query length. Only the sequence portion aligned to the query is shown. In this table, we also list the closest BLAST hit from bat coronavirus RaTG13, which is known to be closely related to 2019-nCoV. 3 . efforts in the full genome structure and function analyses of 2019-nCoV, which are available at https://zhanglab.ccmb.med. umich.edu/C-I-TASSER/2019-nCoV/. The 2019-nCoV spike model was then assembled with the human ACE2 structure (PDB ID: 6ACJ) 22 by DEMO 18 to form a spike−ACE2 complex. In Figure 2A , we present a cartoon superposition of the C-I-TASSER model with a recently solved spike structure, 23 where the C-I-TASSER model shares a high structure similarity, with a TM score of 0.95, 24, 25 to the cryo-EM structure. Because the experimental structure covers only 75% of the residues in the full-length sequence, with several important residues on the receptor binding domain (RBD) of the spike protein missing, our following analysis will mainly be built on the C-I-TASSER reconstructed full-length model. We note that C-I-TASSER, also known as "Zhang-Server", is the top ranked automated server for protein structure prediction in the Critical Assessment of protein Structure Prediction round 13 (CASP13) challenge (http://www.predictioncenter.org/casp13/zscores_final. cgi?model_type=best&gr_type=server_only) among all 39 servers from the community. C-I-TASSER improves our previously developed I-TASSER structure prediction protocol 26 by incorporating a deep-learning-based contact map prediction. 17, 27 On all 121 CASP13 targets, the average TM score of the C-I-TASSER first model (0.674) is 8.0% higher than that of I-TASSER (0.624) and 0.15% higher than that of C-QUARK (0.673), which is our only other automated CASP13 server and was ranked in second place in CASP13. As shown in Figure 2B , all four insertions in the C-I-TASSER/ DEMO structural models are located outside the RBD of the spike protein, in contrast with the original conclusion made by Pradhan et al., which stated that the insertions are located on the interface with ACE2. Here it is important to note that following ACE2 receptor binding, the spike protein can be cleaved by host proteases such as cathepsin L (CTSL) to produce the S1 and S2 isoforms to facilitate viral entry into host cells. 28, 29 Because this cutting site immediately follows insertion 4 (IS4) (Figure 1 arrow) along the 2019-nCoV spike protein sequence, there is a possibility that IS4 could affect the cleavage of the spike protein. Regardless, all of the insertions are not directly related to receptor binding. To investigate the viral homologues of the four insertions, we further performed a BLAST sequence search of these four insertions against the nonredundant (NR) sequence database, restricting the search results to viruses (taxid: 10239) but leaving other search parameters at default values. The top five sequence homologues (including the query itself) identified for each insertion are listed in Table 1 . In contrast wit the previous claim that the four insertions are unique to 2019-nCoV and HIV-1, all four insertion fragments can be found in other viruses. In fact, an HIV-1 protein is among the top BLAST hits for only one of the four insertion fragments, whereas three of the four insertion fragments are found in bat coronavirus RaTG13. Moreover, partially due to the very short length of these insertions, which range from six to eight amino acids, the E values of the BLAST hits, which is a parameter used by BLAST to assess the statistical significance of the alignments and usually needs to be <0.01 to be considered significant, 30 are all >4, except for a bat coronavirus hit for IS2. These high E values suggest that the majority of these similarities are likely to be coincidental. Given that three out of the four insertion fragments are found in the bat coronavirus RaTG13, it is tempting to assume that these "insertions" may be directly inherited from bat coronaviruses. Currently, there are at least seven known human coronaviruses (2019-nCoV, SARS-CoV, MERS-CoV, HCoV-229E, HCoV-OC43, HCoV-NL63, and HCoV-HKU1), where many of them, including severe acute respiratory syndrome-related coronavirus (SARS-CoV) and Middle East respiratory syndrome-related coronavirus (MERS-CoV), were shown to be transmitted from bats. 3,31−34 To further examine the evolutionary relationship between the 2019-nCoV and the bat coronavirus in comparison with other human coronaviruses, we used MUSCLE to create a multiple sequence alignment (MSA), presented in Figure 3 , for all seven human coronaviruses and two bat coronaviruses, RaTG13 and RsSHC014, which have been considered to be the ancestors of 2019-nCoV and SARS-CoV, respectively. 3,31,34 Among the four "insertions" (ISs) of the 2019-nCoV, IS1 has only one residue different from the bat coronavirus, and three out of seven residues are identical to MERS-CoV. IS2 and IS3 are both identical to the bat coronavirus. For IS4, although the local sequence alignment by BLAST did not hit the bat coronavirus in Table 1 , it has a close evolutionary relation to the bat coronavirus in the MSA. In particular, the first six residues in the IS4 fragment "QTQTNS-PRRA" from 2019-nCoV are identical to RaTG13, whereas the last four residues, which were absent in the bat coronavirus or SARS-CoV, have at least 50% identity to MERS-CoV and HCoV-HKU1. Putting these together, we believe that there is a close evolutionary relation between 2019-nCoV and bat coronavirus RaTG13. The four insertions highlighted by Pradhan et al. in the spike protein are not unique to 2019-nCoV and HIV-1. In fact, the similarities in the sequence-based alignments built on these very short fragments are statistically insignificant, as assessed by the BLAST E values, and such similarities are shared in many other viruses, including the bat coronavirus. Structurally, these "insertions" are far away from the binding interface of the spike protein with the ACE2 receptor, as shown in Figure 2 , which is also contradictory to the conclusion made by Pradhan et al. Another early study attempting to understand the infection of 2019-nCoV was performed by Ji et al. 11 In this study, the authors analyzed the RSCU of 2019-nCoV and eight vertebrates, including two species of snakes (Bungarus multicinctus and Naja atra), hedgehog (Erinaceus europaeus), bat (Rhinolophus sinicus), marmot (Marmota), pangolin (Manis javanica), chicken (Gallus gallus), and human (Homo sapiens). Among these vertebrates, snakes have the smallest codon usage difference (squared Euclidean distance of RSCU) from 2019-nCoV and were therefore proposed by Ji This conclusion is, however, controversial among virologists due to the lack of prior biological evidence that zoonotic coronavirus can infect animals other than mammals and birds. 35 Moreover, recent studies showed preliminary evidence that pangolins are the likely hosts of 2019-nCoV-like coronaviruses, 4−6 further invalidating Ji et al.'s conclusion. Whereas the conclusion of snakes being intermediate hosts has been commonly questioned by the scientific community, it is still important to carefully examine the base and reliability of the RSCU approach, which should help prevent such biased analyses from misleading the community and the general public. In this Communication, we scrutinize the bioinformatics approach and the underlying biological assumptions through a large-scale replication of the RSCU analysis. The bioinformatics analysis performed in the Ji et al. study has several limitations. First, there are only ∼300 CDSs in the NCBI GenBank for the snake species (Bungarus multicinctus and Naja atra), which the authors chose for their analysis. These CDSs represent <2% of all protein coding genes in a typical snake genome; the genome of the king cobra (Naja hannah), for example, encodes 18 387 proteins according to UniProt (https://www.uniprot.org/proteomes/UP000018936). The limited numbers of known CDSs in Bungarus multicinctus and Naja atra mean that the RSCU statistics may not reflect the actual RSCU distribution in the whole genome. Second, the Codon Usage Database 20 used in the analysis of Ji et al. has not been updated since 2007; a reanalysis using a more recent codon usage database such as CoCoPUTs 19 is therefore needed. Third, only 8 vertebrates were analyzed in their study, whereas there are >100 000 vertebrates with at least one CDS in the NCBI GenBank database. Finally, there is no established evidence that viruses evolve their codon usage to resemble that of their animal hosts; 36 this calls for a careful benchmark of RSCU analysis in terms of its ability to rediscover known hosts of characterized viruses. To address these issues, we reimplemented the RSCU comparison algorithm proposed by Ji et al. to analyze the codon usage in the 2019-nCoV genome (NCBI accession Figure S1D−F) , roughly corresponding to 10% of all protein coding genes in a typical vertebrate genome. As shown in Figure 4A , snakes are not the vertebrates with the lowest RSCU distances to 2019-nCoV, suggesting that the implementation of RSCU analysis by Ji et al. was incomplete. More importantly, the data in Figure 4 show that animals unrelated to coronavirus transmission, such as frogs and snakes, consistently have smaller RSCU distances to known hosts of all three coronaviruses. For example, the top-ranking vertebrates with the lowest RSCU distances to the three different coronaviruses are two kinds of frogs (Megophrys feae and Liophryne schlaginhaufeni), whereas another frog (Xenopus laevis) has the smallest RSCU distances among all vertebrates with sufficient sequences. Part of the reason for the failure of RSCU in intermediate host identification, as shown in Supplementary Table S1, is that different coronaviruses, such as SARS-CoV and MERS-CoV, that are known to utilize different intermediate hosts (Paguma larvata and Camelus dromedarius, respectively), have almost no difference in RSCU (squared RSCU distance = 0.12). These data suggest that the RSCU analysis on its own is not specific enough to discriminate coronaviruses from different vertebrate hosts. In this regard, the failure is not merely due to the use of outdated databases or the small number of species included in the original analysis but is, in fact, caused by the incorrect biological assumption that coronaviruses will evolve their RSCU to resemble that of their hosts. In a recent study, 6 To further examine the possibility, we tried to reassemble a draft genome sequence of the coronavirus using the metagenomic samples of Manis javanica. To this end, we first collected a set of all publicly available metagenome samples for pangolin, including 11 samples from lung, 8 samples from spleen, 2 samples from lymph (NCBI accession PRJNA573298), 37 and 4 samples from feces (NCBI accession PRJNA476660), 38 from the NCBI Sequence Read Archive (SRA) database 39 using the prefetch command of SRA Toolkit version 2.10.3. These samples were converted to paired-end sequencing reads in FASTQ format by faster-dump. A quality check by FastQC version 0.11.9 showed that whereas the 4 samples from PRJNA476660 do not contain adaptor sequences, all 21 samples from PRJNA573298 contain Illumina universal adaptors. Therefore, for these 21 samples, Trimmomatic version 0.39 40 was used to remove adaptor sequences using the flag "ILLUMINACLIP:adapters.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36". To remove contaminations from the host and from human researchers, only read pairs that could be mapped to Manis javanica or Homo sapiens genomes by bowtie 41 version 2.3.5.1 were retained for further analysis. These sequences were converted from SAM format of bowtie2 back to FASTQ format by SAMtools 42 version 1.10 and bedtools 43 version 2.29.2. Following these quality-control processes, we next determined which of the 25 previously mentioned samples include a 2019-nCoV-like sequence by two searches at the protein and nucleotide levels. In the protein-level search, the 2019-nCoV spike protein sequence was searched by BLASTp 30 through protein sequences directly assembled from sequencing reads of a metagenome sample by Plass, a protein-level metagenome sequence assembler, 44 to identify if there were any close hits with an E value <0.01. Meanwhile, the nucleotide-level search selected samples where more than one pair of sequencing reads could be mapped to the 2019-nCoV genome (NCBI accession: MN908947.3) by bowtie. Both searches consistently reported that only the lung samples (SRA accessions: SRR10168376, SRR10168377, and SRR10168378) contain 2019-nCoV-like sequences. Therefore, the sequences were assembled into nucleotide and protein contigs by MEGAHIT and Plass, respectively. The assembled nucleotide and protein sequences were then aligned by BLASTn and BLASTp to the whole genome and the spike protein of 2019-nCoV, respectively, at an E-value cutoff of <0.01. Finally, we separately merged all nucleotide and protein alignments into a single pairwise alignment between 2019-nCoV and the Manis coronavirus (Manis-CoV); when multiple Manis-CoV hits cover the same 2019-nCoV region, the hit with the highest sequence identity to 2019-nCoV is used in the merged alignment. Figure 5A presents a sketch of the draft genome for the Manis-CoV as compared with the released 2019-nCoV genome. 45 Overall, the assembled sequences cover 73% of the 2019-nCoV genome with 91% sequence identity. More importantly, the protein sequences assembled from these Manis lung samples include a partial pangolin coronavirus spike protein that is 92% identical to the 2019-nCoV spike protein ( Figure 5B ). This sequence identity is relatively high, considering that spike proteins are critical for the coronaviruses to invade into host cells and have the largest diversity in coronavirus genomes due to evolutionary pressure to adapt to receptors on different hosts. Notably, there are only 5 residue positions in the Manis coronavirus that are different from 2019-nCoV on the spike receptor binding domain compared with 19 different residue positions between 2019-nCoV and bat coronavirus RaTG13 for the same domain ( Figure 5B, black box) . These data imply that pangolins such as Manis javanica can either be the intermediate hosts of 2019-nCoV for the transmission of bat coronaviruses to humans or serve as alternative natural hosts, together with bats, to provide the genetic material for the origin of 2019-nCoV. Nevertheless, considering that Manis javanica individuals with coronavirus infections are usually in poor or even critical health condition 37 and previously known natural coronavirus hosts (such as bats) are usually asymptotic after infection, thus allowing long-term virus−host coexistence and coevolution, we believe that it is more likely that Manis javanica is an intermediate host rather than a natural host. Approximately one-quarter of nucleotides are missing in our assembled Manis coronavirus draft genome, partly because compared with whole-genome sequencing, metagenome sequencing usually has a lower read depth and more assembly errors caused by the mixture of diverse species in the samples. A higher quality genome with better coverage should, in theory, be attainable if the Manis coronavirus can be isolated and cultured in vitro using a mammalian cell line and is subjected to wholegenome sequencing. Because of the scarcity of experimental and clinical data as well as the urgency to understand the infectivity of deadly coronaviruses, we have been increasingly relying on computational analyses to study the 2019-nCoV virus in terms of protein structures, functions, phylogeny, and interactions at both molecular and organismal levels. Indeed, within less than 1 month of the publication of the 2019-nCoV genome in January 2020, multiple bioinformatics analyses regarding 2019-nCoV have been either published or posted as preprints. Whereas such expeditious analyses provide much needed insights into the biology of the 2019-nCoV virus, there is a caution to avoid overinterpretation of the data in the absence of comprehensive benchmarks or follow-up experimental validations. In this Communication, we have investigated two recently published computational analyses regarding intermediate host identification and the analysis of spike protein insertions. In both cases, we found that the conclusions proposed by the original studies do not hold in the face of more comprehensive replications of these analyses. In particular, we found that the unique sequence "inserts" found by Pradhan et al. are, in fact, shared by multiple viruses, especially with the segments from the bat coronavirus RaTG13, revealing the close evolutionary relation to the latter species. In addition, our benchmark results showed that the data based on RSCU are not specific enough to discriminate the relation between coronaviruses and vertebrates, which contradicts the conclusion by Ji et al. regarding snakes as an immediate host of the 2019-nCoV. Finally, we assembled a draft genome of the 2019-nCoV-like coronavirus using the metagenomic samples from the lung of Manis javanica, which shows an overall coverage of 73% of 2019-nCoV with 91% sequence identity. In particular, the spike protein in the assembled genome, which is critical for the virus to recognize host receptors and therefore bears a high speed of variation, shares a high sequence identity with 2019-nCoV, with only 5 residue position differences compared with 19 differences between 2019-nCoV and bat coronavirus RaTG13. These data provide evidence of the possible evolutionary relations among RaTG13, the Manis coronavirus, and 2019-nCoV. Whereas the current evidence mainly points to the pangolin as the most likely intermediate host, it is possible for other animals to also serve as intermediate hosts for the following two reasons. First, coronaviruses are known to have multiple intermediate hosts. For example, SARS-CoV, of which the palm civet (Paguma larvata) is the most well-known intermediate host, is also reported to use a raccoon dog (Nyctereutes procyonoides) and a ferret badger (Melogale moschata) as intermediate hosts. 46 Second, the 91% sequence identity between the Manis coronavirus and 2019-nCoV is high enough to confirm an evolutionary relation between the two viruses but not high enough to consider them as the same viral species. To put this into perspective, the viral sequence from intermediate hosts of SARS-CoV and MERS-CoV are 99. 8 The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.0c00129. Figure S1 . Top 20 vertebrate species ranked in ascending order of squared Euclidean distance of RSCU to 2019-nCoV, SARS-CoV, and MERS-CoV. Table S1 . Squared Euclidean distances of RSCU among coronaviruses and representative vertebrate species (PDF) Sidorov, I. A.; Sola, I.; Ziebuhr, J. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2 A distinct name is needed for the new coronavirus Identification of a pangolin niche for a 2019-nCoV-like coronavirus through an extensive meta-metagenomic search Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019 Functional assessment of cell entry and receptor usage for lineage B β-coronaviruses, including 2019-nCoV Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag Cross-species transmission of the newly identified coronavirus 2019-nCoV Statement in support of the scientists, public health professionals, and medical professionals of China combatting COVID-19 SeaView Version 4: A Multiplatform Graphical User Interface for Sequence Alignment and Phylogenetic Tree Building Deep-learning contact-map guided protein structure prediction in CASP13 The I-TASSER Suite: protein structure and function prediction Ensembling multiple raw coevolutionary features with deep residual neural networks for contact-map prediction in CASP13 Assembling multidomain protein structures through analogous global structural alignments Codon and Codon-Pair Usage Tables (CoCoPUTs): facilitating genetic variation analyses and recombinant gene design Codon usage tabulated from international DNA sequence databases: status for the year 2000 Analysis of Codon Usage Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2 Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Scoring function for automated assessment of protein structure template quality How significant is a protein structure similarity with TM-score = 0.5? The I-TASSER Suite: protein structure and function prediction ResPRE: highaccuracy protein contact prediction by coupling precision matrix with deep residual neural networks Inhibitors of cathepsin L prevent severe acute respiratory syndrome coronavirus entry SARS coronavirus, but not human coronavirus NL63, utilizes cathepsin L to infect ACE2-expressing cells Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Bats Are Natural Reservoirs of SARS-Like Coronaviruses Bat Origins of MERS-CoV Supported by Bat Coronavirus HKU4 Usage of Human Receptor CD26 Evidence for an Ancestral Association of Human Coronavirus 229E with Bats Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus Evolution of relative synonymous codon usage in Human Immunodeficiency Virus type-1 Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica) The Fecal Metagenomics of Malayan Pangolins Identifies an Extensive Adaptation to Trimmomatic: a flexible trimmer for Illumina sequence data Fast gapped-read alignment with Bowtie 2 The Sequence Alignment/Map format and SAMtools BEDTools: a flexible suite of utilities for comparing genomic features Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold A new coronavirus associated with human respiratory disease in China Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern China Alnaeem, A.; Peiris, M. MERS Coronavirus in Dromedary Camel Herd, Saudi Arabia Accelerating Scientific Discovery The authors declare no competing financial interest. We thank Drs. Gibert S. Omenn and Xiaoqiong Wei for critical review of this manuscript. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), 48