key: cord-1046291-jx69dzsx authors: Hassan, Sk. Sarif; Ghosh, Shinjini; Attrish, Diksha; Choudhury, Pabitra Pal; Uversky, Vladimir N.; Uhal, Bruce D.; Lundstrom, Kenneth; Rezaei, Nima; Aljabali, Alaa A. A.; Seyran, Murat; Pizzol, Damiano; Adadi, Parise; Soares, Antonio; El-Aziz, Tarek Mohamed Abd; Kandimalla, Ramesh; Tambuwala, Murtaza; Azad, Gajendra Kumar; Sherchan, Samendra P.; Baetas-da-Cruz, Wagner; Takayama, Kazuo; Serrano-Aroca, Ángel; Chauhan, Gaurav; Palu, Giorgio; Brufsky, Adam M. title: Possible transmission flow of SARS-CoV-2 based on ACE2 features date: 2020-10-09 journal: bioRxiv DOI: 10.1101/2020.10.08.332452 sha: 8b6b44ab923e265ed13c2ab90936e82480e255f2 doc_id: 1046291 cord_uid: jx69dzsx

Angiotensin-converting enzyme 2 (ACE2) is the cellular receptor for the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which is causing the coronavirus disease 2019 (COVID-19) pandemic. The receptor-binding domain (RBD) of the SARS-CoV-2 spike (S) protein binds to three sub-domains of ACE2 on human cells, viz. amino acids (aa) 24-42, aa 79-84, and aa 330-393, to initiate entry. It was reported earlier that the receptor utilization capacity of ACE2 proteins differs between species such as cats, chimpanzees, dogs, and cattle. A comprehensive analysis of the ACE2 receptors of nineteen species was carried out in this study, and the findings propose a possible SARS-CoV-2 transmission flow across these nineteen species.

We have been acquainted with the term beta-coronavirus for about two decades, since we first encountered the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) outbreak that emerged in 2002, infecting about 8000 people with a 10% mortality rate [1]. It was followed by the emergence of Middle East Respiratory Syndrome Coronavirus (MERS-CoV) in 2012, with 2300 cases and a mortality rate of 35% [2].
The third outbreak, caused by SARS-CoV-2, was first reported in December 2019 in Wuhan, China, and rapidly took the form of a pandemic [3, 4]. To date, this new human coronavirus has affected 36 million people worldwide and is held accountable for over one million deaths [5]. SARS-CoV-2 is an enveloped, positive-sense, single-stranded RNA virus whose genome is about 30 kb in length and encodes 16 non-structural proteins, four structural proteins, and six accessory proteins [6]. The four major structural proteins, which play a vital role in viral pathogenesis, are the Spike protein (S), Nucleocapsid protein (N), Membrane protein (M), and Envelope protein (E) [7, 8]. SARS-CoV-2 infection is mainly characterized by pneumonia [9]; however, multi-organ failure involving myocardial infarction and hepatic and renal damage has also been reported in patients infected with this virus [10].

SARS-CoV-2 binds to the Angiotensin-converting enzyme 2 (ACE2) receptor on the host cell surface via its S protein [11, 12]. ACE2 plays an essential role in viral attachment and entry [13], so the study of the interaction between ACE2 and the S protein is of utmost importance [14, 15]. The S1 subunit of the S protein has two domains, the C-terminal and the N-terminal domains, which fold independently, and either of the domains can act as the Receptor Binding Domain (RBD) for the interaction with and binding to the ACE2 receptor, which is widely expressed on the surface of many cell types of the host [16, 17]. The human ACE2 protein is 805 amino acids long and contains two functional domains: the extracellular N-terminal claw-like peptidase M2 domain and the C-terminal transmembrane collectrin domain with a cytosolic tail [18]. The RBD of the S protein binds to three different regions of ACE2, located at amino acids (aa) 24-42, 79-84, and 330-393 in the claw-like peptidase domain of ACE2 [13].
ACE2 modulates the activities of angiotensin, which promotes aldosterone release and increases blood pressure and inflammation, thus causing damage to blood vessel linings and various types of tissue injury [19]. ACE2 converts Angiotensin II to other molecules and thereby reduces this effect [20]. However, when SARS-CoV-2 binds to ACE2, the function of ACE2 is inhibited, which in turn leads to endocytosis of the virus particle into the host cell [21]. Zoonotic transmission of this virus from bat to human and random mutations acquired by SARS-CoV-2 during human-to-human transmission have also empowered this virus with the ability to undergo interspecies transmission, and many recently reported cases indicate that different species can be infected by this virus [22].

In this study, we aim to determine the susceptibility of other species, that is, whether they are capable of being a possible host of SARS-CoV-2. We chose nineteen different species, analyzed their ACE2 protein sequences in relation to the human ACE2 sequence, and determined the degree of variability by which the sequences differ from each other. In addition to the phylogenetic analysis based on full-length sequence homology, we performed a comprehensive bioinformatics analysis covering polarity, individual domain sequence homology, and secondary structure prediction of these protein sequences. These findings were merged into six distinct clusters of the nineteen species based on the collective analysis, thereby providing a prediction of the interspecies SARS-CoV-2 transmission.

[23]. The nineteen species and their respective ACE2 protein accession IDs with sequence lengths are presented in Table 1. The nearest neighborhood phylogeny of the nineteen species, derived from the NCBI public server based on ACE2 protein sequence similarity, is shown in Fig. 1(A) [24]. ACE2 sequence similarity among the species yields six clusters, as shown in Fig. 1(B).
The contact residues of the receptor-binding domain (RBD) of the spike protein (YP_009724390.1) of SARS-CoV-2 at the Homo sapiens ACE2 interface are presented in Table 2 [13]. The three designated domains, D1 (aa 24-42), D2 (aa 79-84), and D3 (aa 330-393), contain the residues which bind to the RBD of the S protein.

Examining amino acid substitutions: Relative to the human ACE2 receptor, substitutions were examined for all species, and only those substitutions which occurred in the binding residues of the three domains D1, D2, and D3 were taken into account [13]. Based on the character of the substitutions at the binding residues of ACE2 across the various species, two types were defined: substitutions that affect transmission (M1) and substitutions that do not affect transmission (M2).

Multiple sequence alignments and the associated phylogenetic trees were developed using the NCBI web-suite for each of the individual binding domains D1, D2, and D3 of ACE2 of the nineteen species [25, 26].

An algorithmic clustering technique derives homogeneous subclasses within the data such that the data points in each cluster are as similar as possible according to a widely used distance measure, viz. the Euclidean distance. One of the most commonly used simple clustering techniques is K-means clustering [27, 28]. The algorithm is described below in brief.

Algorithm: K-means is an iterative algorithm that partitions the feature vectors into K (pre-defined) clusters such that each data point belongs to only one cluster [27].
• Assign the number of desired clusters (K) (in the present study, K = 6).
• Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
• Compute the sum of the squared distances between the data points and all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroid of each cluster by taking the average of all data points that belong to that cluster.
• Repeat the previous three steps until there is no change to the centroids.

In the present study, the nineteen species were clustered using Matlab by inputting the distance matrix derived from the feature vectors associated with the three domains of ACE2 across all species.

The secondary structure of the full-length ACE2 sequence of every species was predicted using the web-server CFSSP (Chou & Fasman Secondary Structure Prediction Server) [29]. This server predicts secondary structure regions such as alpha-helices, beta-sheets, and turns from the amino acid sequence [29]. After obtaining the full-length ACE2 secondary structures, the individual domains D1, D2, and D3 were cropped for each species.

Bioinformatics features: Several bioinformatics features, viz. Shannon entropy, instability index, aliphatic index, charged residues, half-life, melting temperature, N-terminal of the sequence, molecular weight, extinction coefficient, net charge at pH 7, and isoelectric point of the D1, D2, and D3 domains of ACE2, were determined for all nineteen species using the web-servers Pfeature and ProtParam [30, 31].

Shannon entropy: It measures the amount of complexity in a primary sequence of ACE2. It was determined using the web-server Pfeature by the formula $SE = -\sum_{i} p_i \log(p_i)$, where $p_i$ denotes the frequency probability of a given amino acid in the sequence [30].

Instability index: It is determined using the web-server ProtParam and estimates the stability of a protein in a test tube. A protein whose instability index is smaller than 40 is predicted to be stable; a value above 40 predicts that the protein may be unstable [30].

Aliphatic index: The aliphatic index of a protein is defined as the relative volume occupied by aliphatic side chains (alanine, valine, isoleucine, and leucine). It may be regarded as a positive factor for increasing the thermostability of globular proteins, such as ACE2 [30].
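As a rough illustration of two of the sequence features described above, the following minimal Python sketch computes a Shannon entropy and an aliphatic index for a short peptide. It is not the Pfeature or ProtParam implementation used in the study. The function names and the toy peptide are hypothetical, the base-2 logarithm is an assumption (the text does not state the base), and the aliphatic index uses the commonly quoted coefficients of Ikai (2.9 for valine, 3.9 for isoleucine and leucine) applied to mole percentages.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the residue composition of seq (base 2 is an assumption)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    # p_i is the relative frequency of each residue type present in the sequence.
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def aliphatic_index(seq: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    mole_pct = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    return mole_pct["A"] + 2.9 * mole_pct["V"] + 3.9 * (mole_pct["I"] + mole_pct["L"])


# Hypothetical toy peptide, not a real ACE2 domain sequence.
toy_peptide = "AVILAVIL"
print(shannon_entropy(toy_peptide))
print(aliphatic_index(toy_peptide))
```

In practice the same two functions would be applied to the D1, D2, and D3 subsequences of each species, giving two of the entries of the per-domain feature vector.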
It was reported that the N-terminal of a protein is responsible for its function. For each domain sequence, the N-terminal residue was determined using Pfeature [30].

In vivo half-life: The half-life predicts the time it takes for half of the amount of protein to degrade after its synthesis in the cell. The N-end rule originated from the observation that the identity of the N-terminal residue of a protein plays an essential role in determining its stability in vivo [32].

Extinction coefficients: The extinction coefficient measures how much light a protein absorbs at a particular wavelength. It is useful to estimate this coefficient when a protein is being purified [32].

Polarity sequence: Every amino acid in the domains D1, D2, and D3 of ACE2 was labelled as polar (P) or non-polar (Q), so that every D1, D2, and D3 domain of every species became a binary sequence over the two symbols P and Q. The homology of these sequences was then computed for each domain and, consequently, a phylogenetic relationship was drawn.

Based on the amino acid homology, secondary structures, bioinformatics features, and polarity of the three domains D1, D2, and D3 of ACE2, all nineteen species were clustered. Finally, a cumulative set of clusters of the nineteen species was built, within which SARS-CoV-2 transmission may occur.

First, we examined all the substitutions with similar properties and similar side-chain binding atoms, which signify that the substitutions would not impede SARS-CoV-2 transmission. Note that all the mutations are considered with respect to the human ACE2 domains D1, D2, and D3 (Fig. 2).

Figure 2: Substitutions in the D1, D2, and D3 domains of ACE2 across eighteen species.

In the D1 domain: Out of eighteen species, eight were found to possess a substitution at position 30, where D (Aspartate) was substituted by E (Glutamate), and four were found to carry the D38E substitution.
It was reported that the oxygen atom of the aspartate side chain is involved in ionic interaction; since this side-chain oxygen atom is also present in glutamate, the substitution may not affect the protein-protein interaction properties [33, 34]. In the T27S substitution, Threonine and Serine both possess an OH group that participates in binding, and in the H34L substitution, both Histidine and Leucine use the NH group for interaction with another amino acid. Consequently, if we consider only the critical perspective for these substitutions, we can conclude that these changes would not impede the binding between the S protein and ACE2.

In the D2 domain: L79I is important across the eighteen species, since both amino acids (Leucine and Isoleucine) share similar chemical properties. So, if we analyze the changes in amino acid residues based on their chemical properties, which are the main contributing factor for protein-protein interaction, we can conclude that this substitution will not significantly affect the binding between ACE2 and the RBD of the S protein.

In the D3 domain: Out of eleven substitutions, three (R393K, K353H, and K353R) were of a similar type with similar side-chain interacting atoms, and therefore changes at these positions would not affect the interaction of ACE2 with the S protein.

Secondly, across all nineteen species, homology was derived based on the amino acid sequences and, consequently, the associated phylogenetic trees were drawn (Fig. 3). Six clusters of the nineteen species were formed using the K-means clustering technique based on the sequence homology of the three domains (Fig. 4). The clusters of species {S1, S2, S3} and {S6, S13} stayed together for both the ACE2 full-length sequence homology and the combined sequence similarity of the three domains. The species S16, S17, and S18 showed the same behaviour.
Further, it was observed that the sequence homology of the D1, D2, and D3 domains placed the species S15 in the cluster to which S9, S10, and S12 belong, although S15 was similar to S8 and S9 in terms of the ACE2 sequence. Although the species S4 was very similar to S9, S10, and S12 in terms of full-length ACE2 homology, it grouped with S5 and S11 with respect to the sequence-based spatial organization of the three domains. Also, S7 was found in the proximity of S6 and S13, although S7 was very similar to S5 and S11 based on ACE2 homology.

For each domain of ACE2 of the nineteen species, the secondary structure was predicted (Fig. 5).
• Gallus gallus and Danio rerio have a unique secondary structure in comparison to the others.

These six individual clusters show six different secondary structures in D1 shared by sixteen species, which show high similarities in their secondary structures, while the remaining two species have a unique secondary structure for the D1 domain. Thus, these six clusters have similar secondary structures, indicating that the species within the six clusters are closely related.

Based on the similarity among the three domains, all nineteen species were clustered (Fig. 6). From the clusters (Fig. 6) based on the secondary structure of the three domains of ACE2, it was observed that the species S4 was clustered uniquely, though S4 is clustered with S9 and S19 based on ACE2 full-length sequence homology. Furthermore, S6 and S13 were found to be similar based on ACE2 homology, but they fell into two different clusters when the secondary structure of the three domains was considered. In contrast, the groups of species {S1, S2, S3}, {S9, S12}, {S16, S18}, and {S5, S7, S11} remained in the same clusters with respect to ACE2 homology as well as the individual secondary structures of the domains.

Twelve bioinformatics features, viz. Shannon entropy, instability index, aliphatic index, charged residues, half-life, melting temperature, N-terminal of the sequence, molecular weight, extinction coefficient, net charge at pH 7, and isoelectric point of the D1, D2, and D3 domains of ACE2, were determined for all nineteen species (Fig. 7). For each species, a twelve-dimensional feature vector was obtained (Fig. 7). For each of the domains D1, D2, and D3, a distance matrix was determined using the Euclidean distance $d(S, T) = \sqrt{\sum_{i=1}^{12} (f_i - g_i)^2}$. Note that here $f_i$ and $g_i$ denote the $i$th feature for the species S and T, respectively. These distance matrices, with heatmap representations, are presented for all three domains in Figs. 8, 9, and 10. In addition, by inputting the distance matrix into the K-means clustering technique, several clusters of species were formed for each domain (Figs. 8, 9, and 10). A final set of six clusters was formed using the K-means clustering method, taking all three domains together for the nineteen different species (Fig. 11).

Although the species S7 was clustered with the species S5 and S11 as per full-length ACE2 sequence homology, S7 formed a singleton cluster when the bioinformatics features were taken into consideration. Similarly, the species S16 formed a singleton cluster, though it was clustered with S17, S18, and S19 as per the amino acid homology of ACE2. The sequence homology of ACE2 placed the four species S16, S17, S18, and S19 in a single cluster, but the bioinformatics features placed the species S18 in the cluster to which the three species S1, S2, and S3 belong. Based on the bioinformatics features, S4 clustered together with S15, though the ACE2 receptor of S4 was sequentially similar to the ACE2 of S9, S10, and S12. The clusters {S1, S2, S3}, {S6, S13}, and {S9, S10, S12} were unaltered with respect to both the full-length ACE2 homology and the bioinformatics features.
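The distance-matrix and K-means steps described above can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' Matlab workflow: the function names, the seed, the iteration cap, and the toy two-dimensional feature vectors are all hypothetical, and the rows of the distance matrix are used as the input points simply to mirror the statement that the distance matrix was given to K-means.

```python
import math
import random


def euclidean(f, g):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means; returns one cluster label per input point."""
    rng = random.Random(seed)
    # Initial centroids: K data points drawn without replacement.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # unchanged assignments imply unchanged centroids
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical feature vectors for four made-up species (not data from the study).
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
dist = distance_matrix(features)
labels = kmeans(dist, k=2)
print(labels)
```

With the study's data, `features` would hold one twelve-dimensional vector per species and `k` would be 6. K-means depends on the random initialization, so different seeds can give different partitions on less clearly separated data.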
In the D1 domain, it was observed that, across eighteen species, the polarity of thirteen of the nineteen amino acids (aa 24-42) was conserved. Based on the polar or non-polar nature of the amino acids, the species were arranged in a phylogenetic tree (Fig. 12). It was found that Homo sapiens, Pan troglodytes, Macaca mulatta, and Danio rerio were closer according to this analysis. ferrumequinum respectively were placed in close proximity based on their polarity and non-polarity of the amino acids in the protein sequence. However, Equus caballus, Felis catus, and Mesocricetus auratus were placed separately, since they did not show much resemblance based on polarity. Here the clusters {S1, S2, S3}, {S16, S17, S18}, {S8, S14}, {S6, S13}, and {S10, S12} remained invariant with regard to the homology of full-length ACE2 as well as the polarity sequences of the D1, D2, and D3 domains.

Based on all the different clusters formed on the basis of amino acid homology, secondary structures, bioinformatics, also showed resemblance with cluster-2 (C-2) [Pteropus alecto and Pteropus vampyrus], and cluster-6 (C-6), which comprised Equus caballus (horse) only. C-2 and C-6 were also close to each other.

Furthermore, a pooled analysis based on the two types of substitutions (those affecting SARS-CoV-2 transmission (M1) and those not affecting SARS-CoV-2 transmission (M2)) for all six final clusters is presented in Table 3. Based on the numbers of M1 and M2 substitutions in Table 3, the transmission of SARS-CoV-2 among the species of each cluster is described as follows:
• C-1: None of the species bears any mutation in the binding residues, which are conserved, so viral transmission is expected to be unimpeded.
• C-2: This cluster has an equal number of transmission-affecting and transmission non-affecting substitutions. Therefore, both species have an equal probability of getting infected from each other.
• C-3: Here again, Gallus gallus, Pelodiscus sinensis, and Danio rerio have a similar ratio of M1 to M2, signifying a possible flow of viral transmission within these three species. However, Salmo salar is unique and distant, and therefore viral transmission is unlikely.
• C-4: The species in this cluster have a similar number of transmission-affecting and transmission non-affecting substitutions, which shows that the flow of viral transmission would be continuous among these three species.
• C-5: Transmission between Felis catus and Mesocricetus auratus is highly likely, and the same holds for Manis javanica, Mustela putorius furo, and Rattus norvegicus, as indicated by their similar numbers of substitutions. Therefore, inter-transmission between these species is highly plausible. Because Rhinolophus ferrumequinum has a higher number of transmission-affecting substitutions than all of the above, its susceptibility to infection from other species is uncertain.
• C-6: A total of five transmission-affecting substitutions in the three domains, with respect to human ACE2, were observed.

In this study, we amassed the ACE2 protein sequences of nineteen species to investigate the possible transmission of enabled us to estimate the similarity concerning amino acids and from that, we observed that Salmo salar (Salmon fish) was quite distant. It also gave us the idea that some of the amino acid substitutions in the binding residues across the species, with respect to human ACE2, resulted in amino acids having similar binding properties, indicating that their interactions with the RBD of the S protein will be similar to those of humans, thus making transmission across these species feasible. It was observed that Homo sapiens and Pan troglodytes (Chimpanzee) have complete sequence similarity. In contrast, Macaca mulatta (Rhesus macaque) shared a high percentage of sequence identity, differing at only two amino acid positions.
However, no substitutions were observed in the binding amino acid residues, making viral transmission across these species highly likely. Again, Pteropus vampyrus (Large flying fox) and Pteropus alecto (Black flying fox) have precisely the same ACE2 sequence, signifying a high likelihood of viral transmission and an equal chance of each being infected by the other. Further analysis led us to present a possible transmission flow among the nineteen species, as illustrated in Fig. 14.

The multifaceted examination of the ACE2 protein indicated that interspecies SARS-CoV-2 transmission is quite possible, and we have tried to provide better insight into it by predicting the possible transmission among species within the same cluster as well as between clusters. However, further in-depth analysis is necessary in the future to identify new hosts of SARS-CoV-2 and to determine possible ways to prevent inter-species transmission.

SSH conceived the problem. SSH, DA, SG carried out the work. SSH analyzed the results and wrote the primary draft of the article. All authors reviewed, edited, and approved the final manuscript.

The authors do not have any conflicts of interest to declare.
[1] Severe acute respiratory syndrome (SARS)
[2] Outbreak news: severe acute respiratory syndrome (SARS)
[3] Clinical characteristics of 140 patients infected with SARS-CoV-2 in Wuhan, China
[4] Aerodynamic analysis of SARS-CoV-2 in two Wuhan hospitals
[5] Coronavirus disease 2019 (COVID-19): situation report
[6] A genomic perspective on the origin and emergence of SARS-CoV-2
[7] Novel antibody epitopes dominate the antigenicity of spike glycoprotein in SARS-CoV-2 compared to SARS-CoV
[8] A testimony of the surgent SARS-CoV-2 in the immunological panorama of the human host
[9] Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study
[10] COVID-19 and multi-organ response, Current Problems in Cardiology
[11] ACE2 in the era of SARS-CoV-2: controversies and novel perspectives
[12] Exploring the demographics and clinical characteristics related to the expression of angiotensin-converting enzyme 2, a receptor of SARS-CoV-2
[13] Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
[14] Structural variations in human ACE2 may influence its binding with SARS-CoV-2 spike protein
[15] COVID-19-a theory of autoimmunity to ACE-2
[16] Uhal, ACE2, much more than just a receptor for SARS-CoV-2
[17] Structural and simulation analysis of hotspot residues interactions of SARS-CoV 2 with human ACE2 receptor
[18] Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2
[19] SARS-CoV-2 receptor and regulator of the renin-angiotensin system: celebrating the 20th anniversary of the discovery of ACE2
[20] Angiotensin-converting enzyme 2 (ACE2) is a key modulator of the renin angiotensin system in health and disease
[21] Quantum dot-conjugated SARS-CoV-2 spike pseudo-virions enable tracking of angiotensin converting enzyme 2 binding and endocytosis
[22] Household pets and SARS-CoV2 transmissibility in the light of the ACE2 intrinsic disorder status
[23] NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins
[24] The NCBI taxonomy database
[25] NCBI BLAST: a better web interface
[26] The NCBI
[27] The global k-means clustering algorithm
[28] FGKA: a fast genetic k-means clustering algorithm
[29] CFSSP: Chou and Fasman secondary structure prediction server
[30] Computing wide range of protein/peptide features from their sequence and structure
[31] MFPPI-Multi FASTA ProtParam Interface
[32] How to measure and predict the molar absorption coefficient of a protein
[33] Electrostatic interactions in protein structure, folding, binding, and condensation
[34] Strange bedfellows: interactions between acidic side-chains in proteins

Keywords: ACE2; Viral spike receptor-binding domain; SARS-CoV-2; Transmission; Bioinformatics.