key: cord-0823962-mxfhtbmz authors: Lei, Zhixiong; Zhang, Dan; Yang, Ruiping; Li, Jian; Du, Weixing; Liu, Yanqing; Tan, Huabing; Liu, Zhixin; Liu, Long title: Substitutions and codon usage in SARS-CoV-2 in mammals indicate natural selection and host adaptation date: 2021-04-22 journal: bioRxiv DOI: 10.1101/2021.04.04.438417 sha: 08ce507f0c1a43c51e66e5908d4bc87cac416008 doc_id: 823962 cord_uid: mxfhtbmz
The outbreak of COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, rapidly spread to create a global pandemic and has continued to spread across hosts from humans to animals, transmitting particularly effectively in mink. How SARS-CoV-2 evolves in animals and humans and how these separate evolutionary processes differ remain unclear. We analyzed the composition and codon usage bias of SARS-CoV-2 in infected humans and animals. Compared with other animals, SARS-CoV-2 in mink had the most substitutions. Substitutions of cytidine account for nearly 50% of the substitutions in SARS-CoV-2 in mink, but only 30% of those in other animals. The incidence of adenine transversion in SARS-CoV-2 in other animals is threefold higher than that in mink-CoV (the SARS-CoV-2 virus in mink). A synonymous codon usage analysis showed that SARS-CoV-2 is optimized to adapt to the animals in which it is currently reported, and all the animals except mink showed decreased adaptability relative to that of humans. A binding affinity analysis indicated that the spike protein of the SARS-CoV-2 variant in mink showed a greater preference for binding the mink receptor ACE2 than the human receptor, especially because the Y453F and F486L mutations in mink-CoV improve binding affinity for the mink receptor. Our study focuses on the divergence of SARS-CoV-2 genome composition and codon usage in humans and animals, indicating possible natural selection and current host adaptation.
Introduction

SARS-CoV-2 is a β-coronavirus that emerged in 2019 and spread worldwide, leading to an ongoing global pandemic [1, 2]. As of February 19th, 2021, the number of infected cases had reached 110 million, and more than 2.4 million deaths had occurred (Johns Hopkins University statistics; https://coronavirus.jhu.edu/map.html). SARS-CoV-2 has a single-stranded positive-sense RNA genome containing 29,903 nucleotides and consisting of 11 open reading frames (ORFs) encoding 27 proteins [3]. The S glycoprotein is a viral fusion protein that recognizes the host receptor ACE2 [4].

SARS-CoV-2 has a broad host spectrum because it binds a receptor common to humans and animals [5]. To date, the following animals have been reported to be susceptible to infection: cats, dogs, tigers, lions, ferrets, and mink [6-12]. SARS-CoV-2 infections of pets, including cats and dogs [8, 10], were the earliest reported animal infections in the epidemic. Later, in a report on SARS-CoV-2 infection in tigers, lions,

parameter of the most common residues at each location is fixed to 0, while the other fitness parameters are limited to −20 < F < 20.

Statistical analyses were performed using ANOVA followed by Tukey's post hoc test, and the data were considered significantly different if the p-value was less than 0.05. ***p<0.001, **p<0.01, *p<0.05. The figures were plotted with GraphPad Prism 5.0.

Sequence and analysis of SARS-CoV-2 isolated from animals

As of Feb 2nd, 2021, more than 400 thousand SARS-CoV-2 genome sequences had been uploaded to the GISAID database. It is important to study the mutation rates and selective pressures on the SARS-CoV-2 genome during the spread of the epidemic.
The results presented in Fig 1A show that the evolutionary entropy increased at specific sites in the whole genome of SARS-CoV-2, indicating substitution and selection capacity at these sites. In addition to humans, SARS-CoV-2 infects other animals (Fig 1B) and evolves in these animals. A phylogenetic tree was reconstructed based on animal-derived whole-genome consensus sequences compared with the SARS-CoV-2 human isolate WIV04 (Fig 1C). Most SARS-CoV-2 isolates from the same animal clustered together, and a single clade contained the sequences from all the mink regardless of their geographic region.

The cluster of SARS-CoV-2 from mink (mink-CoV) has more substitutions relative to the reference sequence WIV04 (Supplementary Table S2), and substitutions of cytidine account for nearly 50% of the substitutions in mink-CoV, while in other animals cytidine accounts for only 30% of the substitutions (Fig 1D). The substitution of adenine in SARS-CoV-2 in other animals is threefold higher than that in mink-CoV. To track how the substitutions occurred in the mink-CoV genome, we recorded all the mutations in the mink-CoV genome in reference to the WIV04 genome. The results in Fig 1E & 1G show that the cytidine-to-uracil transition accounted for more than 40% of the substitutions and was eightfold more frequent than the uracil-to-cytidine substitution. Notably, substitutions of guanine and adenine were more than threefold more frequent in nonsynonymous mutations than in synonymous mutations (Fig 1F).

Mutational spectra of Spike protein in human and animal samples

The evolutionary entropy analysis (Fig 2A) revealed that most of the notable mutation pressures on the Spike protein occurred in a few relatively narrow domains: the N-terminal domain (NTD, green), the receptor binding motif (RBM, purple), the SD (pink), and the CH and CD (blue) domains.
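The per-site evolutionary entropy used in Fig 1A and Fig 2A is not defined in this part of the text. The following is a minimal sketch, assuming it is the Shannon entropy of the residue frequencies in each column of a multiple sequence alignment. The function names `column_entropy` and `alignment_entropy` and the toy alignment are hypothetical and are not taken from the authors' pipeline.

```python
import math
from collections import Counter


def column_entropy(column, ignore="-"):
    """Shannon entropy (in bits) of one alignment column.

    Characters listed in `ignore` (gaps by default) are skipped.
    This is an assumed definition, not necessarily the authors' one.
    """
    counts = Counter(ch for ch in column if ch not in ignore)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(sequences, ignore="-"):
    """Per-site entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) iterates over alignment columns.
    return [column_entropy(col, ignore) for col in zip(*sequences)]


# Toy alignment (hypothetical data): sites 2 and 4 vary, sites 1 and 3 do not.
toy_alignment = ["ACGT", "ACGT", "ATGT", "ACGA"]
for site, h in enumerate(alignment_entropy(toy_alignment), start=1):
    print(f"site {site}: entropy = {h:.3f} bits")
```

A fully conserved site has zero entropy, and the value grows as more residues share the column, so peaks in such a profile mark the sites where substitutions accumulate.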
The variation in the spike gene was evident when all the sequences isolated from humans and animals included in our study were recorded, which led to the identification of a number of highly variable residues, including L18F, A222V, S477N, P681H, S982A and D1118H (Fig 2B and 2C).

CAI was used to quantify the codon usage similarities between different coding sequences based on a reference set of highly expressed genes [27]. To clarify the optimization of SARS-CoV-2 in different hosts, we calculated the average CAI of the SARS-CoV-2 whole genome (Fig 2F) and spike region (Fig 2G). Interestingly, SARS-CoV-2 in bat hosts had a higher CAI value than in humans, while the CAI value in dogs was clearly lower than in humans (Fig 2F). The codon usage bias of the spike mutants is shown in Supplementary Table S3. For the spike gene, Fig 2G shows that pangolins, cats, dogs, tigers, and lions all had a lower CAI value than humans. These results indicate that SARS-CoV-2 has optimized its codon usage to adapt to the animals in which infection has been reported, but all of them except mink showed a downward trend in adaptability relative to humans.

Recently, Wang et al. reported that the tyrosine-protein kinase receptor (UFO, also called AXL) is a candidate receptor for SARS-CoV-2 infection of the respiratory system [28]. Here, the interaction of spike with UFO was predicted using the ZDOCK server (http://zdock.umassmed.edu/) after simulation with the structures of human and mink UFO. The results showed that the spike interacts with human and mink UFO through the amino acids Glu56, Glu59, His61, Glu70 and Glu85 (Fig 3B), which form electrostatic attractions and hydrophobic interactions with residues K147, P251, D253 and N148 on spike. All these residues are located in the NTD of spike (Fig 3A).
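The CAI values compared in Fig 2F and 2G can be illustrated with a minimal sketch. It assumes the standard relative-adaptiveness formulation of the codon adaptation index: each codon is weighted by its count in a host reference set divided by the count of the most frequent synonymous codon, and the CAI is the geometric mean of those weights over a coding sequence. The function names, the pseudo-count of 0.5 for unobserved codons, and the toy reference counts are assumptions for illustration; this is not the tool or reference table used by the authors.

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def relative_adaptiveness(reference_counts):
    """Weight of each codon: its count over the largest count in its synonymous family.

    Stop codons and single-codon amino acids (Met, Trp) get no weight.
    Families never seen in the reference set are skipped entirely.
    """
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue
        top = max(reference_counts.get(c, 0) for c in codons)
        if top == 0:
            continue
        for c in codons:
            # Assumed workaround: unobserved codons count as 0.5 to avoid log(0).
            weights[c] = max(reference_counts.get(c, 0), 0.5) / top
    return weights


def cai(sequence, weights):
    """Geometric mean of codon weights over an in-frame coding sequence."""
    seq = sequence.upper().replace("U", "T")
    logs = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        w = weights.get(seq[i:i + 3])
        if w is not None:  # codons without a weight are ignored
            logs.append(math.log(w))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy host reference counts (hypothetical numbers, not a real codon usage table).
toy_reference = {"CTG": 40, "CTC": 20, "GCC": 30, "GCT": 15}
toy_weights = relative_adaptiveness(toy_reference)
print(cai("AUGCUGGCC", toy_weights))  # only host-preferred codons
print(cai("AUGCUCGCU", toy_weights))  # less preferred synonymous codons
```

Under this definition, a viral gene that uses the codons favored by the host reference set scores close to 1, which is the sense in which a lower CAI is read as weaker adaptation to that host.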
To identify differences in the receptor sequences between animals and humans, the ACE2 and UFO amino acid sequences of humans, mink, ferrets, tigers, cats, and dogs were aligned (Fig 3C). The results showed that the critical substitutions H34Y, L79H and G354R appear in mink and ferret ACE2 (Fig 3C, upper), and the variations H61T, I68V and E85G are evident in the UFO sequences of all the animals except tigers (Fig 3C, lower). Viral variation is another important factor to consider when analyzing infection differences between animals and humans. Corresponding to the contact residues on the receptors, alignment of the spike residues that contact UFO and ACE2 indicated that the residues binding UFO are conserved (Fig 3D), while the residue at site 453, which interacts with position 34 in ACE2 (Fig 3E), showed a higher binding affinity for F453-Y34 in mink and ferrets than for Y453-H34 in humans (Fig 3F). The interaction of L486-T82 showed increased binding energy in mink and ferrets (Fig 3F). These variations indicate that the SARS-CoV-2 Spike shows a greater preference for binding mink ACE2 than human ACE2 after these mutations occur.

Amino acid substitutions within the SARS-CoV-2 Spike RBM may have contributed to host adaptation and cross-species transmission. N439K, S477N and N501Y were the most abundant variations throughout the RBM region (Fig 4A and 4B). N439 does not bind ACE2 directly but helps stabilize the 498-505 loop [29]; the N439K substitution is absent in animal CoVs (Fig 3D).
Previous computational analysis combined with entropy analysis of the spike (Fig 2A) showed that S477N may have decreased stability compared with the wild type [30]

variation, other important mutations should also be considered in mink and human prevalent strains, such as Y505H (Fig 4E), which also affect binding with the ACE2 receptor; histidine (CAU) has a codon fitness similar to that of tyrosine (UAU) (Fig 4D).

In addition to viral codon adaptation, mutation factors must be considered for virus prevalence. There were many lineages, such as B.1.1.7

preference on the host is widely recognized and is also one of the main natural selection forces for the coevolution of viruses and hosts [18]. In this study, we compared the codon bias of SARS-CoV-2 in mink with that of SARS-CoV in ferrets. The residues threonine (T) and tyrosine (Y) had similar codon biases in SARS-CoV-2 and SARS-CoV (Fig 4C), both of which can infect mink and ferrets. The N501T variation appeared mostly in mink, while the N501Y mutation, present only in humans, cannot be explained from the perspective of codon bias, which indicates that these two variations belong to two separate lineages.

The WebLogo diagram in Fig 4C shows that SARS coronaviruses preferentially use U- or A-ending codons. This is consistent with a previous report [38], and G or C nucleotides are poorly represented in the third position of the preferred SARS-CoV-2 codons. This feature may lead to an imbalance in the tRNA pool in infected cells, resulting in reduced host protein synthesis. The substitution rate of C-to-U was the highest in most of the reported sequences in animal species (Fig 1D). This may be because the sequence context surrounding a cytidine strongly influences the probability of its mutation to U [39].
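The comparison of C-to-U and U-to-C substitutions can be illustrated with a minimal sketch. It assumes a sample genome already aligned position by position to a reference such as WIV04 and simply tallies each substitution type. The function names and the toy sequences are hypothetical, and the sketch does not reproduce the authors' variant-calling workflow.

```python
from collections import Counter

VALID_BASES = set("ACGU")


def substitution_spectrum(reference, sample):
    """Count substitution types (e.g. 'C>U') between two aligned sequences.

    Gaps and ambiguous bases are skipped; T is treated as U.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    ref_rna = reference.upper().replace("T", "U")
    alt_rna = sample.upper().replace("T", "U")
    spectrum = Counter()
    for ref, alt in zip(ref_rna, alt_rna):
        if ref in VALID_BASES and alt in VALID_BASES and ref != alt:
            spectrum[f"{ref}>{alt}"] += 1
    return spectrum


def fraction_from_base(spectrum, base):
    """Share of all substitutions whose reference base is `base`."""
    total = sum(spectrum.values())
    if total == 0:
        return 0.0
    from_base = sum(n for kind, n in spectrum.items() if kind.startswith(base + ">"))
    return from_base / total


# Toy aligned pair (hypothetical data): three C>U changes and one U>C change.
toy_reference = "CCCCAGUA"
toy_sample = "UUUCAGCA"
toy_spectrum = substitution_spectrum(toy_reference, toy_sample)
print(dict(toy_spectrum))
print("C>U / U>C ratio:", toy_spectrum["C>U"] / toy_spectrum["U>C"])
print("share of substitutions at cytidine:", fraction_from_base(toy_spectrum, "C"))
```

Summing such spectra over all sequences from one host gives both the share of substitutions that start from cytidine and the ratio of C-to-U to U-to-C changes.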
In the mink sequences, we observed 8-fold more C-to-U substitutions than U-to-C substitutions, which was higher than the 3.5-fold excess previously reported in mink [34], suggesting host adaptation of SARS-CoV-2 in mink over time during the ongoing outbreaks on multiple mink farms. In mink, substitutions of G and A were more frequent among nonsynonymous than among synonymous mutations, an observation that needs further analysis. In addition, few sequences of other animal-CoVs, such as those from dogs and lions, are available in the GISAID database, which limits the comparison of base substitutions.

CAI was used to measure the synonymous codon similarities between the virus and host coding sequences. For each animal source of the SARS-CoV-2 sequence, we calculated the average CAI values of the genome and the spike gene (Fig 2F & 2G). Bat-CoV (RaTG13) and SARS-CoV-2 (from humans) had higher CAI values, which indicates that the viruses adapt to their hosts (bat and human) with optimized or preferred codons, while SARS-CoV-2 from dogs had lower CAI values, suggesting that SARS-CoV-2 adapts to dogs with random codons. This finding is consistent with the conclusion that, compared to dogs, humans are favored hosts for adaptation [40]. The whole genome or spike sequence of mink-CoV had a substitution level similar to that of human SARS-CoV-2, pointing to the ongoing adaptation of SARS-CoV-2 to the new host using preferred codons.

The spike protein is critical for virus infection and host adaptation. We observed that three nonsynonymous mutations in the RBM, Y453F, F486L and N501T, emerged independently but were rarely observed in human lineages; these residues are directly involved in contacts at the surface of the S-ACE2 complex and are therefore relevant to new-host adaptation.
Other mutations within the RBM should also be monitored to prevent viral transmission and to further track the source. In addition to mutations of the RBD, variations in the cell epitopes of the spike protein should also be considered, and monitoring the potential consequences of cell epitope variations during viral transmission helps to adjust the vaccine strategy.

The evolutionary entropy of specific sites on the spike protein from all the GISAID sequences on February 1, 2021. (B) The WebLogo plots summarize the amino acid divergence of Spike sequences characterized in this study. The single letter amino acid (aa) code is used with the vertical height of the amino acid representing its prevalence at each position in the polypeptide (aa 18, 222, 477, 501, 570, 614, 982

References

From People to Panthera: Natural SARS-CoV-2 Infection in Tigers and Lions at the Bronx Zoo
SARS-CoV-2 infection in farmed minks, the Netherlands. Euro surveillance: bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 2020
Genome Sequence of SARS-CoV-2 in a Tiger from a U.S. Zoological Collection
Several gorillas test positive for COVID-19 at California zoo-first in the world
Three snow leopards test positive for coronavirus, making it the sixth confirmed animal species
et al. Clinical and Pathological Findings in SARS-CoV-2 Disease Outbreaks in Farmed Mink (Neovison vison)
Roles for Synonymous Codon Usage in Protein Biogenesis
Viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences
Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees
Molecular Evolutionary Genetics Analysis across Computing Platforms
SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects
The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications
CAIcal: a combined set of tools to assess codon usage adaptation
DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution
A Tool to Obtain Structural Guidance in Biocatalytic Investigations
Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage
Predicting gene expression level from codon usage bias
AXL is a candidate receptor for SARS-CoV-2 that promotes infection of pulmonary and bronchial epithelial cells
Structure of SARS coronavirus spike receptor-binding domain complexed with receptor
Evaluation of the Effect of D614g, N501y and S477n Mutation in Sars-Cov-2 through Computational Approach
Could Mustelids spur COVID-19 into a panzootic
Potential zoonotic sources of SARS-CoV-2 infections
Further information on possible animal sources for human COVID-19
Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans
An Overview of SARS-CoV-2 and Animal Infection
Companion animals likely do not spread COVID-19 but may get infected themselves
Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic acids research 1992
SARS-CoV-2 Codon Usage Bias Downregulates Host Expressed Genes With Similar Codon Usage
Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term Evolutionary Trajectories
Analysis of codon usage of severe acute respiratory syndrome corona virus 2 (SARS-CoV-2) and its adaptability in dog