key: cord-0696716-jh8d8wbk
authors: Cotten, Matthew; Robertson, David L.; Phan, My V.T.
title: Unique protein features of SARS-CoV-2 relative to other Sarbecoviruses
date: 2021-05-12
journal: bioRxiv
DOI: 10.1101/2021.04.06.438675
sha: fd959612ba1d502585f4b64842e82736ad4b732c
doc_id: 696716
cord_uid: jh8d8wbk

Defining the unique properties of SARS-CoV-2 protein sequences has the potential to explain the range of Coronavirus Disease 2019 (COVID-19) severity. To achieve this, we compared proteins encoded by all Sarbecoviruses using profile Hidden Markov Model (pHMM) similarities to identify protein features unique to SARS-CoV-2. Consistent with previous reports, a small set of bat- and pangolin-derived Sarbecoviruses show the greatest similarity to SARS-CoV-2 but are unlikely to be the direct source of SARS-CoV-2. Three proteins (nsp3, spike and orf9) showed regions that differ between the bat Sarbecoviruses and SARS-CoV-2, indicating virus protein features that might have evolved to support human infection and/or transmission. Spike analysis identified all regions of the protein that have tolerated change and revealed that the current SARS-CoV-2 variants of concern (VOCs) have sampled only a fraction (~31%) of the possible spike domain changes that have occurred historically in Sarbecovirus evolution. This result emphasises the evolvability of these coronaviruses and the potential for further change in virus replication and transmission properties over the coming years.

We sought to define the distance of any query virus genome from the early SARS-CoV-2 genome that first began to infect humans in December 2019. To give two levels of resolution, we generated overlapping 44 or 15 amino acid (aa) peptides from all early lineage B SARS-CoV-2 encoded proteins, then prepared pHMMs using HMMER-3 (see Figure 1a). The resulting libraries of pHMMs were then used to survey domain diversity across query coronaviruses relative to the initial 2019 SARS-CoV-2.
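The peptide-generation step described above can be illustrated with a minimal sketch. The text gives the window lengths (44 and 15 aa) but not the overlap, so the step sizes used here (22 and 5) are assumptions chosen only for illustration. The function name `overlapping_peptides` and the toy sequence are likewise hypothetical, and the sketch does not cover the alignment and pHMM-building steps performed with HMMER-3.

```python
from typing import Iterator, Tuple


def overlapping_peptides(protein: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) pairs that tile `protein` with fixed-length windows.

    `step` is an assumed parameter: the source states the window lengths
    (44 or 15 aa) but not how far consecutive windows are shifted.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(protein) <= window:
        # Protein shorter than one window: return it whole.
        yield 0, protein
        return
    last_start = len(protein) - window
    start = 0
    while start < last_start:
        yield start, protein[start:start + window]
        start += step
    # Anchor a final window at the C-terminus so no residues are left uncovered.
    yield last_start, protein[last_start:]


# Toy 100-residue sequence for illustration only (not a real viral protein).
toy_protein = "ACDEFGHIKLMNPQRSTVWY" * 5

# Two levels of resolution, mirroring the 44 aa and 15 aa peptide sets.
peptides_44 = list(overlapping_peptides(toy_protein, window=44, step=22))
peptides_15 = list(overlapping_peptides(toy_protein, window=15, step=5))

print([start for start, _ in peptides_44])  # [0, 22, 44, 56]
print(len(peptides_15))                     # 18
```

Anchoring the last window at the C-terminus is a design choice of this sketch, made so that every residue falls in at least one peptide; the original pipeline may handle protein ends differently.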
For each pHMM match to a related sequence, a bit-score is generated, which provides a metric for how close the query sequence is to the pHMM (Figure 1b). These bit-scores, when collected across an [...]

We next focused on the bat Sarbecoviruses with closest similarity to SARS-CoV-2 in at least part of their genomes due to recombinant histories (see Supplementary Table 1 [...]

Global proteome similarities. As described in Figure 2 [...]

[...] 1.351, B.1.525, P.1, B.1.617.2 and A.23.1), the patterns suggest that SARS-CoV-2 has a great deal [...]

In a second analysis we examined a peptide sequence spanning the important furin cleavage [...]

[...] c. The proximal origin of SARS-CoV-2 [...]

Genome name    GenBank or GISAID    Year    Reference
RpYN06         EPI_ISL_1699446      2020    unpublished
RmYN02         EPI_ISL_412977       [...]

We thank all global SARS-CoV-2 sequencing groups for their open and rapid sharing of sequence [...]