key: cord-1022196-26bg2rbd authors: Schuster, Noah A. title: A theoretical analysis of the putative ORF10 protein in SARS-CoV-2 date: 2020-10-26 journal: bioRxiv DOI: 10.1101/2020.10.26.355784 sha: a29341d42e78d0c587648c6dde4804bf8fea7ad5 doc_id: 1022196 cord_uid: 26bg2rbd

Found just upstream of the 3’-untranslated region in the SARS-CoV-2 genome is the putative ORF10, which has been proposed to encode the hypothetical ORF10 protein. Even though current research suggests this protein is not likely to be produced, further investigations into this protein are still warranted. Herein, this study uses multiple bioinformatic programs to theoretically characterize and construct the ORF10 protein in SARS-CoV-2. Results indicate this protein is mostly ordered and hydrophobic with high protein-binding propensity, especially in the N-terminus. Although the effects are minimal, an assessment of twenty-two missense mutations for this protein suggests slight changes in protein flexibility and hydrophobicity. When compared against two other protein models, this study’s model was found to possess higher quality. As such, this model suggests the ORF10 protein contains a β-α-β motif with a β-molecular recognition feature occurring as the first β-strand. Furthermore, this protein also shares a strong phylogenetic relationship with other putative ORF10 proteins in closely related coronaviruses. Despite not yielding evidence for the existence of this protein within SARS-CoV-2, this study does present theoretical examinations that can serve as platforms to drive additional experimental work that assesses the biological relevance of this hypothetical protein in SARS-CoV-2.

An initial outbreak of coronavirus disease 2019 in Wuhan, China has resulted in a massive global pandemic causing, at the time this manuscript was prepared, over 37,000,000 confirmed cases and 1,070,000 deaths.
1 The virus responsible, SARS-CoV-2, contains a single-stranded positive-sense RNA (+ssRNA) genome that is ~29.8 kilobases in length. 2 In particular, the 3'-terminus contains several ORFs encoding four main structural proteins: the spike (S), membrane (M), envelope (E), and nucleocapsid (N) proteins. 3 In addition, many shorter ORFs have been detected and proposed to encode approximately nine different accessory proteins (3a, 3b, 6, 7a, 7b, 8, 9b, 9c, 10). 2, 4 Although prior work has detected the expression of RNA transcripts corresponding to the majority of these proteins, transcripts belonging to ORF10 have not been observed. 4, 5 Furthermore, Pancer et al. 2020 identified multiple SARS-CoV-2 variants with prematurely terminated ORF10 sequences. 6 In their study, they found disease was not attenuated, transmission was not hindered, and replication proceeded similarly to strains possessing intact ORF10 sequences. 6 Altogether, current research has led to uncertainty regarding the biological relevance of ORF10 in SARS-CoV-2, with suggestions that its genome annotation should be revised. As an unintended consequence, the putatively encoded protein's structure and function have remained largely unresolved.

Found upstream of the 3'-untranslated region (3'-UTR), ORF10 (117 nt long) supposedly encodes a protein that is 38 amino acids in length. 7, 8 Although there are sequences homologous to this protein in other closely related coronaviruses (CoVs), none possess any known structure. There is also no experimentally derived crystal structure for the ORF10 protein; however, previous studies that attempted to predict the secondary structure and/or model the protein reveal one α-helix and, depending on the study, two β-strands. 8, 9, 10 This protein was also found to lack significant levels of disorder, although there is bioinformatic evidence that suggests a short molecular recognition feature (MoRF) spans residues 3-7.
11 The ORF10 protein is also hydrophobic, with the α-helix identified as a possible transmembrane helix (TH). 9, 12 The greater implications are not clear, but the ORF10 protein contains high numbers of immunogenic promiscuous cytotoxic T-cell epitopes, primarily on the α-helix. 8, 9 As for a possible function, the ORF10 protein was computationally shown to closely associate with members of the CUL2 ZYG11B complex. 10 It was suggested this protein directly interacts with the substrate adapter ZYG11B, thereby overtaking the complex and modulating certain aspects of ubiquitination to enhance viral pathogenesis. 10 Even so, the mechanisms and residues involved in this interaction have not been described in any great detail. Alternatively, ORF10 may itself act, or serve as a precursor for other RNAs, in regulating gene expression/replication, translation efficiency, or interfering with antiviral pathways. 5 From an evolutionary standpoint, ORF10 was found to be under the influence of strong positive selection (dN/dS = 3.82) without relaxed constraints. 12 This leads to the possibility that ORF10 might encode a conserved and functional protein; however, additional sequence data are necessary to confirm this hypothesis. Overall, very few research efforts have so far investigated the putative ORF10 protein within SARS-CoV-2. As such, more detailed studies could further elucidate its structure and in turn provide additional insights as to a possible function for this protein. Herein, this study uses a strictly bioinformatic approach to theoretically characterize and construct the putative ORF10 protein in SARS-CoV-2.

The reference sequence corresponding to the ORF10 protein in SARS-CoV-2, along with twenty-two variants, was acquired from NCBI's Protein Database; furthermore, sequences from closely related bat and pangolin CoVs were acquired as well (Table 1). Sequences were aligned using MUSCLE on the MEGA-X v10.1.7 software.
13, 14 Both clustering methods were changed to neighbor joining while the other default settings were maintained. Alignment reliability was determined by overall mean distance and calculated using the p-distance substitution model. If the overall mean distance was found to be ≤ 0.7, then the alignment was considered reliable. 15 A single protein tree was constructed using the neighbor-joining method and visualized on MEGA-X. 16 In the construction of this tree, uniform rates and partial deletion were used along with a site coverage cutoff of 95%. The phylogeny was tested using the bootstrap method. This tree was not rooted with any specified outgroup.

Protein disorder predictions were performed using previously described methods. 11, 17 To detect phosphorylation sites, the DEPP server (http://www.pondr.com/cgi-bin/depp.cgi) was used. The grand average of hydropathicity (GRAVY) was determined by ProtParam, along with a hydrophobicity plot generated by ProtScale. 18 For structural predictions, three different online servers (PSIPRED v4.0, Jpred v4.0, and NetSurfP v2.0) were used to determine the probability scores of residues existing in either secondary structures or coils. 19, 20, 21 These values were then averaged to obtain individual residue scores that corresponded to each structural state (helix, strand, coil). The highest score of the three states indicated the placement of [...] that were then averaged, and the most constant range of scores was used to designate the span of amino acids likely to be associating within the TH. Lastly, the propensity of residues to be involved in protein-binding interactions was evaluated using SCRIBER. 24

The webserver IntFOLD was employed to make use of an ab initio modeling approach in constructing the ORF10 protein. 25 Models were evaluated based on IntFOLD's quality and confidence scoring. The best model was then refined using the 3Drefine webserver.
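The alignment reliability check described in this section can be illustrated with a minimal Python sketch. It is not MEGA-X's implementation: the function names, the toy sequences, and the decision to skip gapped sites pair by pair are assumptions made here for illustration only. The p-distance between two aligned sequences is the proportion of compared sites at which they differ, and the overall mean distance is the average of that value over all sequence pairs.

```python
from itertools import combinations

RELIABILITY_CUTOFF = 0.7  # threshold quoted in the text for a reliable alignment


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap are skipped. This gap treatment
    is an assumption of the sketch, not necessarily the setting used in
    the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared


def overall_mean_distance(alignment):
    """Average p-distance over all unordered pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment, not the ORF10 data used in the study.
toy_alignment = ["MGYINVFAFP", "MGYINVFAFP", "MDYINVFAFP", "MGYINAFAF-"]
mean_d = overall_mean_distance(toy_alignment)
is_reliable = mean_d <= RELIABILITY_CUTOFF
```

Under this definition, a mean distance d corresponds to a mean identity of 1 - d, so smaller values indicate a more conserved alignment.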
26 Among the five models generated post-refinement, the model possessing the highest QMEAN Z-score was viewed as most favorable. 27 UCSF Chimera was used to visualize the model. 28 Electrostatic surfaces were generated based on the AMBER ff14SB charge model. Hydrophobicity surfaces were produced according to the default Kyte-Doolittle scale. Quality and structural comparisons against two previously constructed ORF10 protein models were also performed. 8, 29

3. Results

The alignment had an overall mean distance of 0.059, corresponding to 94.10% identity for the entire alignment. The longest conserved region spanned residues 15-19, followed by two shorter regions spanning residues 11-12 and 33-34 (Figure 1a). Few amino acid differences occurred between SARS-CoV-2 and closely related bat and pangolin CoVs. For example, at amino acid site twenty-five SARS-CoV-2 and Pangolin-CoV MP789 possess arginine and serine, respectively. Missense mutations were identified among every SARS-CoV-2 ORF10 protein variant; furthermore, one mutation per variant was observed (Figure 1a; Table 2). Phylogenetic analysis revealed the ORF10 protein in Pangolin-CoV GX-P1E formed a lineage with the greatest evolutionary distance, whereas ORF10 proteins in the remaining CoVs shared much closer relationships with significantly smaller distances (Figure 1b). A per-residue disorder plot for the ORF10 protein indicated that disorder scores within the C-terminal half (20-38) vary, whereas the N-terminal half (1-19) held more uniform disorder scores (Figure 2). No phosphorylation sites were identified in this protein (Table S1). The hydrophobicity plot revealed two hydrophobic regions spanning residues 3-19 and 28-36, along with a single hydrophilic region spanning residues 20-27 (Figure 3a).
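As a rough illustration of how a GRAVY score is obtained, the sketch below sums Kyte-Doolittle hydropathy values over a sequence and divides by its length, which is how ProtParam defines the score. The sequence string is an assumption: it is taken to be the 38-residue ORF10 reference sequence as deposited in public databases and is not stated in this manuscript. The helper names `gravy` and `apply_substitution` are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed ORF10 reference sequence (38 residues); not given in the manuscript.
ORF10_REFERENCE = "MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT"


def gravy(sequence):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def apply_substitution(sequence, mutation):
    """Apply a missense mutation written like 'V30A' (1-based position)."""
    original, replacement = mutation[0], mutation[-1]
    position = int(mutation[1:-1])
    if sequence[position - 1] != original:
        raise ValueError(f"expected {original} at position {position}")
    return sequence[:position - 1] + replacement + sequence[position:]


reference_gravy = gravy(ORF10_REFERENCE)
# V30A is one of the variant mutations listed in Table 2 (SARS-CoV-2var22).
v30a_gravy = gravy(apply_substitution(ORF10_REFERENCE, "V30A"))
```

With the assumed sequence, `reference_gravy` rounds to 0.637, and replacing valine (4.2) with alanine (1.8) at position 30 lowers the score, as expected for a substitution toward a less hydrophobic residue.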
The GRAVY score for the reference SARS-CoV-2 ORF10 protein was 0.637; however, GRAVY scores for all twenty-two variants fluctuated above and below the reference (Figure 3b). For mutations resulting in similar amino acid chemistry, polar residues that mutated into different polar residues had increased GRAVY scores. Interestingly, nonpolar residues that mutated into different nonpolar residues had decreased GRAVY scores except in SARS-CoV-2var9 and var10. For mutations that involved shifts in polarity, all polar to nonpolar mutations had increased GRAVY scores; however, all nonpolar to polar mutations resulted in decreased GRAVY scores except for SARS-CoV-2var8 and var20. As for protein-binding propensity, residues in the N-terminal half held greater propensity scores than residues found in the C-terminal half (Figure 4). According to secondary structure predictions, there exists a single α-helix and β-strand spanning residues 6-21 and 28-34, respectively; furthermore, one TH was predicted to span residues 7-18 (Figure 5; Table S2; Table S3). IntFOLD yielded five models with varying confidence and quality scores (Table S4). The model having a low P-value (3.33e-03) and high quality score (0.3714) was subjected to refinement. Of the five models made post-refinement, the model with a QMEAN Z-score of -0.66 was selected as the most favorable (Table S5). Structurally, the model presented with a β-α-β motif spanning residues 3-31, along with a 3/10-helix spanning residues 34-37 (Figure 6a). An electrostatic surface map revealed two regions of positive charge and one region of negative charge (Figure 6b). Electropositive regions were influenced by R20 and R24 whereas the electronegative region was influenced by D31. As expected, a majority of the protein's surface was hydrophobic; however, hydrophilic regions did appear (Figure 6c).
For example, the residues spanning 20-27 presented with surface coloring that reflected hydrophilic character, as was shown in the hydrophobicity plot as well. Prior to this study, two ORF10 protein models were built using different methods. The first model was constructed using the QUARK web server whereas the second model was built using the I-TASSER web server. 8, 29 Both models were extensively different from the model produced within this study, mainly in terms of structure and topology (Figure 7). Furthermore, the QMEAN Z-scores for both models (QUARK: -3.88; I-TASSER: -2.63) were significantly lower, indicating high levels of error and extremely poor model quality.

The phylogenetic results are similar to those by Hassan et al. 2020 in that none of the variants resulted in the formation of distinct clades. 9 At most, the phylogenetic data in this study suggest the ORF10 protein in Pangolin-CoV GX-P1E is most distantly related to all other ORF10 proteins. With such high similarity observed in the remaining ORF10 proteins, describing relationships between specific sequences is not entirely possible when based solely on sequence data. Although unlikely, perhaps structural comparisons of each protein would assist in revealing further relationships not highlighted by sequence information. As for conserved portions in the ORF10 protein, a possible explanation exists. Assuming the ORF10 protein is synthesized, these conserved regions could be vital to protein structure and/or function. The majority of these residues possess high protein-binding propensity scores; therefore, they may be essential in facilitating specific protein interactions. Mutations in these regions were not created, thus it is unclear if these residues would be essential in maintaining protein structure. Protein disorder predictions revealed the ORF10 protein is ordered. It is only near each terminus that residues begin expressing moderate to high flexibility.
These results align well with those presented by Giri et al. 2020; however, results in this study do contradict those by Hassan et al. 2020. 9, 11 Their study indicates that a greater number of flexible residues, including several disordered residues found within the C-terminus, exist in the ORF10 protein. 9 This disagreement is likely attributable to the method of prediction, where Hassan et al. 2020 used a single prediction algorithm, unlike this study, which used multiple prediction algorithms to evaluate disorder. 9 In doing so, per-residue disorder scores were not overestimated by a single algorithm; therefore, more conservative but reliable predictions were achieved. 17 The data in this study further indicate that mutations within the C-terminus instill changes in residue flexibility; however, such changes did not result in major shifts towards disorder. Conversely, mutations within the N-terminus caused minuscule changes in residue flexibility, but as was noted in the C-terminus, these changes do not result in shifts towards disorder. Based on GRAVY scoring, mutations in the ORF10 protein had some obvious effects on hydropathicity. For example, polar to nonpolar mutations had increased GRAVY scores, which implies an increase in the protein's hydrophobic character. However, certain shifts in polarity cannot be explained as easily. For example, nonpolar to polar mutations presented decreases in GRAVY scores with the exception of SARS-CoV-2var8 and var20. If anything, nonpolar to polar mutations should lower the GRAVY score due to decreases in the protein's hydrophobic character. At this stage, the underlying cause for higher GRAVY scores in these variants remains unclear and requires further investigation. Regardless of the increases and decreases in scoring, the ORF10 protein still remained quite hydrophobic. As such, these mutations are not likely to have major effects on structure and/or function; however, Hassan et al.
2020 do indicate that the majority of these mutations, regardless of their change in polarity, are deleterious and will decrease the protein's structural integrity. 9 Modeling of these mutations would help determine whether or not they would negatively affect the protein's structure. The prediction of secondary structures for the ORF10 protein describes the presence of one α-helix and one β-strand. When compared to the structural assignment of residues in the ORF10 protein model, only 58% of residues match. For example, secondary structure predictions suggest the α-helix spans residues 6-21 whereas in the protein model the α-helix spans residues 10-22. Nonetheless, both results indicate that an α-helix and β-strand do exist in this protein. A second β-strand, occurring before the α-helix, comprised residues 3-8 and was only observed within the model. According to Giri et al. 2020, a MoRF was predicted to exist in residues 3-7; furthermore, secondary structure predictions for this particular region held the lowest prediction scores. 11 A unique feature associated with MoRFs is their ability to exist in semi-disordered (i.e. flexible) states until binding, after which they can assume a designated secondary structure. 30 Altogether, it is possible residues 3-7 (or 3-8) function as a β-MoRF that, upon binding to a different protein, will assume the structure of a β-strand. Residues found in this region also have high protein-binding propensity scores to support this notion.

The ORF10 protein, assuming that it is produced, would most likely be a membrane protein. Whether it is integrated within or peripheral to the membrane remains unclear, but given the extent of nonpolar residues it would seem much more likely to be integrated inside a membrane. In particular, CoVs encode multiple viroporins. 31 For example, in SARS-CoV the 8a protein oligomerizes to form ion channels inside mitochondrial membranes that induce apoptosis by depolarizing membrane potential.
12, 31-33 The ORF10 protein in SARS-CoV-2 may oligomerize in such a way that polar and electrically charged residues coat the inside of a pore that facilitates transport of ions or small molecules for viral replication, virulence, and/or pathogenicity. A prior study found the ORF10 protein colocalized with ORF8 and ORF7b proteins in the endoplasmic reticulum (ER); however, its role in the ER was not described. 34 In this case, the ORF10 protein could associate with ER membranes as a means to assist in replication. Experimental work is necessary to evaluate this finding. Alternatively, the ORF10 protein could interact with another protein. As previously stated, the ORF10 protein was suggested to associate with the substrate adapter ZYG11B. 10 This interaction is likely mediated by residues in the N-terminal half on account of the β-MoRF and higher residue protein-binding propensity scores. Collectively, this study neither provides nor discredits prior evidence regarding the existence of this protein in SARS-CoV-2; however, this study does provide an interesting theoretical examination of this hypothetical protein.

In summary, the putative ORF10 protein appears ordered and hydrophobic; furthermore, this protein may possess a β-MoRF followed by an α-helix and subsequent β-strand. This protein also shares strong phylogenetic relationships with ORF10 proteins in other closely related CoVs. Lastly, the protein model generated in this study is of higher quality than previously generated models. Hopefully, these results can provide a foundation to drive the experimental work necessary to further assess the biological relevance of the putative ORF10 protein in SARS-CoV-2.

I would like to thank the previous researchers who deposited the sequences used in this study onto the NCBI Protein Database. Without access to those sequences, the research described in this manuscript could never have been accomplished.

G2D NP → P SARS-CoV-2var22 V30A NP → NP

Table 2.
Twenty-two SARS-CoV-2 ORF10 protein variants and their corresponding mutations, along with the changes in polarity associated with each mutation. "NP": Nonpolar. "P": Polar.

Figure 4. Protein-binding propensity scores for amino acids within the reference SARS-CoV-2 ORF10 protein. The value of each score is shown above its corresponding box. As a general rule, a higher score reflects an increased likelihood that a specific residue engages in protein-binding.

Coronavirus disease (COVID-19) Weekly Epidemiological Update and Weekly Operational Update. World Health Organization Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China Emerging coronaviruses: Genome structure, replication, and pathogenesis The Architecture of SARS-CoV-2 Transcriptome Direct RNA sequencing and early evolution of SARS-CoV-2 The SARS-CoV-2 ORF10 is not essential in vitro or in vivo in humans Genomic characterization of a novel SARS-CoV-2 ORF10: Molecular Insights into the Contagious Nature of Pandemic Novel Coronavirus 2019-nCoV Notable sequence homology of the ORF10 protein of SARS-CoV-2, Human SARS and Bat SARS-Like Coronaviruses Coding potential and sequence conservation of SARS-CoV-2 and related animal viruses MUSCLE: multiple sequence alignment with high accuracy and high throughput MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms Phylogenetic Trees Made Easy: A How-To Manual The neighbor-joining method: A new method for reconstructing phylogenetic trees A Bioinformatics Approach to the Function, and Evolution of the Nucleoprotein of the Order Mononegavirales The Proteomics Protocols Handbook Protein secondary structure prediction based on position-specific scoring matrices JPred4: a protein secondary structure prediction server NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning Markov model: Application to complete genomes A Combined Transmembrane Topology and Signal Peptide Prediction
Method SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences IntFOLD: an integrated web resource for high performance protein structure and function prediction 3Drefine: Consistent Protein Structure Refinement by Optimizing Hydrogen Bonding Network and Atomic Level Energy Minimization Toward the estimation of the absolute quality of individual protein structure models UCSF Chimera-a visualization system for exploratory research and analysis Analysis of molecular recognition features (MoRFs) Molecular Evolution of Human Coronavirus Genomes Open reading frame 8a of the human severe acute respiratory syndrome coronavirus not only promotes viral replication but also induces apoptosis A Systemic and Molecular Study of Subcellular Localization of SARS-CoV-2 Proteins The author declares that no conflicts of interest exist.