key: cord-0864082-1nn6izc6 authors: Vahed, Majid; Calcagno, Tess M; Quinonez, Elena; Mirsaeidi, Mehdi title: Impacts of 203/204: RG>KR mutation in the N protein of SARS-CoV-2 date: 2021-01-14 journal: bioRxiv DOI: 10.1101/2021.01.14.426726 sha: f23ff8d6ca52119f344074d2f844966e68a0c796 doc_id: 864082 cord_uid: 1nn6izc6

We present a structure-based model of phosphorylation-dependent binding and sequestration of the SARS-CoV-2 nucleocapsid protein and the impact of two consecutive amino acid changes, R203K and G204R. Additionally, we studied how mutant strains affect HLA-specific antigen presentation and correlated these findings with HLA allelic population frequencies. We discovered that RG>KR mutated SARS-CoV-2 expands the ability for differential expression of the N protein epitope on Major Histocompatibility Complexes (MHC) of varying Human Leukocyte Antigen (HLA) origin. For the N protein LKR region at positions 203 and 204, wild type (SARS-CoVs and SARS-CoV-2) yielded HLA-A*30:01 and HLA-A*30:21 predictions, whereas mutant SARS-CoV-2 yielded HLA-A*31:01 and HLA-A*68:01 predictions. HLA-A genotypes associated with the mutant strain occurred more frequently in all populations studied.

Importance The novel coronavirus known as SARS-CoV-2 causes a disease known as 2019-nCoV (or COVID-19). HLA allele frequencies worldwide could positively correlate with the severity of coronavirus cases and a high number of deaths.

The presentation of COVID-19, ranging from an asymptomatic presentation to disabling multi-organ failure, has led to the study of differential host-pathogen interaction, specifically relating to population-based Human Leukocyte Antigen (HLA) allele frequencies (4, 5).

ORF1a expresses 11 nonstructural proteins (Nsps), Nsp1 to Nsp11, while the genes of ORF1b express Nsp12 to Nsp16 (6, 7). The major structural proteins, including the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins, are encoded by other ORFs. The N proteins of SARS-CoV are highly basic structural proteins localized in the cytoplasm and the nucleolus of Trichoplusia ni BT1 Tn 5B1-4 cells (8). Previous studies have indicated that the N proteins of other coronaviruses are extensively phosphorylated and bound to viral RNA to form a helical ribonucleoprotein (RNP) that comprises the viral core structure (9). Recently, mutations in the N segment of SARS-CoV-2 have been reported (10). Two replacements, R203K and G204R, in the N protein have been found in several countries, but their potential effects on the protein structure have not been discussed.

We present a structure-based model of phosphorylation-dependent binding and sequestration of the SARS-CoV-2 nucleocapsid protein and the impact of two consecutive amino acid changes, R203K and G204R. Additionally, we studied how mutant strains affect HLA-specific antigen presentation and correlated these findings with HLA allelic population frequencies.
ClustalW multiple sequence alignment was used to align the LKR (68 nucleotides long) of the CoV N protein for bat/pangolin models and the human models SARS-CoVs/MERS-CoV. Notably, the Gly at position 204 is conserved among the closely related coronaviruses (bat coronavirus, pangolin, SARS-CoV, and SARS-CoV-2) but is variable for SARS-CoV2n (Fig. 1(a)). The clustering trees of the coronaviruses are displayed in Fig. 1(b). Sequence alignments suggested that other coronavirus N proteins might share the same structural organization, based on intrinsic disorder predictor profiles and secondary structure predictions (Fig. 2).

A plot of the N-protein residues indicates the local model quality by showing knowledge-based energies as a function of amino acid sequence position. A detailed energy calculation for the wild and mutant types revealed that residues 203/204 are located in the area with the highest energy level (Fig. 3). The mutant type showed slightly higher free energy at residues 203/204 (KR) than the wild type, suggesting enhanced structural flexibility and an increased tendency to form a coil or a bend in the secondary structure. The relative orientation of the NTD and CTD, as well as the conformations of the disordered regions (N-arm, LKR, and C-tail), are drawn randomly to reflect the dynamic nature of the N protein (Fig. 3).

We also analyzed predicted phosphorylation sites of the N-protein using the NetPhos 3.1 server (http://www.cbs.dtu.dk/services/NetPhos/, accessed on 14 Sep 2020). The linker region (LKR) of the SARS-CoV N-protein contains a Ser/Arg-rich region with a high number of putative phosphorylation sites (a.a. 172-206). The contiguous amino acid changes 203R>K and 204G>R are located in the SR-rich region, which is known to be intrinsically disordered.
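As a minimal sketch of the two contiguous substitutions described above, the following Python snippet applies 203R>K and 204G>R to a short stretch of the linker region and reports the formal side-chain charge before and after. The ten-residue fragment (assumed here to cover residues 200-209 of the reference N protein), the helper names, and the simple +1/-1 charge count are illustrative assumptions rather than part of the original analysis; the fragment should be checked against the reference sequence before reuse.

```python
# Hypothetical fragment assumed to span residues 200-209 of the N protein.
# Verify against the reference sequence before relying on it.
FRAGMENT_START = 200
WILD_FRAGMENT = "GSSRGTSPAR"

# Substitutions keyed by residue number: (expected wild residue, mutant residue).
RG_TO_KR = {203: ("R", "K"), 204: ("G", "R")}


def apply_substitutions(fragment, start, substitutions):
    """Return a copy of `fragment` with the given point substitutions applied."""
    residues = list(fragment)
    for position, (wild, mutant) in substitutions.items():
        index = position - start
        if not 0 <= index < len(residues):
            raise ValueError(f"position {position} lies outside the fragment")
        if residues[index] != wild:
            raise ValueError(
                f"expected {wild} at {position}, found {residues[index]}"
            )
        residues[index] = mutant
    return "".join(residues)


def formal_charge(fragment):
    """Crude charge count: +1 per Lys/Arg, -1 per Asp/Glu (illustrative only)."""
    positive = sum(fragment.count(residue) for residue in "KR")
    negative = sum(fragment.count(residue) for residue in "DE")
    return positive - negative


mutant_fragment = apply_substitutions(WILD_FRAGMENT, FRAGMENT_START, RG_TO_KR)
print(WILD_FRAGMENT, formal_charge(WILD_FRAGMENT))
print(mutant_fragment, formal_charge(mutant_fragment))
```

In this simplified count, R203K leaves the charge unchanged (both residues are basic), while G204R adds one positive charge to the fragment.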
We predicted a nonspecific kinase phosphorylation site at Ser202 and kinase-specific (CDK5, RSK, and GSK3) phosphorylation sites at Ser206, all of which are close to the RG>KR mutation (Fig. 4). When Ser202/206 and Thr205 are phosphorylated, charge neutralization of the nearby positively charged side chains likely takes place. The G204R mutation decreases the conformational entropy of neutralization by increasing the positive charge in the vicinity of the negatively charged P-Ser202/205 phosphate groups (Fig. 5).

Subcellular localization of the N-protein was predicted using the DeepLoc 1.0 neural network algorithm. The resulting values of 0.861 and 0.913, obtained for the wild and mutant N protein respectively, suggest that the protein is predominantly found in the nucleus (Fig. 6a, b). The N-protein position prediction graph (Fig. 6c, d) confirmed a long peak in the mutation areas K203 and R204, defined based on SARS-CoV-2 data.

We used the Immune Epitope Database (IEDB) to determine linear B-cell epitopes, using the incorporated Chou & Fasman Beta-Turn prediction module with a threshold of 1.07. We supplied the FASTA sequence of the targeted protein as input and kept all default parameters. The LKR region (a.a. 170-206) of the N-protein showed significant antigenic epitopes with potential binding to B lymphocytes (Fig. 7). The IEDB software also predicted epitopes based on N-protein conformation and residue exposure, and independently graphed a broad peak in the mutation areas K203 and R204, the likely epitope regions defined based on SARS-CoV-2 data (Fig. S2).

We found that MHC polymorphism typically results in differential MHC epitope recognition within the N-protein LKR K203 R204 region of wild-type SARS-CoV-2 as compared to mutant SARS-CoV-2. Wild-type SARS-CoV-2 yielded HLA-A*30:01 and HLA 31:01 predictions, whereas mutant SARS-CoV-2 yielded HLA-A*31:01 and HLA-A*68:01 predictions (Tables 1, 2).
The frequency of HLA class I representation within South American, Japanese, and Iranian populations was recorded for both wild-type and mutant strains. HLA class I molecules associated with mutant strains occurred more frequently in all three population groups, with the largest increase seen in the South American population (18.38% for the wild type versus 28.25% for the mutant).

Variations in host Human Leukocyte Antigen (HLA) gene expression can influence antigenic presentation of coronavirus epitopes. The nucleocapsid (N) proteins of many coronaviruses are highly immunogenic and are expressed abundantly during infection. Using genome sequencing and prediction models, we characterized a mutation in the N-protein that affected the viral structure, function, and immunogenicity.

The wild-type N protein is known to be a representative antigen for the T-cell response in a vaccine setting, inducing SARS-specific T-cell proliferation and cytotoxic activity (14, 15). We discovered that RG>KR mutated SARS-CoV-2 expands the ability for differential expression of the N protein epitope on Major Histocompatibility Complexes (MHC) of varying Human Leukocyte Antigen (HLA) origin.

Low HLA-A*68:01 expression could correlate with the low number of COVID-19-attributable deaths seen in Japan as compared to other industrialized countries. Importantly, we found that RG>KR mutated SARS-CoV-2 expands the ability for differential expression of the N protein epitope on Major Histocompatibility Complexes (MHC) of varying Human Leukocyte Antigen (HLA) origin.

We used the BLASTP program from the NCBI database search (23) to find the LKR N-protein sequence (43 nucleotides long) of all SARS-CoV-2. Conserved and varied residues were identified using the WebLogo program (24-26). Multiple alignments of full-length N-protein sequences were performed on the EMBL-EBI server. Clustal Omega was used, which applies the mBed algorithm for guide trees. ClustalW alignment tools were used to produce the alignment output format (27).
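The identification of conserved and varied residues can be sketched as a per-column residue tally over an alignment, which is the kind of count a sequence logo summarizes. This is a simplified illustration and not the WebLogo procedure itself; the function names are hypothetical and the three short aligned sequences are made-up placeholders, not data from this study.

```python
from collections import Counter


def column_profiles(aligned_sequences):
    """Return one Counter of residues per alignment column."""
    if len({len(sequence) for sequence in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    return [Counter(column) for column in zip(*aligned_sequences)]


def variable_columns(aligned_sequences):
    """Return 0-based indices of columns that contain more than one residue."""
    profiles = column_profiles(aligned_sequences)
    return [index for index, profile in enumerate(profiles) if len(profile) > 1]


# Placeholder alignment for illustration only (not study data).
toy_alignment = ["SSRGTS", "SSRGTS", "SSKRTS"]

for index in variable_columns(toy_alignment):
    print(index, dict(column_profiles(toy_alignment)[index]))
```

Columns with a single residue type are fully conserved in the toy input; the remaining columns are the candidates for closer inspection.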
We analyzed all sequences available up to September 7, 2020.

The atomic coordinates of the N-terminal domain (NTD) and C-terminal domain (CTD) were obtained from structures available in the Protein Data Bank (http://www.rcsb.org/pdb) (PDB ID: 6M3M, 6WZO) (28, 29). The tertiary structure of the full 419 a.a. sequence of the N-protein was predicted using the IntFOLD5 server (PYMOL). Residues 1-419 of the native N-protein (sequence ID: YP_009724397.2) and of the mutant (sequence ID: QIQ08827.1) were used in this study. All of the structures were visualized using PYMOL Chimera software version 1.7.4 (30). The calculation procedure was almost the same as that in our previous works (31-34).

ProSA was used to measure the energy distribution of the N-protein structure (35).

We used the Immune Epitope Database (IEDB) (36) to determine linear B-cell epitopes. We supplied the FASTA sequence of the targeted protein as input and kept all default parameters. We also used the Discotope 2.0 (38) method to predict epitopes based on N-protein conformation and residue exposure.

We used TepiTool, a T-cell epitope prediction tool, for MHC class I and II binding predictions. The IEDB team's recommended settings were used as defaults to automatically select the top peptides (39). For MHC-I binding prediction, the default cutoff is 1.0, i.e., all peptides with percentile rank ≤ 1.0 are selected as predicted peptides. Representative alleles from different HLA supertypes were taken from the panel of 27 reference alleles. The peptide selection criterion in this approach is always the predicted percentile rank. For MHC-II binding prediction, the default cutoff is 10.0, i.e., all peptides with percentile rank ≤ 10.0 are selected as predicted peptides, using the panel of the 26 most frequent alleles.
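The percentile-rank selection rule described above can be sketched as a simple filter. This is an illustration under stated assumptions, not TepiTool's implementation: the `Prediction` record, the function name, and the example rows (peptides, allele labels, and ranks) are hypothetical placeholders; only the cutoffs of 1.0 (MHC-I) and 10.0 (MHC-II) come from the text.

```python
from dataclasses import dataclass

# Default cutoffs stated in the text.
MHC_I_CUTOFF = 1.0
MHC_II_CUTOFF = 10.0


@dataclass(frozen=True)
class Prediction:
    """One peptide/allele binding prediction (hypothetical record layout)."""
    peptide: str
    allele: str
    percentile_rank: float


def select_peptides(predictions, cutoff):
    """Keep predictions with percentile rank <= cutoff, lowest rank first."""
    kept = [p for p in predictions if p.percentile_rank <= cutoff]
    return sorted(kept, key=lambda p: p.percentile_rank)


# Placeholder rows for illustration only (not results from this study).
example_predictions = [
    Prediction("PEPTIDE_1", "allele-1", 2.5),
    Prediction("PEPTIDE_2", "allele-2", 0.4),
    Prediction("PEPTIDE_3", "allele-3", 1.0),
]

selected = select_peptides(example_predictions, MHC_I_CUTOFF)
for prediction in selected:
    print(prediction.peptide, prediction.allele, prediction.percentile_rank)
```

Because the comparison is inclusive, a peptide with a rank exactly equal to the cutoff is retained, matching the "≤" in the stated criterion.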
The subcellular localization of wild and mutant N proteins was predicted using DeepLoc-

Control of COVID-19
The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health - The latest 2019 novel coronavirus outbreak in Wuhan
Novel Wuhan
The development of vaccines against SARS corona virus in mice and SCID-PBL/hu mice
Effects of a SARS-associated coronavirus vaccine in monkeys
HLA targeting efficiency correlates with human T-cell response magnitude and with mortality from influenza A infection
Nucleoprotein of influenza A virus is a major target of immunodominant CD8+ T-cell responses
Kedzierska K. 2016. Towards identification of immune and genetic correlates of severe influenza disease in Indigenous Australians
G-rich VEGF aptamer as a potential inhibitor of chitin trafficking signal in emerging opportunistic yeast infection
Recognition of errors in three-dimensional structures of proteins
Immune epitope database analysis resource
Reliable B cell epitope predictions: impacts of method development and improved benchmarking
TepiTool: A Pipeline for Computational Prediction of T Cell Epitope Candidates

Figure 6. Hierarchical tree-predicted subcellular localizations of N-protein using neural networks algorithm. (a) Wild (203R),(204G). (b) Mutant (203K),(204R). (c) N