key: cord-1055136-pr9n0r1g authors: Lee, Jonathan T.; Yang, Qingyi; Gribenko, Alexey; Perrin, B. Scott; Zhu, Yuao; Cardin, Rhonda; Liberator, Paul A.; Anderson, Annaliesa S.; Hao, Li title: Genetic surveillance of SARS-CoV-2 Mpro reveals high sequence and structural conservation prior to the introduction of protease inhibitor Paxlovid date: 2022-03-30 journal: bioRxiv DOI: 10.1101/2022.03.29.486331 sha: bb2585eb9899de5794061545173b013f43eaa528 doc_id: 1055136 cord_uid: pr9n0r1g SARS-CoV-2 continues to represent a global health emergency as a highly transmissible, airborne virus. An important coronaviral drug target for treatment of COVID-19 is the conserved main protease (Mpro). Nirmatrelvir is a potent Mpro inhibitor and the antiviral component of Paxlovid™. The significant viral sequencing effort during the ongoing COVID-19 pandemic represented a unique opportunity to assess potential nirmatrelvir escape mutations from emerging variants of SARS-CoV-2. To establish the baseline mutational landscape of Mpro prior to the introduction of Mpro inhibitors, Mpro sequences and its cleavage junction regions were retrieved from ∼4,892,000 high-quality SARS-CoV-2 genomes in GISAID. Any mutations identified from comparison to the reference sequence (Wuhan-hu-1) were cataloged and analyzed. Mutations at sites key to nirmatrelvir binding and protease functionality (e.g., dimerization sites) were still rare. Structural comparison of Mpro also showed conservation of key nirmatrelvir contact residues across the extended Coronaviridae family (alpha-, beta-, and gamma-coronaviruses). Additionally, we showed that over time the SARS-CoV-2 Mpro enzyme remained under purifying selection and was highly conserved relative to the spike protein. Now, with the EUA approval of Paxlovid and its expected widespread use across the globe, it is essential to continue large-scale genomic surveillance of SARS-CoV-2 Mpro evolution. This study establishes a robust analysis framework for monitoring emergent mutations in millions of virus isolates, with the goal of identifying potential resistance to present and future SARS-CoV-2 antivirals. Importance The recent authorization of oral SARS-CoV-2 antivirals, such as Paxlovid, has ushered in a new era of the COVID-19 pandemic. Emergence of new variants, as well as selective pressure imposed by antiviral drugs themselves, raise concern for potential escape mutations in key drug binding motifs. To determine the potential emergence of antiviral resistance in globally circulating isolates and its implications for the clinical response to the COVID-19 pandemic, sequencing of SARS-CoV-2 viral isolates before, during, and after the introduction of new antiviral treatments is critical. The infrastructure built herein for active genetic surveillance of Mpro evolution and emergent mutations will play an important role in assessing potential antiviral resistance as the pandemic progresses and Mpro inhibitors are introduced. We anticipate our framework to be the starting point in a larger effort for global monitoring of the SARS-CoV-2 Mpro mutational landscape. Introduction extracted from the genome-wide alignments. To prepare for selection analysis, sequences with 156 entries of N or with deletions (-) were filtered out and STOP codons were replaced with triplet of identified using Biovia Discovery Studio Visualizer (version 4.5, Dassault Systèmes) and then 177 manually inspected to confirm. All structural models of the M pro protein were rendered using the 178 Biovia Discovery Studio Visualizer software. (Table S1 ). Twenty-six amino acids 193 were selected as active site residues because they have at least one heavy atom within 4.5Å of the 194 common ligand PRD_002214. PRD_002214 is a Michael acceptor-based peptidomimetic 195 inhibitor, known as N3, developed previously to target M pro from multiple CoVs (21-24). Since 196 then, this inhibitor has been used in broad CoV M pro enzymatic and co-crystallographic studies, 197 including the first reported SARS-CoV-2 M pro crystallographic structure (33). The sequence homology comparison of these 26 amino acid residues in M pro across different 199 CoVs is shown in Fig. 1A . The key interaction amino acids are also indicated by arrows colored 200 by their location at the binding site (Fig. 1B) . The catalytic site residues (His41 and Cys145), as 201 well as the S1 pocket residues (His163, Glu166 and His172), which tightly interact with P1 202 pyrrolidinone lactam of nirmatrelvir and N3 ligands, were identical in each of the CoV M pro 203 sequences. Amino acids at S2 and S4 pockets showed slightly more diversity compared to those 204 at S1. The S2 Met49 or Met16 residues become Leu in other β-CoV proteases or Thr in α-CoV 205 proteases (Fig. 1A) . The S4 amino acids indicated by the green arrows in Fig. 1A showed even 206 greater diversity compared to those in S2. Though the S2 and S4 amino acids are not completely 207 conserved across different proteases, they still share high sequence similarity. Superposition of the crystal structures of the 12 CoV M pro enzymes illustrated that while they are from different increasing, with the rise for the S protein being more dramatic compared to M pro and RdRp (Fig. 231 3B). 232 The key driver for the evolution of SARS-CoV-2 and numerous VOCs has primarily been 233 adaptive amino acid changes observed in the S protein that have enabled evasion of vaccine-234 elicited immunity or neutralization by mAb therapeutics (34) (35) (36) (37) (38) (39) (40) . Other than the selection 235 imposed due to its essential function in viral replication, and unlike S, M pro has not been 236 subjected to vaccine or antiviral pressure to evolve. It is expected that essential function proteins 237 like M pro are under purifying (negative) selection with a signature non-synonymous-to-238 synonymous substitution ratio (dN/dS) of less than 1. We conducted a selection analysis using 239 three independent downsampled datasets of three genes, M pro , RdRp and S, with ~25K sequences 240 in each dataset. The overall mean dN/dS (ω) for M pro , RdRp and S were 0.427, 0.397, and 0.543, 241 respectively. They were all lower than 1 and the dN/dS ratios for M pro and RdRp were lower than 242 that for S, suggesting that M pro and RdRp were under stronger purifying selection compared to S. The nucleotide diversity (π) of M pro was estimated as 6.58E-04, which was lower than that for 244 RdRp (1.02E-3) and S (2.63E-3). Variation of the codon-based dN/dS ratio in M pro was also 245 examined using a Bayesian sliding window model (Fig. S1 ). Overall, the codon-based dN/dS 246 profile was similar across three independent downsampled datasets. The mean dN/dS ratio across 247 305 codons in M pro ranged from 0.117 to 0.750, except at residue 46 from one of the 248 downsampled datasets. The regions near residues 144 and 289 had lower dN/dS ratios compared 249 to other regions of the protein, indicating that amino acid changes in these regions were not 250 favored, and implying that these domains might play critical roles in M pro function. From examination of the M pro gene across >4.8 million SARS-CoV-2 genomes, the most K88R and G15S (Fig. 4) . P132H, with the highest frequency of 6.15%, is exclusively associated 254 with the Omicron VOC (B.1.1.529 or BA.1/2). Prior to the enormous influx of Omicron cases, 255 the frequency of P132H was as low as 0.012%. All prevalent M pro mutations with 256 occurrences >5,000 are listed in Table S2 , together with their geographic and genetic lineage 257 distribution. These are associated with different emergent VOC/VOIs. None of the prevalent 258 mutations mapped to residues critical for nirmatrelvir activity (e.g., proximity of nirmatrelvir 259 binding pocket as shown in Fig. 1 , or dimerization interface, as shown in Fig. S2 ). Genetic diversity of M pro within variants of concern/interest (VOCs/VOIs) 262 In addition to the five current VOCs, two current VOIs (Lambda and Mu) and three former VOIs 263 (Eta, Iota, and Kappa) have been identified by the WHO (7). Although emerging variants of 264 SARS-CoV-2 are defined by accumulation of mutations in S (8), it is critical to monitor 265 mutational changes in other viral proteins, such as M pro . All M pro protein mutations were 266 retrieved for each individual VOC/VOI. Aside from the Beta, Lambda and Omicron variants, the 267 majority of isolates from each of the remaining VOCs/VOIs had M pro sequences that were 268 identical to the reference sequence (Wuhan-hu-1) (Fig. 5A ). The P132H mutation was detected 269 in >98% of Omicron isolates, whereas the most prevalent mutations in Lambda and Beta isolates 270 were G15S and K90R, respectively (Fig. 5A ). K90R is a conservative substitution and is not 271 expected to induce changes in the 3D structure of the protease, while Gly15 is referred to as a 272 "C' residue" of the N-terminal α-helix (41, 42), a position with heavy preference for Gly. G15S According to the co-crystal structure of M pro bound to nirmatrelvir reported earlier (11), nine key 287 residues were identified: His41, Met49, Gly143, Cys145, His163, His164, Met165, Glu166 and 288 Gln189 ( Fig 6A) . His41 and Cys145 are catalytic residues, while the remaining residues 289 establish direct contacts with nirmatrelvir. Any changes in these residues may affect inhibitor 290 binding. Examination of >4.8 million SARS-CoV-2 genomes illustrated that these nine residues 291 within M pro were highly conserved, with substitution frequencies of < 0.028% (Fig. 6B ). Among 292 these nine contact residues, one amino acid residue (His163) was not found to be mutated, and 293 five residues (His41, Gly143, Cys145, His164, Glu166) were extremely conserved with ≤ six 294 isolates identified that carry alternative amino acids. Met49, Met165, and Gln189 had more 295 amino acid changes but still at a frequency of <0.028%. Another factor that would significantly affect M pro activity and catalytic efficiency is divergence 297 from the consensus substrate recognition sequence, which always contains Gln directly upstream 298 of the cleavage position (position P1). Preceding this (position P2) is a hydrophobic amino acid. At cleavage sites within the SARS-CoV-2 reference isolate Wuhan-hu-1, this is most commonly Table S3 . 305 We investigated the mutation frequency of >4.8 million isolates at the 11 M pro substrate cleavage 306 sites and neighboring residues along ORF1ab to assess sequence conservation. In total, 445 307 unique amino acid changes were identified within five residues of the cleavage sites (Table S4) . Despite being the most conserved amino acid among the 11 recognition sites on the Wuhan-hu-1 309 reference, the P1 Gln was not the most conserved residue among the examined isolates. Rather, 310 both the P2 and P1' positions had fewer mutations overall. In total, 7,282 instances of 311 substitution at position P1 were observed with >98% of those cases being Gln to His (Table S4) . Over 5,000 cases of this mutation were at the Nsp8-Nsp9 junction, with no more than 1,000 313 changes from the Gln consensus at P1 detected at any of the other 10 cleavage sites (Table S4) . (Table S4) . 320 M pro dimerization is critical for enzyme function and the strength of the interprotomer contact 321 can directly affect protease activity (44-47). Given the importance of dimerization, we performed 322 analysis of amino acid residue conservation at this interface (Table 1) . That interface is formed 323 by the N-terminal tail of each protomer inserted between the two subunits of the enzyme, with 324 many residues forming a complex network of interactions. Seventeen residues predicted to 325 impact dimerization through interaction with one another were identified (Fig. S2 ). As predicted 326 from the dimerization requirement for enzyme activity, these residues were also highly 327 conserved with a mutation frequency of <0.11% across the >4.8 million SARS-CoV-2 genomes 328 examined (Table 1) previously (11). In addition to the residues forming nirmatrelvir binding sites, variation in 353 residues at the M pro dimer interface was also monitored, as self-association is critical for protease 354 activity. Although not all residues at the interface have been proven to be functionally important, 355 it is conceivable that amino acid substitutions at positions that are spatially close to each other may introduce favorable or unfavorable interactions. In turn, this could result in changes in 357 subunit association and, correspondingly, an impact on enzyme activity and/or nirmatrelvir 358 binding. Our selection analysis on M pro demonstrated that the protein is under strong purifying selection 360 with a non-synonymous-to-synonymous mutation ratio (dN/dS) of less than 1. This is consistent (57). Therefore, as more sampled viral isolates undergo 394 nirmatrelvir selection, and as more sequences become available in GISAID, our analysis 395 workflow is prepared to detect the emergence of potential escape mutations. Moving forward, 396 genomic surveillance of M pro will be needed to continuously assess risk for antiviral resistance, 397 specifically in the context of Paxlovid treatment of patients with active SARS-CoV-2 infection. In conclusion, results of our extensive sequence analysis across nearly 4.9 million global SARS- This research received no specific grant from any funding agency in the public, commercial, or 423 not-for-profit sectors. This study was solely sponsored by Pfizer, Inc. All authors disclose that 424 they are employees of Pfizer and some of the authors are shareholders in Pfizer, Inc. Liu Table S2 . Tables Table 1 . Mutation breakdown at M pro dimerization interface residues. Table S4 . Mutation frequency at M pro cleavage sites and neighboring residues. Table S1 . B) Percent sequence identity, similarity and RMSD (Cα) of 26 amino acids at the nirmatrelvir binding site for β-CoVs, α-CoVs, and IBV-CoV (γ-CoV). Identity and similarity values range from 50-100 and RMSD (Cα) values range from 0.30-1.02 in their respective color mapping scales. Only P132H, characteristic of the Omicron variant, exceeds 100,000 cases and no residues interact with nirmatrelvir (shown in red). The full geographic and lineage breakdown of these mutations can be found in Table S2 . A) M pro variant distribution is shown for each VOC/VOI. The five most prevalent sequences for each lineage are shown as colored bars (blue, gold, red, purple, green), with the cumulative remaining sequences in gray. The most prevalent sequence (blue) corresponds to the Wuhan-Hu-1 sequence (WT) and is found in all but three lineages. For these remaining lineages (Omicron, Lambda and Beta), each characteristic non-synonymous substitution is designated as a pattern: P132H (hashes), G15S (diamonds), and K90R (squares). B) Relative mutation frequency among Delta variant isolates. Position of the four most prevalent mutation sites found in this variant (K88, K90, I259, A260) are shown on the protein structure (WT). C) Relative mutation frequency among Omicron variant isolates. Position of the three most prevalent mutation sites (K90, P132, T169) are shown on the protein structure. Figure S1. Median dN/dS ratio (ω) and 95% credibility interval along the M pro gene. The ratio of non-synonymous-to-synonymous substitutions (d N /d S ) was calculated using MCMC from three independent subsampling sets of the GISAID sequences (red, blue, and green) to assess the M pro sequence stability. Points represent the d N /d S at each codon and dotted lines represent the average d N /d S for the gene. Vertical gray lines indicate codons for contact residues. Codons with d N /d S above 1 (dotted line) indicate a greater probability for non-synonymous mutations, while those below 1 are more conserved and less favored for amino acid changes. The CI alludes to higher d N /d S values around residues 46 and 132 (the second peak aligns with P132H). US (15%), Switzerland (7.2%) 8%) B.1.1.7 (12.3%), AY.44 (11.9%), AY.122 (10.4%), B.1.177 (5.2%) B.1.617.2 (24.1%), AY.122 (23%) US (11.5%), UK (9.6%), Canada (8.9%) US (36.9%), UK (9.6%), Switzerland (7.2%) 8%), UK (30.6%) AY.4 (13.6%), B.1.617.2 (9