key: cord-1045350-8gyhpoey
authors: Kidera, Akinori; Moritsugu, Kei; Ekimoto, Toru; Ikeguchi, Mitsunori
title: Allosteric regulation of 3CL protease of SARS-CoV-2 and SARS-CoV observed in the crystal structure ensemble
date: 2021-10-27
journal: J Mol Biol
DOI: 10.1016/j.jmb.2021.167324
sha: 62acab4b536f83d384b3d417c7b8b7fe57e8ede1
doc_id: 1045350
cord_uid: 8gyhpoey

The 3C-like protease (3CLpro) of SARS-CoV-2 is a potential therapeutic target for COVID-19. Importantly, abundant structural information is available for this enzyme, solved as complexes with various drug candidate compounds. Collecting these crystal structures (83 Protein Data Bank (PDB) entries) together with those of the highly homologous 3CLpro of SARS-CoV (101 PDB entries), we constructed the crystal structure ensemble of 3CLpro to analyze the dynamic regulation of its catalytic function. The structural dynamics of the 3CLpro dimer observed in the ensemble were characterized by the motions of four separate loops (the C-loop, E-loop, H-loop, and Linker) and the C-terminal domain III on the rigid core of the chymotrypsin fold. Among the four moving loops, the C-loop (also known as the oxyanion binding loop) undergoes the order (active)–disorder (collapsed) transition, which is regulated cooperatively by five hydrogen bonds made with the surrounding residues. The C-loop, E-loop, and Linker constitute the major ligand binding sites, which consist of a limited variety of binding residues including the substrate binding subsites. Ligand binding causes a ligand-size-dependent conformational change to the E-loop and Linker, which further stabilize the C-loop via the hydrogen bond between the C-loop and E-loop. The T285A mutation from SARS-CoV 3CLpro to SARS-CoV-2 3CLpro significantly closes the interface of the domain III dimer and allosterically stabilizes the active conformation of the C-loop via hydrogen bonds with Ser1 and Gly2; thus, SARS-CoV-2 3CLpro seems to have increased activity relative to that of SARS-CoV 3CLpro.

During structure-based drug discovery, the three-dimensional structure of a therapeutic target protein is often determined redundantly in the form of a complex with various drug candidates, which reveals their binding modes at the atomic level [1-4]. Consequently, an abundance of structural information has been accumulated for some target proteins in the Protein Data Bank (PDB) and other databases [5-10]. Ligand interactions can alter the structure of a receptor protein in different ways to produce structural variation in the protein [11, 12]. When examining protein structures, it becomes clear that ligand binding is not the sole cause of structural variation in crystals: crystal packing and sequence alteration also affect structure [13, 14]. Within a set of PDB entries for a protein, different space groups or different crystal packing are frequently found. Crystallographic experiments often use proteins with altered sequences either for mutagenesis experiments or in sample preparation. Together, these factors generate an ensemble of crystal structures, which is termed a "crystal structure ensemble."

According to the concept of protein dynamics, conformational selection [15], and linear response theory [16], any structural changes of a protein, whether they occur naturally or artificially, are reflections of its intrinsic dynamics. Therefore, the crystal structure ensemble can be understood as a sampled subset of the true structural ensemble, although the samples may be limited and biased to some extent because they are not randomly sampled. With respect to the ensemble's reliability, the structural variations in the ensemble are statistically significant observations derived from repeated measurements. For these reasons, the crystal structure ensemble is an important source for the study of protein dynamics [17].

In the present study, we assembled a crystal structure ensemble of 3C-like protease (3CLpro, also known as main protease) from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as well as the highly homologous 3CLpro of SARS-CoV, the etiological agent for SARS in 2002 (Fig. 1). 3CLpro is a cysteine protease belonging to the C30 family (coronavirus 3C-like proteases) of the PA clan (serine/cysteine proteases with a chymotrypsin fold) [7]; it plays a crucial role in viral replication by cleaving the polyprotein to release functional proteins [18-20]. The COVID-19 pandemic caused by SARS-CoV-2 poses an urgent challenge for the development of antiviral agents [21-23]. Inhibitors of 3CLpro are potential candidates for antiviral agents. The crystal structures of 3CLpro complexed with drug candidate molecules have been solved intensively and accumulated in the PDB since the first structure was released in February 2020 (PDB:6lu7 [24]; Supporting data S1 and S2). Here, we also utilize the abundant structure data of 3CLpro from SARS-CoV because of its 96% sequence identity and superimposable structures (Fig. 1). The compiled data in this study are of version 10/25/2020, which contains SARS-CoV-2 3CLpro (83 entries; 113 independent chains) and SARS-CoV 3CLpro (101 entries; 145 independent chains) (Supporting data S1 and S2; see Data used in this study in Materials and Methods for details).

3CLpro is known for its highly dynamic nature. The function of 3CLpro is regulated by the structural transition between the active and collapsed states of the catalytic loop (termed the "C-loop" hereafter; also known as the oxyanion binding loop) [25]. The activation requires dimerization through the additional C-terminal helical domain (domain III) [26]. The ligand binding sites of 3CLpro are located in flexible loops [7]. These dynamical behaviors were investigated in detail based on the crystal structure ensemble. The results of each entry are summarized in Supporting data S1 and S2 for the ligand-bound and ligand-free chains, respectively. First, the overall dynamics is explored to identify the elements of the dynamic structure. Second, the dynamic regulation of the catalytic function in the C-loop is studied. Third, the ligand binding sites and the influence of ligand binding on the protein structure are investigated, and the ligand-induced activation is explained. Finally, the effects of the mutation between the two 3CLpro species on the dimeric structure and catalytic function are examined.

Overall motions in the crystal structure ensemble of 3CLpro

To understand the dynamic regulation of the catalytic function of 3CLpro, the crystal structure ensemble was analyzed according to the following two steps (see Materials and Methods for more details). As the first step, the elements of motion regulating the catalytic function were identified by applying the Motion Tree to the crystal structure ensemble [27, 28]. The Motion Tree is a hierarchical cluster analysis of the variance of interresidue distances and is therefore superposition free. In the Motion Tree based on all 258 independent chains, each cluster is defined as a group of residues moving cooperatively, represented by a branch of the Motion Tree; thus, these groups are termed "moving clusters" (Fig. 2A). The Motion Tree delineates the dynamics of 3CLpro as a set of moving clusters consisting of four flexible loops and domain III situated on the rigid core of the chymotrypsin fold (Fig. 2B). The moving clusters are as follows: the "C-loop" (residues 138-143; the catalysis-related loop, also known as the L1 or oxyanion binding loop [29]; the target fragment for functional regulation), "E-loop" (residues 166-178; loop starting at E166, L2), "H-loop" (residues 38-64; a mostly helical loop), "Linker" (residues 188-196; a linker to domain III, L3), and "domain III" (residues 200-300). All of these clusters appear to move independently of each other (Table S1). As detailed below, three of the four loops, namely the C-loop, E-loop, and Linker, contain major ligand binding sites. Domain III has been reported to play a crucial role in the dimerization required for the activation [25, 26] and is discussed below in relation to mutation effects and the allosteric regulation of catalytic function.
The rigid core consists of the core part of domains I and II (residues 13-17, 33-37, 71-111, 129-137, 154-165, 179-187, and 197-199; chymotrypsin fold) and the dimer interface region (residues 3-12, 18-32, 65-70, 112-128, and 144-153), including the N-finger (residues 1-7), an essential structural component for dimerization and activity [25, 30-32]. Residues 1 and 2, however, are highly flexible and play an important role in the functional regulation.
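The idea behind the first step can be illustrated with a minimal sketch. It is not the published Motion Tree implementation [27, 28]: it only assumes, as stated above, that residues are grouped by how little their mutual distances vary across the ensemble. The average-linkage rule, the cutoff value, the function names, and the four-residue toy coordinates are all assumptions introduced for illustration.

```python
import math
from itertools import combinations


def distance_fluctuation(ensemble):
    """Standard deviation of every inter-residue distance across the ensemble.

    ensemble: list of structures, each a list of (x, y, z) tuples, one per residue.
    Returns a symmetric n x n matrix as a list of lists.
    """
    n_res = len(ensemble[0])
    n_str = len(ensemble)
    fluct = [[0.0] * n_res for _ in range(n_res)]
    for i, j in combinations(range(n_res), 2):
        dists = [math.dist(s[i], s[j]) for s in ensemble]
        mean = sum(dists) / n_str
        sd = math.sqrt(sum((d - mean) ** 2 for d in dists) / n_str)
        fluct[i][j] = fluct[j][i] = sd
    return fluct


def agglomerate(fluct, cutoff):
    """Average-linkage merging (an assumption) until the closest pair of
    clusters fluctuates by more than `cutoff`. Returns sorted clusters."""
    clusters = [[i] for i in range(len(fluct))]

    def link(a, b):
        return sum(fluct[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        if link(clusters[p], clusters[q]) > cutoff:
            break
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy ensemble: residues 0-1 stay fixed, residues 2-3 shift together.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (7.0, 0.0, 0.0), (8.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (9.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
]
toy_fluct = distance_fluctuation(toy_ensemble)
toy_clusters = agglomerate(toy_fluct, cutoff=0.5)
```

Because only interresidue distances enter the calculation, no structural superposition is needed, which is the property the text highlights. In the toy data the two rigid pairs are recovered as separate clusters because distances within each pair never change, whereas distances between the pairs do.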

As the second step, the motions of each moving cluster were delineated by calculating the principal components (PCs) of the motions relative to the core region for the four loops and domain III (Fig. 2C). Domain III was treated in two ways: as the motion within a protomer (Fig. 2C) and as the motion of the dimer (see Fig. S1 and Supporting Text 1 and 2 for details). The motion of domain III within a protomer is a typical domain motion against the core region; DynDom, a domain motion analysis program, also identifies it as a domain motion [33].

Hydrogen bond between Gly143 and Asn28 regulates the formation of the oxyanion hole: a marker distinguishing active and collapsed states

The conformational change of the C-loop (residues 138-143) between the active and collapsed states switches the catalytic activity of 3CLpro on and off by forming and collapsing the oxyanion hole that stabilizes the catalytic intermediate [25, 34]. This activation mechanism of the C-loop conformational change is largely identical to that of zymogen activation in serine proteases, such as chymotrypsinogen (inactive/collapsed) to chymotrypsin (active) [35], although the activation of 3CLpro also requires dimerization. The representative active and collapsed structures are shown in Fig. 3A. The comparison of these structures suggests that the hydrogen bond (HB) between Gly143 and Asn28 is a sensitive signature of the oxyanion hole (hydrogen bonds are defined by LigPlot [36]). In the active state, HB 143-28 together with HB 145-28 correctly arranges the main-chain NH groups of Gly143 and Cys145 to form the oxyanion hole. When the C-loop is collapsed or Gly143 moves downward as shown in the right panel of Fig. 3A (PC1(C) in Fig. 2C is drawn conversely as the activation process), HB 143-28 is cleaved but HB 145-28 remains intact. Consequently, Gly143 is separated from Cys145 to break down the oxyanion hole. As shown below, the formation of HB 143-28 is highly correlated with the C-loop conformation. In contrast, HB 145-28 exists in almost all chains (257 of 258 chains), with the exception of the mutant N28A (PDB:3fzd [37]), as both Cys145 and Asn28 belong to the stable dimer interface region. The importance of Asn28 is confirmed by the mutant N28A, in which the activity is completely lost and the affinity for dimerization is largely impaired [37]. Therefore, in the present study, we define the conformational states of the C-loop by HB 143-28: the C-loop is in the active state when HB 143-28 exists and in the collapsed state when HB 143-28 is absent.
The roles of the two hydrogen bonds, HB 143-28 and HB 145-28, may correspond to those of the salt bridge between Asp194 and Ile16 (existing only in chymotrypsin; Ile16 becomes the N-terminus through autolysis) and the hydrogen bond between Ser195 and Gly43 (existing in both chymotrypsin and chymotrypsinogen) [35]. In fact, any hydrogen bond listed in Table 1 (explained below), particularly HB 140A-1B formed along with dimerization, can be assigned to correspond to HB 194-16 in chymotrypsin. Figure 3A also shows that the catalytic dyad, Cys145 and His41, maintains the close arrangement in both states. Indeed, the distance between the two residues fluctuates by only a small extent (see the distance histogram in Fig. S3) because His41 is located at the less mobile hinge region of the H-loop.

When expanding the focus from the oxyanion hole to the whole C-loop, the conformation is described by various structural characteristics: the continuous principal component PC1 for the C-loop, PC1(C) (Fig. 2C), and five hydrogen bonds between the C-loop and residues surrounding the C-loop, namely HB 138-172, HB 139-126, HB 140A-1B (A and B indicate the two protomers), and HB 141-118, along with HB 143-28 (Fig. 3A). Among these HBs, HB 138-172 and HB 140A-1B are the hydrogen bonds with the moving clusters, i.e., the E-loop and the flexible fragment of the N-finger coupled with dimerization, respectively. These dynamic couplings with the C-loop are discussed below. The formation of these hydrogen bonds in each conformational state of the C-loop is summarized in Table 1; their formation was highly correlated with that of HB 143-28 (as indicated by the rate of agreement at the bottom of Table 1), except for HB 140-1 (explained below). Therefore, they represent almost the same structural information on the C-loop as that of HB 143-28; in other words, the conformational change occurs cooperatively involving these HBs as a structural transition. PC1(C) can be divided into two categories with the threshold value of −1, namely the active state for PC1(C) > −1 and the collapsed state for PC1(C) < −1; the threshold value was roughly optimized to give a high correlation. Figure 3B shows the distribution of PC1(C), in which the active state has a definite value, whereas the collapsed state shows large variation. Thus, the structural change in the C-loop between the two states can be regarded as an order-disorder transition (see Supporting Text 3 for details).
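The two-state assignment and the "rate of agreement" can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the five toy chains are invented, and treating a value exactly equal to the threshold as collapsed is an assumption, since the text defines only PC1(C) > −1 and PC1(C) < −1.

```python
def active_by_pc1(pc1_c, threshold=-1.0):
    """True (active) if PC1(C) exceeds the threshold, else False (collapsed).
    A value exactly at the threshold is treated as collapsed (assumption)."""
    return pc1_c > threshold


def agreement_rate(reference, other):
    """Fraction of chains for which two boolean state indicators coincide."""
    if len(reference) != len(other):
        raise ValueError("indicator lists must have the same length")
    matches = sum(r == o for r, o in zip(reference, other))
    return matches / len(reference)


# Hypothetical toy data for five chains (not taken from Table 1).
toy_hb_143_28 = [True, True, False, True, False]   # reference indicator
toy_pc1_c = [0.2, -0.5, -3.1, -1.4, -2.0]          # PC1(C) per chain
toy_active = [active_by_pc1(v) for v in toy_pc1_c]
toy_agreement = agreement_rate(toy_hb_143_28, toy_active)
```

The same `agreement_rate` helper would apply to any of the other hydrogen bonds once their presence or absence per chain is tabulated as a boolean list.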

A rather poor agreement, 0.81, was found for the interprotomer HB 140-1. It has been reported that HB 140-1 is involved in an important interprotomer interaction coupled with dimerization to stabilize the active state, as is another interprotomer hydrogen bond between Glu166A and Ser1B that occurs concurrently with HB 140-1 (160 chains have both HB 140-1 and HB 166-1; 21 chains have only HB 140-1; and 5 chains have only HB 166-1) [25, 38]. Nevertheless, as shown in Table 1, fewer active chains have HB 140-1 than have the other HBs. This can be explained by the uncharged NH group of Ser1 produced by some amino acids appended to Ser1 [39] (data are summarized in Supporting data S1 and S2). The uncharged N-terminus weakens the interaction with Phe140 O and destabilizes the polar contact (83% (= 54/65) of chains with the uncharged N-terminus do not have HB 140-1); it mimics the state prior to the cleavage of the polyprotein at the N-terminus. When only the chains with the innate charged N-terminus are counted in the statistics, the agreement with HB 143-28 increases up to 0.95, i.e., the same level as the other indices.

It is still necessary to clarify the reason why many chains in the active state do not need stabilization via HB 140-1 (Table 1). This situation differs between ligand-free chains and ligand-bound chains. For ligand-free chains, 10 of 13 active/ligand-free chains without HB 140-1 lose two more hydrogen bonds of the three HBs, 138-172, 139-126, and 141-118, on average (Supporting data S2). Thus, the absence of HB 140-1 destabilizes the C-loop to produce a marginally active state. The active/ligand-bound chains without HB 140-1 show a completely different feature, i.e., the other four HBs remain intact as well as PC1(C) > −1 (Supporting data S1). This can be explained by ligand-induced activation [40, 41]; the ligand interactions with the C-loop always contain interactions between a moiety mimicking the main-chain carbonyl group of the P1 site and the main-chain NH groups of Gly143 and Cys145, indicating that Gly143 and Cys145 are in the position of the active state. Table 1 shows that the probability of the chains found in the active state is 0.98 (= 166/(166+4)) in ligand-bound chains, whereas the probability decreases to 0.73 (= 64/(64+24)) in ligand-free chains. Stabilization by ligand molecules is also confirmed by the collapsed/ligand-bound chains; they are in the collapsed state because they do not have any ligand interaction at the C-loop (Supporting data S1; see the discussion on ligand binding below).
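The two probabilities quoted from Table 1 can be checked with a short worked calculation. The counts are those given in the text; the function and variable names are introduced here only for illustration.

```python
def fraction_active(n_active, n_collapsed):
    """Probability of finding a chain in the active state."""
    return n_active / (n_active + n_collapsed)


# Counts quoted in the text from Table 1.
p_bound = fraction_active(166, 4)    # ligand-bound chains
p_free = fraction_active(64, 24)     # ligand-free chains
n_total = 166 + 4 + 64 + 24          # all independent chains in the ensemble
```

Rounded to two decimals, `p_bound` and `p_free` reproduce the values 0.98 and 0.73 in the text, and the four counts sum to the 258 independent chains of the ensemble.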

The variety of ligand binding is another subject of the crystal structure ensemble; the dataset contains 167 chains complexed with 92 different ligands (Supporting data S1). Ligand binding was analyzed in terms of the ligand binding sites defined by LigPlot [36]. As shown in Supporting data S3, binding is characterized by whether each residue has polar/nonpolar/covalent interactions with the ligand; 25 residues in 3CLpro are identified as the major binding sites shared by 29-150 chains of the 167 ligand-bound chains. The major binding sites consist of hydrogen bonding residues known as the substrate binding subsites (S1-S6) [25, 34] and their neighboring residues forming the nonpolar contacts. Another 16 minor sites (see the caption of Supporting data S3) were found adjacent to the major binding sites and contained only 51 ligand-residue contacts in total. Notably, the major binding sites mostly overlap with the moving clusters defined by the Motion Tree, i.e., the C-loop, E-loop, Linker, and H-loop, and these binding sites are named as the following five "binding clusters": the dimer interface region, H-loop, C-loop, E-loop, and Linker with some minor changes to the assignment (Supporting data S3). This observation indicates that the ligand binding sites, composed of the four moving loops, have a highly dynamic character that contrasts with a druggable rigid binding pocket; thus, it may not be easy to achieve high affinity without a covalent interaction at Cys145 (139 of the 167 ligand-bound chains have covalently bound ligands; Table 2A and Supporting data S3).

To illustrate the structural details of ligand binding, we classified the ligands into three types: the peptide substrate, peptide-mimic compound, and nonpeptide compound (see the caption of Table 2A for the definitions). As shown in Table 2A, the peptide/peptide-mimic compounds have more binding sites than the nonpeptide compounds, which is due to the size difference (average molecular weight: 578 for peptide/peptide-mimic compounds; 322 for nonpeptide compounds). The representative structures of the complexes are shown in Fig. S5. The peptide substrates are recognized by hydrogen bonds with the subsites: S1:C-loop (F140, G143, S144, and C145); E-loop (H163 and H164); H-loop (H41); S3:E-loop (E166); S4:Linker (Q189); and S6:Linker (Q192) (Fig. S5A). The peptide-mimic compounds are also bound at the peptide moieties by the subsites, although the hydrophobic side-chains do not necessarily have a definite orientation (Fig. S5B). In contrast, the nonpeptide compounds show a large variety of binding poses (Fig. S5C). However, when the focus is placed on the polar interactions, the subsites correctly make hydrogen bonds with the polar atoms of the nonpeptide compounds (Fig. S5D). Consequently, the variation in binding sites is strictly limited to a small set of the subsites and their adjacent nonpolar residues.

We also examined the influence of ligand binding on the structure of 3CLpro. The effects on the C-loop have been discussed above; here, the conformations of the E-loop and Linker are examined. The conformation of the H-loop is not influenced by ligand binding but does exhibit intrinsic flexibility (Fig. S6). Figure 4A plots the PC1 values of the E-loop against those of the Linker, the distributions of which are not correlated (correlation coefficient: 0.181; Table S1). However, when the conformations of the E-loop and Linker are classified in terms of the binding cluster (Supporting data S3), a weak but definite binding pose dependence of the conformation is observed. Here, the binding poses are classified as "EL" (binding both the E-loop and Linker), "E" (binding only the E-loop), "w" (weak; binding neither the E-loop nor Linker), and ligand-free (see the caption of Supporting data S3 for the definition). A one-dimensional histogram drawn along the collective variable, PC1(E)-PC1(L), shows the binding pose dependence more clearly (Fig. 4B). The ligand-free, "w," "E," and "EL" poses appear in the histogram in ascending order of PC1(E)-PC1(L). The histograms of the ligand-free and "w" poses almost overlap because the ligand in the "w" pose interacts with neither the E-loop nor the Linker. The four monomeric crystal structures, however, are situated at the smallest extreme of the histogram because the largely skewed position of domain III gives PC1(L) largely negative values (Supporting data S2). These data indicate that the value of PC1(E)-PC1(L) increases as the binding sites expand further onto the E-loop and Linker. The representative structures of these groups show that when the collective variable increases, the E-loop shifts downward to accommodate larger ligands (Fig. 4D).
At the same time, the N-terminal (upper) part of the Linker shifts inward to make more interactions with the ligand, whereas the C-terminal (lower) part that does not participate in binding moves outward. The collective variable PC1(E)-PC1(L) correctly represents these motions in one dimension. As shown in Fig. 4C, the motion along the collective variable PC1(E)+PC1(L), perpendicular to PC1(E)-PC1(L), is an opening/closing motion of the two loops that exhibits almost no binding pose dependence. This suggests that PC1(E)+PC1(L) represents the intrinsic fluctuations. The representative structures are shown in Fig. S7. The majority of the nonpeptide compounds have the binding pose "w" because of their small size (Table 2B).
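The two collective variables are simply the difference and the sum of the loop PC1 values, i.e., projections onto the perpendicular directions (1, −1) and (1, 1) in the PC1(E)/PC1(L) plane. The sketch below shows how chains could be grouped by binding pose along the difference coordinate; the function name and all numerical values are hypothetical and chosen only to mirror the qualitative ordering described in the text.

```python
from statistics import mean


def collective_variables(pc1_e, pc1_l):
    """Return (PC1(E) - PC1(L), PC1(E) + PC1(L)) for one chain."""
    return pc1_e - pc1_l, pc1_e + pc1_l


# Hypothetical (pose, PC1(E), PC1(L)) triples; not data from the paper.
toy_chains = [
    ("free", -1.0, 0.5), ("free", -0.8, 0.6),
    ("w", -0.6, 0.3),
    ("E", 0.5, 0.2),
    ("EL", 1.2, -0.9), ("EL", 1.0, -1.1),
]

# Collect the difference coordinate per binding pose.
diff_by_pose = {}
for pose, e, l in toy_chains:
    diff, _ = collective_variables(e, l)
    diff_by_pose.setdefault(pose, []).append(diff)

# Poses sorted by the mean value of PC1(E) - PC1(L).
pose_order = sorted(diff_by_pose, key=lambda p: mean(diff_by_pose[p]))

# The two projection directions are perpendicular: (1, -1) . (1, 1) = 0.
dot_product = 1 * 1 + (-1) * 1
```

With real data, the per-pose lists in `diff_by_pose` would be the input to histograms of the kind shown in Fig. 4B.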

We also investigated the influence of the E-loop on the C-loop, or on catalytic activity. As shown in Table 1, one of the hydrogen bonds stabilizing the active state of the C-loop, HB 138-172, represents the direct interaction between the C-loop (Gly138) and E-loop (His172). As explained above, the E-loop makes downward motions depending on the size of the bound ligand (the larger the ligand becomes, the more the E-loop shifts downward). Figure 5A shows the representative structures with the E-loop situated at the lower position due to ligand binding and with the E-loop at the upper position in the ligand-free chain (PDB: 7brpA and 7bro, respectively); the former structure forms a hydrogen bond between His172 and Gly138, whereas the latter structure does not form this bond. Statistics for the PC1(E) dependence of the hydrogen bond formation are shown in Fig. 5B; a monotonic increase was observed for the probability of formation of HB 138-172 with increasing PC1(E) (the downward motion). Simultaneously, the probability of the C-loop being in the active state increases. To avoid the influence of ligand-induced activation, we also calculated the same quantities for ligand-free chains. The difference between the values for all chains and those for the ligand-free chains can be ascribed to the effect of ligand-induced activation, but the increase with PC1(E) is found in both sets of chains. Therefore, the downward motion of the E-loop stabilizes HB 138-172, which then stabilizes the active state of the C-loop. The interrelation among the three features, the bound ligand size, the motion of the E-loop, and the formation of HB 138-172, suggests that the activity of 3CLpro is maximized for large native substrates.

In the above sections, we did not distinguish between SARS-CoV 3CLpro and SARS-CoV-2 3CLpro in our analysis of the crystal structure ensemble. However, since there are 12 amino acid alterations between the two 3CLpro species, the influence of the mutations is discussed in this section. Ten of the twelve mutation sites are mostly located on the surface of the core part of domains I and II (solvent accessibility: ~0.8); therefore, they have no substantial influence on the structure. However, two mutations, T285A and I286L, are located on the interface of the domain III dimer and have substantial effects on structure. Figure 6A shows the distributions of the interprotomer Cα distance between Thr(Ala)285A and Thr(Ala)285B; the distance in SARS-CoV-2 3CLpro is much shorter than that in SARS-CoV 3CLpro, as has already been observed in the structure of a triple mutant S284-T285-I286/A of SARS-CoV 3CLpro [42]. In Fig. 6C, the representative configurations of the interface of the domain III dimer are compared; the interprotomer hydrophobic contacts are formed among Ala285A(B), Ala285B(A), and Leu286B(A) in SARS-CoV-2 3CLpro, whereas Thr285 of SARS-CoV 3CLpro is distant from its counterpart. The smaller size of the alanine side-chain relative to that of threonine enables a shorter interprotomer distance. Furthermore, the hydrophobic packing of a pair of the alanine/leucine residues has greater affinity than that of a weak hydrogen bond between the two hydroxyl groups of threonine (a hydroxyl group may preferably make a hydrogen bond with water).

We now focus on the details of the dynamics that occur along with the change in distance 285-285. As shown in Fig. 6C, Thr(Ala)285 is in a 16-residue-long loop (residues 276-291). However, as illustrated in the Motion Tree (Fig. 2A), the fragment involving Thr(Ala)285 (residues 280-287) is in a section of the moving cluster that constitutes the rigid core of domain III. Thus, this fragment is rigid and does not change the conformation independently of domain III, probably due to its winding shape with intraloop hydrogen bonds. Therefore, the difference in distance is not caused by the internal motion of the loop but rather by the rigid body motion of domain III. The respective distributions of PC2(domain III dimer), a parameter describing the configuration of the domain III dimer (Fig. S1), of SARS-CoV 3CLpro and SARS-CoV-2 3CLpro are largely separated, similar to the distributions of distance 285-285 (Fig. 6B); domain III of SARS-CoV-2 3CLpro has more closed arrangements relative to the open arrangements of SARS-CoV 3CLpro. Indeed, the mode structure of PC2(domain III dimer) agrees well with the direction of motion in Thr(Ala)285 departing from the counterpart of the other protomer (Fig. S1). In Fig. S8, a clear correlation between PC2(domain III dimer) and distance 285-285 is also shown.

We investigated whether the difference in the configuration of the domain III dimer, observed between SARS-CoV-2 3CLpro and SARS-CoV 3CLpro (Fig. 6A), affected the conformation of the C-loop or influenced catalytic activity. First, we analyzed the experimental data. Kinetic experimental data [24, 43, 44] indicate that SARS-CoV-2 3CLpro has a larger catalytic efficiency than SARS-CoV 3CLpro, although the kinetic parameters differ greatly among the three studies: 3-fold greater efficiency [43] and slightly greater efficiency [24, 44]. Furthermore, it was reported that the T285A mutant of SARS-CoV 3CLpro had ~1.4-fold higher activity than the wild-type and that the triple mutant S284-T285-I286/A showed ~3.7-fold higher activity [44]. Given that distance 285-285 is reduced to 6.2 Å in the triple mutant S284-T285-I286/A (for the seven mutant dimers), down from the average of 7.7 Å for all SARS-CoV 3CLpro (Fig. 6A), it is reasonable to conclude that the motions of domain III influence catalytic activity.

Based on the analysis of the crystal structure ensemble, we assessed the connection between domain III and the C-loop (Fig. 7A and 7C). The probability of finding the C-loop in the active state, as well as the probability of HB 140-1 being formed, decreases with increasing distance 285-285, particularly over 8.5 Å (Fig. 7A), which clearly indicates the allosteric coupling between domain III and the C-loop. The chains with an uncharged N-terminus (which is highly unlikely to form HB 140-1) accumulate in the region of distance 285-285 over 8.5 Å (Fig. 7A). This observation can be understood as a causal relationship in which the absence of HB 140-1 due to the uncharged N-terminus induces the opening motion of the domain III dimer.
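A distance-dependent probability of this kind can be obtained by binning chains along distance 285-285 and computing, within each bin, the fraction of chains for which a given feature (the active state, HB 140-1, and so on) is present. The following is a minimal sketch under stated assumptions: the 1-Å bin edges are chosen to match the ranges mentioned in the text, the function name is hypothetical, and the eight toy chains are invented rather than taken from Fig. 7A.

```python
def binned_fraction(distances, flags, edges):
    """For each bin [edges[k], edges[k+1]), return (count, fraction of True flags).
    The fraction is None for an empty bin."""
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        selected = [f for d, f in zip(distances, flags) if lo <= d < hi]
        frac = sum(selected) / len(selected) if selected else None
        result.append((len(selected), frac))
    return result


# Hypothetical toy data: distance 285-285 (in angstroms) and active-state flag.
toy_distance = [5.0, 5.2, 6.1, 6.8, 7.9, 8.7, 9.1, 9.4]
toy_is_active = [True, True, True, True, False, True, False, False]
toy_edges = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5]  # assumed bin edges

toy_profile = binned_fraction(toy_distance, toy_is_active, toy_edges)
```

Returning the count alongside the fraction makes it easy to suppress bins with too few chains to support a meaningful estimate.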

Here, we propose a possible scenario to explain these observations, using the structures illustrated in Fig. 7C together with a variety of experimental evidence. The uncharged N-terminus tends to break HB 140-1, as discussed above; thus, it destabilizes the active state in the C-loop. However, since neither Phe140 nor Ser1 directly interacts with domain III, it is necessary to identify a factor linking domain III and HB 140-1. As a possible factor, we identified an intraprotomer/interdomain hydrogen bond between Asn214 OD1 and Gly2 N (HB 214-2). This unique interdomain interaction forms and breaks in accordance with the position of domain III or distance 285-285; the opening motion along PC2(domain III dimer) separates Asn214 from Gly2 (Fig. S1), and distance 214-2 is well correlated with PC2(domain III dimer) (Fig. S9A). As shown in Fig. 7A, the probability of formation of HB 214-2 decreases with increasing distance 285-285. In contrast, the other interdomain interactions occurring at the hinge region of the domain motion are stably maintained almost independently of the domain III position (Fig. S9B). Although these stable interactions do not operate as a switch, they contribute significantly to stabilizing the dimer structure; the mutations at the residues illustrated in Fig. S9B (Arg4, Ser123, Ser139, Glu290, Arg298, and Gln299) impair dimerization and catalysis [46, 47], and the mutations R298A (PDB:2qcy and 3m3t) and S139A (PDB:3f9e) produce the monomeric crystal structures.

The connection between HB 140-1 and HB 214-2 is explained as follows. The absence of HB 140-1 allows Ser1 to move freely and to be separated from Phe140. This conformational change is accompanied by a shift of Gly2 that destabilizes HB 214-2 (the probability of formation of HB 214-2 decreases from the unconditional value of 0.59 (= 150/254) for all chains to 0.19 (= 14/73) under the condition without HB 140-1). Finally, the influence of HB 214-2 on the domain III dimer can be observed in the mutant N214A, which cannot form HB 214-2. The structures of N214A (PDB: 2qc2 and 3m3s; both are SARS-CoV 3CLpro) have large values of distance 285-285 (8.6 Å and 8.4 Å, respectively; the average distance of SARS-CoV 3CLpro is 7.7 Å). These structures suggest that the loss of HB 214-2 tends to open the domain III dimer. In summary, we observed the following relationship: the absence of HB 140-1 decreases the probability of formation of HB 214-2, and the absence of HB 214-2 in turn opens the interface of the domain III dimer.

The uncharged N-terminus is chemically unchangeable and makes the absence of HB 140-1 independent of the other components such as the configuration of the domain III dimer and HB 214-2. Therefore, the role of HB 140-1 in functional regulation is most evidently observed in chains with an uncharged N-terminus (Fig. 7A). Conversely, with the natively charged N-terminus, the occurrence of HB 140-1 changes reversibly under the influence of other components. This more complicated situation is the one that occurs in the native condition and should therefore also be investigated. Furthermore, the ligand-induced activation is another factor obscuring the influence of domain III on the C-loop because the C-loop conformation is determined by interactions with the ligand molecule. Hence, we used the ligand-free chains with charged N-termini to recalculate the quantities shown in Fig. 7A, although the number of chains was significantly reduced from 254 to 62. Figure 7B shows the results of the recalculation (the values for > 8.5 Å are not shown because the number of chains in this distance range was not sufficient to calculate statistical quantities). The fractions of chains with HB 140-1 and HB 214-2 were shown to decrease with distance 285-285; these values did not differ largely from those shown in Fig. 7A. However, the fraction of chains in the active state, or chains with HB 143-28, clearly showed a monotonic decrease with distance 285-285 from 1.0 (distance < 5.5 Å) to 0.69 (~7.5-8.5 Å); these data contrast with those in Fig. 7A showing values kept close to unity due to ligand-induced activation. Overall, these data clearly demonstrate that the opening motion of the domain III dimer has a destabilizing effect on the active state of the C-loop through the dissociation of HB 214-2 and HB 140-1.

Based on the results presented above, we compared the frequencies of hydrogen bond formation in SARS-CoV 3CLpro and SARS-CoV-2 3CLpro using the 62 ligand-free chains with a natively charged N-terminus (Table 3). We found that the closed configuration of the domain III dimer in SARS-CoV-2 3CLpro results in a probability of formation of HB 214-2 that is 0.54 greater than that for SARS-CoV 3CLpro, as well as a 0.25 increase in the probability of formation of HB 140-1 and a 0.14 increase in the probability of occurrence of the active state. Although the influence is reduced by about half at each interaction step connecting the four structural elements (the domain III dimer, HB 214-2, HB 140-1, and the C-loop), our analyses suggest that SARS-CoV-2 3CLpro has slightly increased activity over that of SARS-CoV 3CLpro, which is largely consistent with the experimental data.
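The statement that the influence is roughly halved at each step can be checked directly from the three probability differences quoted above. The short sketch below only divides consecutive differences; the dictionary keys are labels chosen here for illustration.

```python
# Differences in probability (SARS-CoV-2 minus SARS-CoV) quoted in the text
deltas = {
    "HB 214-2": 0.54,
    "HB 140-1": 0.25,
    "active state (HB 143-28)": 0.14,
}

names = list(deltas)
# Ratio of each difference to the one at the preceding step of the chain
ratios = [deltas[b] / deltas[a] for a, b in zip(names, names[1:])]

for (a, b), r in zip(zip(names, names[1:]), ratios):
    print(f"{a} -> {b}: ratio = {r:.2f}")
```

Both ratios fall near 0.5, which is the sense in which the effect attenuates by about half per step.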

The crystal structure ensemble, consisting of 258 independent chains, describes the structural dynamics of SARS-CoV 3CLpro and SARS-CoV-2 3CLpro and elucidates the allosteric regulation of catalytic function. The structural dynamics are characterized by the motions of the four loops (the C-loop, E-loop, H-loop, and Linker) and domain III on the rigid core. Among the four loops, the C-loop undergoes the order (active)-disorder (collapsed) transition, which is regulated cooperatively by the five hydrogen bonds with the surrounding residues. Three of the loops, the C-loop, E-loop, and Linker, constitute the major ligand binding sites with a limited variety of binding residues including the subsites. Ligand recognition at the main-chain NH groups of Gly143 and Cys145 induces the formation of an oxyanion hole-like structure to produce the active conformation of the C-loop (i.e., ligand-induced activation). Ligand binding also causes ligand size dependent conformational changes in the E-loop and Linker, which further stabilize the C-loop through HB 138-172. Mutation T285A from SARS-CoV 3CLpro to SARS-CoV-2 3CLpro significantly closes the interface of the domain III dimer and affects the stability of the C-loop conformation allosterically via HB 140-1 and HB 214-2. Because of this allosteric regulation, the closed arrangement of the domain III dimer in SARS-CoV-2 3CLpro increases the stability of the active state of the C-loop and yields a slightly higher activity than that of SARS-CoV 3CLpro.

As a reference for the present results, the crystal structures of MERS-CoV 3CLpro, the 3CLpro of Middle East respiratory syndrome-related coronavirus (2012), which belongs to the same protease family (C30), were analyzed in a similar manner as for SARS-CoV and SARS-CoV-2 3CLpro. The structural properties were found to be basically the same as those of SARS-CoV and SARS-CoV-2 3CLpro. The details are summarized in Supporting text 4.

As a scheme for the analysis of the crystal structure ensemble, the identification of the overall dynamic structure should precede the PCA, and the PCA is then applied separately to the various moving parts of the protein. This scheme avoids applying the PCA to the whole protein molecule. The reason is that, owing to the orthonormality condition in the eigenvalue problem, the PCA tends to produce modes in which the mode vector has non-zero elements at all atoms considered in the analysis, representing correlated motion extending over the whole molecule. It is therefore difficult to describe a localized motion by the PCA of the whole protein molecule. The Motion Tree, which uses the variance of the residue distances, enables us to identify moving clusters of any size, from the domain level to the residue level, without any prior knowledge (see below and Fig. 2A). Each of the moving clusters thus found is then separately subjected to the PCA (Fig. 2C and Fig. S1). Here, it is important that the PCA does not exclude the translational and rotational motions of the moving clusters; the motion should be defined as a relative motion against the core region of the protein via superimposition onto the core region.
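The last point, PCA of one moving cluster after superimposition onto the core only, can be sketched as follows. This is an illustrative outline with NumPy, not the authors' implementation: the function names are hypothetical, the reference frame is arbitrarily taken as the first structure of the ensemble, and the superposition uses the standard Kabsch least-squares fit.

```python
import numpy as np

def superpose_on_core(coords, ref, core_idx):
    """Rigid-body fit of coords (n_atoms, 3) onto ref using the core atoms only."""
    mob_center = coords[core_idx].mean(axis=0)
    ref_center = ref[core_idx].mean(axis=0)
    # Kabsch: SVD of the covariance between centered core coordinates
    H = (coords[core_idx] - mob_center).T @ (ref[core_idx] - ref_center)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))  # avoid an improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    # The whole molecule is transformed, so the cluster keeps its
    # translation and rotation relative to the core.
    return (coords - mob_center) @ R + ref_center

def cluster_pca(ensemble, core_idx, cluster_idx):
    """PCA of one moving cluster, measured relative to the core region.

    ensemble : array (n_structures, n_atoms, 3)
    Returns eigenvalues (descending) and eigenvectors (columns).
    """
    ref = ensemble[0]  # assumption: first structure defines the frame
    fitted = np.array([superpose_on_core(x, ref, core_idx) for x in ensemble])
    X = fitted[:, cluster_idx, :].reshape(len(ensemble), -1)
    X = X - X.mean(axis=0)
    cov = X.T @ X / len(ensemble)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]
```

Because the fit uses the core atoms alone, a rigid displacement of the cluster relative to the core survives the superposition and appears in the leading principal component instead of being fitted away.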

The crystal structure ensemble was constructed for 3CLpro of SARS-CoV-2 and SARS-CoV based on the PDB as of 10/25/2020. The following entries were not used in the analysis: the entries from the PanDDA analysis (115 entries), the entry with domain swapping (PDB: 3iwm), and the entries containing only domain III (PDB: 2k7x, 2liz, and 3ebn). The compiled data contain 83 entries/113 independent chains for SARS-CoV-2 3CLpro and 101 entries/145 independent chains for SARS-CoV 3CLpro. These are listed in Supporting data S1 and S2 for the ligand-bound and ligand-free entries, respectively. The data released after 10/25/2020 and until 7/25/2021 are summarized in Supporting data S4, S5 and S6, which correspond to Supporting data S1, S2 and S3, respectively (SARS-CoV-2 3CLpro: 154 entries and 226 chains; SARS-CoV 3CLpro: 5 entries and 6 chains). The analyses of Supporting data S4, S5 and S6 are summarized in Fig. S10.

We developed a method to define the building blocks that move cooperatively. The method applies hierarchical clustering to interresidue distances (for pairwise comparisons) or their variances (for the comparison of many entries) and constructs a dendrogram, namely the Motion Tree [27, 28].

The Motion Tree illustrates, in a hierarchical manner, a pair of clusters at each node that move relative to each other with an amplitude given by the tree height of the node, named the "Motion Tree (MT) score." Because the method applies directly to the structure ensemble without the need for a structural superposition procedure, a comprehensive understanding of the structural dynamics of various protein molecules can be achieved [17, 48-50].
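The idea of a dendrogram whose node heights measure the relative motion of two clusters can be illustrated with a generic agglomerative clustering of a residue-by-residue fluctuation matrix. The sketch below is not the published Motion Tree algorithm: the function name is hypothetical, and average linkage is an assumption made here for simplicity (the original method may use a different linkage and additional criteria).

```python
def build_tree(D):
    """Agglomerative clustering of a symmetric n x n matrix D (list of lists).

    Returns the merges in order as (members_a, members_b, height), where
    height plays the role of the score of the node joining the two clusters.
    """
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    next_id = len(D)

    def link(a, b):
        # Average linkage (assumption): mean of D over all cross pairs
        return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        height, p, q = min(
            (link(clusters[p], clusters[q]), p, q)
            for k, p in enumerate(ids)
            for q in ids[k + 1:]
        )
        a, b = clusters.pop(p), clusters.pop(q)
        merges.append((a, b, height))
        clusters[next_id] = a + b
        next_id += 1
    return merges

# Toy matrix: residues 0-2 move as one rigid block, residues 3-4 as another
groups = [0, 0, 0, 1, 1]
toy_D = [[0.0 if i == j else (0.1 if groups[i] == groups[j] else 2.0)
          for j in range(5)] for i in range(5)]
for a, b, h in build_tree(toy_D):
    print(sorted(a), sorted(b), round(h, 3))
```

In the toy case the last merge joins the two rigid blocks at a height of 2.0, while all merges within a block occur at 0.1; the top node therefore separates the two groups that move relative to each other.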

We compared 258 chains in a Motion Tree using the variance-based scheme. The distance fluctuation, $\{D_{mn}\}$, used as the metric for hierarchical clustering, is calculated as $D_{mn} = \langle \Delta d_{mn}^2 \rangle^{1/2}$, where $d_{mn}$ is the distance between the Cα atoms of residues m and n, $\Delta d_{mn}$ is the associated deviation from the mean distance, and $\langle \dots \rangle$ is the average over the structural ensemble. We did not include the highly mobile Ser1 and Gly2, or the C-terminal residues 301-306, in the analysis because these residues are often in the list of missing residues. Since 3CLpro is in the homodimeric form, $D_{mn}$ and the resulting clusters have to be symmetrical upon the exchange of the two protomers. However, asymmetric dimers exist in the crystal structure ensemble; they are considered to be under the influence of crystal packing or in different states of ligand binding. To remove these influences, $D_{mn}$ was symmetrized using the duplicated structures of AB and BA, where AB is the original dimer and BA is the dimer with the protomers exchanged. Because of the symmetrization, two equivalent clusters corresponding to each protomer exist in the Motion Tree. This symmetrizing operation was also applied to the calculation of the PCs for the domain III dimers.

*These values were obtained by LigPlot [36].

# Peptide-mimic compounds are defined as ligands containing more than two peptide moieties; a carbonyl group may be replaced by an alcohol, and Cα may be part of an aromatic group.

$ The number in parentheses is the number of chains with covalently bound ligands. The number of nonpeptide ligands includes those bound at residues other than Cys145.

The probability was calculated as the number of chains having the HB (the number in parentheses) divided by the total number of chains listed in the column "All". The column "active" lists the chains having HB 143-28, as in the above definition.
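The distance-fluctuation metric $D_{mn}$ and its AB/BA symmetrization, as described for the variance-based Motion Tree, can be sketched in plain Python. The function names are hypothetical, the coordinates are toy Cα positions, and each dimer is assumed to be stored as a list of points with protomer A's residues first and protomer B's second.

```python
import math

def distance_matrix(coords):
    """All pairwise distances for one structure (list of (x, y, z) tuples)."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def fluctuation_matrix(ensemble):
    """D_mn = sqrt(<(d_mn - <d_mn>)^2>), averaged over the ensemble."""
    mats = [distance_matrix(c) for c in ensemble]
    n, k = len(mats[0]), len(mats)
    D = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for j in range(n):
            mean = sum(d[m][j] for d in mats) / k
            D[m][j] = math.sqrt(sum((d[m][j] - mean) ** 2 for d in mats) / k)
    return D

def swap_protomers(coords, n_res):
    """BA copy of an AB dimer: protomer B's residues first, then A's."""
    return coords[n_res:] + coords[:n_res]

def symmetrized_fluctuation(dimers, n_res):
    """Fluctuation matrix over the ensemble doubled with protomer-swapped copies."""
    doubled = list(dimers) + [swap_protomers(c, n_res) for c in dimers]
    return fluctuation_matrix(doubled)

# Toy ensemble: two-residue protomers; only protomer A changes between structures
toy_dimers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
]
raw = fluctuation_matrix(toy_dimers)
sym = symmetrized_fluctuation(toy_dimers, n_res=2)
print(raw[0][1], raw[2][3])  # asymmetric: A fluctuates, B does not
print(sym[0][1], sym[2][3])  # equal after symmetrization
```

No superposition is needed at any point, since only internal distances enter the calculation, and doubling the ensemble guarantees that equivalent residue pairs in the two protomers receive the same value.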
; the probability of finding a chain with the uncharged N-terminus due to some appended amino acids (solid brown curve with diamonds); the probability of finding a chain with HB 140-1 (broken blue curve with squares); the probability of finding a chain with HB 214-2 (dotted blue curve with triangles). The total number of chains in each interval of distance 285-285 is, in ascending order, 59, 57, 32, 78, 13, and 15. (B) As in (A), but for ligand-free chains with natively charged N-termini without an appended amino acid. The intervals of 8.5 ~ 9.5 and > 9.5 are not presented because the number of chains in these intervals is not sufficient to calculate statistics. The total number of chains in each interval of distance 285-285 is, in ascending order, 18, 18, 6, and 16. (C) Two 3CLpro structures that explain the scenario for the accumulation of chains with uncharged N-termini at large distance 285-285, drawn after superposition at the core region. These structures are PDB: 6lu7 (SARS-CoV-2), with A chain (green) and B chain (cyan) and distance 285-285 = 5.314 Å, and PDB: 7kfi (SARS-CoV-2), with A chain (salmon) and B chain (yellow) drawn with transparency and distance 285-285 = 9.858 Å. Structures are superimposed at the core region of 6lu7A and 7kfiB. Only the key parts are illustrated. The entry 6lu7 has a natively charged N-terminus, whereas 7kfi has an appended sequence at the N-terminus (Gly(-2), Ala(-1), and Met0) drawn as lines. 6lu7 has both HB 140-1 and HB 214-2 (red broken lines). However, 7kfi has neither HB 140-1 nor HB 214-2, and the C-loop is collapsed (Phe140 is oriented in the other direction) because of the uncharged N-terminus. The flexible Ser1 of 7kfi induces a shift in the position of Gly2 that breaks HB 214-2. The absence of HB 214-2 causes domain III to move so that the Ala285 residues separate. The upper right diagram shows a schematic of the coupling between the C-loop and domain III via HB 140-1 and HB 214-2.

Highlights

・Dynamics of SARS-CoV/SARS-CoV-2 3CLpro was analyzed using 184 crystal structures.

・Dynamics was identified as the motions of the four flexible loops and the domain III.

・The catalytic activity is regulated by five hydrogen bonds with the catalytic loop.

・Ligand binding causes a ligand size dependent conformational change to the two loops.

・Mutation T285A between the two coronaviruses affects both structure and activity.

[1] High-throughput crystallography for lead discovery in drug design
[2] The genesis of high-throughput structure-based drug discovery using protein crystallography
[3] The process of structure-based drug design
[4] Keynote review: Structural biology and drug discovery
[5] Representativity of target families in the protein data bank: impact for family-directed structure-based drug discovery
[6] How Structural Biologists and the Protein Data Bank Contributed to Recent FDA New Drug Approvals
[7] The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database
[8] An online resource for GPCR structure determination and analysis
[9] Nuclear receptors database including negative data (NR-DBIND): A database dedicated to nuclear receptors binding data including negative data and pharmacological profile
[10] KLIFS: an overhaul after the first 5 years of supporting kinase research
[11] The role of dynamic conformational ensembles in biomolecular recognition
[12] Exploring the role of receptor flexibility in structure-based drug discovery
[13] A large data set comparison of protein structures determined by crystallography and NMR: Statistical test for structural differences and the effect of crystal packing
[14] Homology modeling in drug discovery: current trends and applications
[15] Folding funnels and binding mechanisms
[16] Protein structural change upon ligand binding: Linear response theory
[17] Inter-lobe motions allosterically regulate the structure and function of EGFR kinase
[18] Targeting the dimerization of the main protease of coronaviruses: A potential broad-spectrum therapeutic strategy
[19] The SARS-CoV-2 main protease as drug target
[20] Identification of SARS-CoV-2 3CL protease inhibitors by a quantitative high-throughput screening
[21] Research and development on therapeutic agents and vaccines for COVID-19 and related human coronavirus diseases
[22] Therapeutic options for the 2019 novel coronavirus (2019-nCoV)
[23] Pharmacologic treatments for coronavirus disease 2019 (COVID-19): A review
[24] Structure of M-pro from SARS-CoV-2 and discovery of its inhibitors
[25] The crystal structures of severe acute respiratory syndrome virus main protease and its complex with an inhibitor
[26] Structure of coronavirus main proteinase reveals combination of a chymotrypsin fold with an extra alpha-helical domain
[27] A hierarchical description and extensive classification of protein structural changes by motion tree
[28] Motion tree delineates hierarchical structure of protein dynamics observed in molecular dynamics simulation
[29] pH-dependent conformational flexibility of the SARS-CoV main proteinase (M-pro) dimer: Molecular dynamics simulations and multiple X-ray structure analyses
[30] Severe acute respiratory syndrome coronavirus 3C-like proteinase N terminus is indispensable for proteolytic activity but not for enzyme dimerization: Biochemical and thermodynamic investigation in conjunction with molecular dynamics simulations
[31] Critical assessment of important regions in the subunit association and catalytic action of the severe acute respiratory syndrome coronavirus main protease
[32] The N-terminal octapeptide acts as a dimerization inhibitor of SARS coronavirus 3C-like proteinase
[33] The DynDom database of protein domain motions
[34] Cysteine proteases and their inhibitors
[35] Mechanisms of zymogen activation
[36] LIGPLOT: a program to generate schematic diagrams of protein ligand interactions
[37] Mutation of Asn28 disrupts the dimerization and enzymatic activity of SARS 3CL(pro)
[38] Mechanism for controlling the dimer-monomer switch and coupling dimerization to catalysis of the severe acute respiratory syndrome coronavirus 3C-like protease
[39] Crystal structures of the main peptidase from the SARS coronavirus inhibited by a substrate-like aza-peptide epoxide
[40] Crystal structures reveal an induced-fit binding of a substrate-like aza-peptide epoxide to SARS coronavirus main peptidase
[41] Mutation of Glu-166 blocks the substrate-induced dimerization of SARS coronavirus main protease
[42] Dynamically-driven enhancement of the catalytic machinery of the SARS 3C-like protease by the S284-T285-I286/A mutations on the extra domain
[43] Feline coronavirus drug inhibits the main protease of SARS-CoV-2 and blocks virus replication
[44] Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved alpha-ketoamide inhibitors
[45] The catalysis of the SARS 3C-like protease is under extensive regulation by its extra domain
[46] Quaternary structure of the severe acute respiratory syndrome (SARS) coronavirus main protease
[47] Correlation between dissociation and catalysis of SARS-CoV main protease
[48] Domain motion enhanced (DoME) model for efficient conformational sampling of multidomain proteins
[49] Comprehensive analysis of motions in molecular dynamics trajectories of the actin capping protein and its inhibitor complexes
[50] Allosteric response to ligand binding: Molecular dynamics study of the N-terminal domains in IP3 receptor

Toru Ekimoto and Mitsunori Ikeguchi wrote the paper

The authors declare that they have no conflict of interest.
