key: cord-0555720-joasxqb2 authors: Nguyen, Duc D; Gao, Kaifu; Chen, Jiahui; Wang, Rui; Wei, Guo-Wei title: Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 92 crystal structures date: 2020-05-27 journal: nan DOI: nan sha: c434b5cc31bc9d7b15fe05af1076e08625b91710 doc_id: 555720 cord_uid: joasxqb2 Currently, there is no effective antiviral drugs nor vaccine for coronavirus disease 2019 (COVID-19) caused by acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high conservativeness and low similarity with human genes, SARS-CoV-2 main protease (M$^{text{pro}}$) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of M$^{text{pro}}$ inhibition is limited by the lack of reliable binding affinity ranking and prediction of existing structures of M$^{text{pro}}$-inhibitor complexes. This work integrates mathematics and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 92 SARS-CoV-2 M$^{text{pro}}$ inhibitor structures. We reveal that Gly143 residue in M$^{text{pro}}$ is the most attractive site to form hydrogen bonds, followed by Cys145, Glu166, and His163. We also identify 45 targeted covalent bonding inhibitors. Validation on the PDBbind v2016 core set benchmark shows the MathDL has achieved the top performance with Pearson's correlation coefficient ($R_p$) being 0.858. Most importantly, MathDL is validated on a carefully curated SARS-CoV-2 inhibitor dataset with the averaged $R_p$ as high as 0.751, which endows the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts. Starting in late Dec, 2019, the COVID-19 pandemic caused by new severe acute respiratory syndrome coronavirus (SARS-CoV-2) has infected more than 5.7 million individuals and has caused more than 353,000 fatalities in all of the continents and over 213 countries and territories by May 27th, 2020. To date, there is no specific drug nor vaccine against COVID-19. Under the current global health emergency, researchers around the world have engaged in the investigation of the different drug targets of SARS-CoV-2, such as the main protease (M pro , also called 3CL pro ), papain-Like protease (PL pro ), RNA-dependent RNA polymerase (RdRp), 5 -to-3 helicase protein (Nsp13) to seek potential cures for this serious pandemic. The main protease, one of the best-characterized targets for coronaviruses, attracts lots of research attention because it is very conservative and distinguished from any human gene. A recent study shows that although the overall sequence identity between SARS-CoV and SARS-CoV-2 is just 80%, the M pro of SARS-CoV-2 shares 96.08% sequence identity to that of SARS-CoV [1] . Therefore, we hypothesize that a potent SARS M pro inhibitor is also a potent SARA-CoV-2 M pro inhibitor. At this moment, more than 300 potential SARS-CoV M pro inhibitors with its binding affinities are available in ChEMBL database [2] which can be considered as the potential SARS-CoV-2 M pro inhibitors. Recently, total 94 crystal structures of SARS-CoV-2 M pro with its ligand complexes are released on the Protein Data Bank (PDB) [3] . Among them, 92 crystal structures have no available binding affinities reported for various reasons. However, the central dogma of drug design and discovery concerns the molecular mechanism and binding affinity of drug target interactions. Knowing the binding affinities and their ranking of 92 SARS-CoV-2 M pro inhibitors is of great significance for the future design of anti SARS-CoV-2 drugs. In this work, for the first time, we predict the binding affinities of these 92 M pro -inhibitor complexes by reformulate mathematics-deep learning (MathDL) models, which have been the top competitor in D3R Grand Challenges, a worldwide competition series in computer aided drug design in the past three years [4] . We generate reliable poses for 87 M pro inhibitors with binding affinities but without complex structures. Together with 32 other complexes, we compose a set of 119 M pro -inhibitor complexes, which is paired with 17,382 protein-ligand complexes in PDBbind 2019 general set. These datasets are utilized to construct 11 MathDL models in single-task and multitask settings [4] . One of these 11 MathDL models has been validated by using the PDBbind v2016 core set benchmark, achieving the top performance over all exiting scoring functions. The other ten MathDL models have cross-validated on a set of 119 M pro -inhibitor complexes, showing an averaged Pearsons correlation coefficient of 0.75. In a nutshell, the present work provides reliable binding affinity predictions and ranking of 92 SARS-CoV-2 inhibitors that have crystal structures. It also offers data curation and validated models for exploring potential SARS-CoV-2 M pro inhibitors. Furthermore, this work explores different possible binding regions on the SARS-CoV-2 main protease and decode the most favorable molecular fragments for the inhibitor design. This section is devoted to the utilization of our MathDL models developed in Section 3.3 to predict the binding affinities and their ranking of SARS-CoV-2 inhibitors that do not have reported experimental affinities. To reduce the role of 3D pose prediction errors in our model, we use the SARS-CoV-2 inhibitors with X-ray structures available in the PDB for our study. We manually search these ligands on the PDB and arrive at Table 3 ). In this experiment, we develop a MathDL model optimized from PDBbind v2016 core set (see Section 3.3.1), five MathDL-ALL and five MathDL-MT models obtained from 5-fold study on the SARS-CoV BA set (see Section 3.3.2). The final predicted binding affinity is the consensus of these 11 models. The top ten inhibitors indicated by our models are shown in Table 1 . The top potent SARS-CoV-2 inhibitor found by our MathDL models is Michael acceptor inhibitor N3 in complex 7bqy. Designed by Yang and his colleagues [5] , N3 is found viral activities against different coronavirus M pro such as SARS-CoV and MERS-CoV [5, 6] . Specifically, the dissociation constant K i of N3 was found to be 9.0 µM against SARS-CoV [5] . Our MathDL reveals that N3 still inhibits SARS-CoV-2 main protease with an even better affinity at 0.24 µM. This finding is consistent with the literature work [7] showing that N3 is a potent inhibitor of COVID-19 virus M pro . It is worth pointing out N3 and some other molecules in the SARS-CoV PBD-noBA dataset are covalent inhibitors (see discussion in Section 2.2.2), however our models only predict the non-covalent binding affinity which is measured before the enzyme deactivation. Except for complex 6w63 deposited by Mesecar [8] , the rest of structures reported in Table 1 are from PanDDA [9] . The binding affinities of those PanDDA ligands varying from -7.95 kcal/mol to -8.66 kcal/mol indicate that these fragments can be utilized as starting points when designing more potent SARS-CoV-2 M pro inhibitors. On the other hand, some fragments may not be useful to inhibit the SARS-CoV-2 M pro functions such as U0S (PDBID: 5rgj), T5Y (PDID: 5rf4), and T0V (PDBID: 5re8) with predicted affinity being -3.49 kcal/mol, -4.35 kcal/mol, and -4.55 kcal/mol, respectively. Another factor that can affect their inhibitor abilities is the binding region where its residues and conformation do not bolster the binding mechanism. The predicted binding affinities of all 92 complexes in SARS-CoV PBD-noBA dataset from various MathDL models are presented in Table S8 in Supporting Information. In this table, we also supply the synthetic accessibility score (SAS), partition coefficient log P , and solubility log S for each small molecule. While the SAS and log P are obtained via RDkit [10] , the log S values are calculated by Alog PS 2.1 [11] . Based on the crystal structure information of 92 complexes in SARS-CoV PDB-noBA set, we have identified 13 distinct binding site regions of the SARS-CoV-2 main protease as illustrated in Figure 1 . Those binding pockets are denoted by P i , i = 1, 2, . . . , 13. Figure 2a reveals that binding pocket P 1 is the most common binding region of the SARS-CoV-2 main protease, which attracts around 73.9% of ligands in the SARS-CoV PDB-noBA data set of 92 complexes. This finding is no surprise since the binding pocket P 1 shares similar active sites to its predecessor, i.e. SARS-CoV M pro . Specifically, P 1 encompasses His141 and Cys145 catalytic dyad which are imperative to the substrate-binding mechanism [5] . In additions, the substratebinding residues Tyr161 and His163 [12] are covered in P 1 . Binding pockets P 2 , P 3 , P 5 , P 7 , P 8 , and P 10 are the least favor sites consisting of only one ligand. The rest of the binding pockets involve no more than 5 ligands. To study the correlation of the binding regions to the binding free energy, we present the box plot in Figure 2b to illustrate the energy values through their quartiles. The prevailing binding pocket P 1 is the best region on the SARS-CoV-2 M pro for inhibitor design with the median binding energy being -7.39 kcal/mol. N3 is the best inhibitor candidate for the binding site P 1 with predicted affinity found to be -9.03 kcal/mol. Other binding regions such as P 10 , and P 12 are less common but show their adequate effects on the binding mechanism with their best energy binding affinities calculated at -6.79 kcal/mol and -7.12 kcal/mol, respectively. These potential binding sites can guide drug combination to inhibit coronavirus M pro effectively. By looking further into the interactions between the top inhibitors and the main protease, we have found that N3 forms the most of hydrogen bonds with 9 interactions and 1 covalent bond to the nearby residues as listed in Table 2 and depicted in Figure 3a . All of the hydrogen bonds and covalent bonds found in other complexes in the top 4, namely 5ren, 5rfr, and 5rg1, are associated with N3's interactions, which justifies the most potent binding of N3 in the main protease complex and confirms the robustness of our MathDL models. We notice that two inhibitors T2V and T81 share one common hydrogen bond to Gly143 and one common covalent bond between their Carbons to γ-Sulfur of Gly143 in M pro (see Table 2 and Figures 3b, 3c). Compared to T2V, T81 has one more hydrogen bond but it is a weak one between Brom's T81 and donor from Ser46 at a distance of 3.10Å. Therefore, T2V and T81 have very similar predicted binding energies at -8.66 kcal/mol and -8.65 kcal/mol, respectively. This examination manifests how well our models preserve and capture the physical and chemical properties described in intermolecular bonding interactions. Furthermore, the ligand T9J that binds to M pro in complex 5rg1 with a slightly worse binding energy at -8.57 kcal/mol forms different hydrogen bonds in comparison to two previously mentioned inhibitors (see Table 2 ). Since our models only concern the non-covalent binding affinity, the lack of covalent bond in 5rg1's interactions does not downgrade its binding strength. With two relatively large hydrogen bonding distances (O2-His163: 3.05Å, O3-Glu166: 3.38Å (see Figure 3d )), the binding affinity of 5rg1 is still comparable to the top inhibitors indicating the important roles in acquiring the hydrogen bonds to these residues in the main protease's binding process. In the top 10 inhibitors as listed in Table 1 , there are two non-covalent inhibitors, namely 5rg1 and 6w63. The rest belongs to the class of targeted covalent inhibitors (TCI) in which the Michael acceptor inhibitor interacts with the protein residues, i.e., cysteine, to form a covalent complex successfully neutralizing target's function. However, the major disadvantage of TCIs is the association with the high toxicity risks [13] . TCIs' strong covalent bond can dramatically and irreversibly modify the unintended protein targets in the human body. As a result, the top covalent inhibitors in SARS-CoV PBD-noBA dataset may have little chance to become approved market drugs in comparison to their counterparts such as T9j in 5rg1 and X77 in 6w63. Due to the popularity of the binding site P 1 among 92 interested inhibitors, we mainly analyze the interaction network around the residues in that region. Out of 68 molecules binding to P 1 , there are 62 inhibitors forming at least one hydrogen bond to the nearby amino acid in the SARS-CoV-2 main protease. We have identified 15 different residues in the binding pocket P 1 composing hydrogen bonds to these small molecules. Figure 4 illustrates the frequency of these 15 residues across 62 inhibitors. Based on Figure 4 , Gly143 residue is the most attractive site to form the hydrogen bond. It appears in 71% of 62 intermolecular bonding interactions, followed by Cys145 residue with a frequency of 43.5%. In contrast, Glu166 residue occupies 19.4%, while His163 residue occurs in 17.7% . It is worth noting when these molecules form a hydrogen bond with Cys145, they also constitute another hydrogen bond with Gly143. In all cases, both these residues share the same acceptor. Besides the hydrogen bond network, 45 ligands in the SARS-CoV PDB-noBA dataset form a covalent bond to γ-Sulfur of Cys145. All the top 3 inhibitors are equipped with that covalent bond, whereas only 2 in the top 10 ligands do not have it (see Table S8 in Supporting Information). Furthermore, we are interested in the binding energy distribution associated with the interaction network. Figure 5 depicts the violin plot of that distribution across four categories, namely no H-bond (no hydrogen bond), H-bond (at least one hydrogen bond), no cov. bond (no covalent bond), and cov. bond (at least one covalent bond). Hydrogen bond interactions that are expected to play an important role in the binding mechanism were well captured in our MathDL models. Specifically, while the average energy of inhibitors having none hydrogen bond is -6.2 kcal/mol, the average energy of ones with hydrogen bond is as low as -7.36 kcal/mol. It is noted that our MathDLs only measure the non-covalent binding affinity. The covalent bond appearing at the final covalent complex is not properly accounted for in our framework. Therefore, it is expected that our models sometimes overestimate the covalent-bond inhibitors over the non-covalent-bond candidates. Figure 5 reveals molecules in the group of covalent bonds generally are predicted with lower binding energy with an average being -7.55 kcal/mol in comparison to -6.74 kcal/mol averagely measured on ones without covalent bonds. Figure 6 : Fragment frequencies based on BRICS decomposition of 68 inhibitors of binding site pocket P 1 . L i is the link atom of a certain type described in [14] . To design the lead molecules, it is of importance to have promising fragments from existing inhibitors against the drug targets. Therefore, in the present work, we study all the fragments decomposed from 68 inhibitors attached to the binding site P 1 . To carry out this task, we utilize BRICS algorithm [14] via RDkit [10] . In BRICS model, there are 16 chemical environments indicated by linkers denoted by L 1 , L 2 ,. . . ,L 16 . The BRICS decomposition gives raise to a total of 126 unique fragments, which are all presented in Table S9 in Supporting Information. Figure 6 illustrates top 12 common fragments in terms of their frequencies. Noting that the top fragment, L 1 -C(C)=O, often constitutes a hydrogen bond with Gly143 and in many cases forms a covalent bond with Cys145. Our deep learning-based scoring function, MathDL, was trained on public databases including PDBbind [15] and ChemBL [2] . The PDBbind sets contain all complexes with crystal structures deposited in the PDB with the binding affinities not limited to Kd, Ki, and IC 50 reported in the literature. In this work, we employ the PDBbind v2019, the latest version of its generation. The v2019 version of the PDBbind consists of 17,679 protein-ligand complexes. However, the data preprocessing of the MathDL [19] only retains 17,382 complexes. ChemBL is another manually curated database of bioactive molecules. Currently, ChemBL contains more than 2 million compounds in the SMILES string format. Excluding 30 main protease inhibitors in PDBbind data, we have found other 277 small molecules on ChemBL with reported Kd/IC 50 . Additionally, we have found 4 other SARS-CoV main protease inhibitors in [18] , and presently 3 SARS-CoV-2 main protease inhibitors from [16] . In total, there are 314 ligands bound to SARS-CoV/SARS-CoV-2 main protease having the experimental binding affinities; among them, there are 32 crystal structures. For compounds without the crystal structures, MathPose [4] will be utilized to generate their 3D conformations. The predicted 3D coordinates of these structures are presented in the SDF format and available in Supporting Information. Currently, there are roughly 92 ligands forming crystal complexes with SARS-CoV-2 main protease on PDB without the report of the experimental inhibitor activities. Most of them are deposited by the PanDDA analysis group. To serve model validation purposes, we classify the selected data into five different groups as listed in Table 3 . Specifically, PDBbind v2019 is the biggest set in this compilation with its PDBIDs and experimental binding affinities listed in Table S1 in Supporting Information. PDBbind v2016 core set is a subset of PDBbind v2019 and is formed by 290 complexes representing all protein classes in the refined set of PDBbind v2016 [15, 20] . The PDBIDs of all complexes in the PDBbing v2016 core set are provided in Table S2 . We also collect all M pro complexes of SARS-CoV/SARS-CoV-2 on the PDB, denoted by SARS-CoV PDB, which results in a total of 136 structures (see Table S3 ). Among them, there are 32 ligands with the report of experimental binding affinities denoted by SARS-CoV PDB-BA (see Table S4 ). Furthermore, we are interested in the set of SARS-CoV-2 M pro complexes in the aforementioned SARS-CoV PDB set but their affinities are not presented or undisclosed. We call this set SARS-CoV PDB-noBA with PDBIDs listed in Table S5 . To enrich our training data targeting SARS-CoV/SARS-CoV-2 main protease inhibitors, we gather some inhibitors reported on the literature [2, 18] . For those compounds with only 2D information, we limit ourselves to ones having the similarity score based on the path-based fingerprint FP2 no lower than 0.6 to at least one inhibitor in the SARS-CoV PDB set. As a result, we arrive at a set of 87 structures named SARS-CoV 2D (see Table S6 ). Combining SARS-CoV PDB-BA and SARS-CoV 2D data sets, we finalize a reliable database focusing on SARS-CoV/SARS-CoV-2 main protease inhibitors. Table S7 in Supporting Information presents the PDBIDs as well as the experimental binding energies of these ligands. The MathDL models developed in this work are reformulated from our early model bearing the same name. MathDL was designed for the prediction of various druggable properties of 3D molecules [4] . In the past three years, MathDL has been proved to be the top competitor in D3R Grand Challenges (https://drugdesigndata.org/about/grand-challenge), a worldwide competition in computer-aided drug design. I the present work, we have, for the first time, develop a multitask MathDL (MathDL-MT) to handle the M pro inhibitor dataset. We have also extended our earlier MathDL by including all different datasets (MathDL-All). Figure 7 depicts the framework of the MathDL in which the element-specific algebraic topological representations are integrated with the convolutional neural network (CNN) aiming to predict varied druggable properties such as toxicity, binding affinities, etc. Algebraic topology studies the topological spaces with the use of abstract algebra, which can dramatically simplify the geometric complexity. Persistent homology (PH) is one of the algebraic topology approaches which has the capacity to track the multiscale topological information over different scales along with filtration by characterizing independent components, rings, and higher dimensional voids in space [21] . In this section, we will briefly review the algebraic topology-based representations. Additionally, since we are dealing with the protein-ligand system, therefore, the biological considerations will take into account as well. Simplex. The q-simplex denoted as σ q is the convex hull of q + 1 affinely independent points in R n (n ≥ k). For example, the 0, 1, 2, and 3-simplex is considered as a vertex, an edge, a triangle, and a tetrahedron, respectively. We call the convex hull of each non-empty subset of q + 1 points the face of σ q , and each points are also called the vertices. Simplicial complex. A set of simplices is a simplicial complex denote K which satisfies that every face of a simplex σ q ∈ K is also in K and the non-empty intersection of any two simplices in K is the common face for both. A formal sum of q-simplices in simplicial complex K with Z 2 coefficients is a q-chain is a. A set of all q-chains of the simplicial complex K equipped with an algebraic field (typically Z 2 ) is called a chain group and denoted as C q (K). The boundary operator is defined by ∂ q : C q (K) → C q−1 (K) to relate the chain groups. More specifically, we denote σ q = [v 0 , v 1 , · · · , v q ] for the q-simplex spanned by its vertices, and then the boundary operator can be represented as: Here, The sequence of chain groups connected by boundary operators is called the chain complex and expressed as: The q-cycle group Z q (K) and the q-boundary group B q (K) are defined as Z q (K) = ker(∂ q ) = {c ∈ C q (K) | ∂ q c = ∅} and B q (K) = im(∂ q+1 ) = {∂ q+1 c | c ∈ C q+1 (K)}. The q-th homology group is the quotient group H q (K) = Z q (K)/B q (K). Moreover, the rank of q-th homology group can be computed as rankH q (K) = rankZ q (K) − rankB q (K), which is denoted as the q-th Betti number β q . To be notice that the q-th Betti number count the number of q-dimensional holes that can not be continuously deformed to each other. A filtration of a simplicial complex K is a nested sequence of subcomplexes of K such that ∅ = K 0 ⊆ K 1 ⊆ K 2 · · · ⊆ K m = K. Then the p-persistent qth homology group of K t is defined as: Here the rank of H p q (K t ) counts the number of q-dimensional holes in K t that are still alive in in K t+p , which is called the p-persistent qth Betti number. The persistent homology not only records the topological information at a specific configuration, but also tracks the changes along with the filtration parameters. More specifically, the topological changes will be preserved in the persistent barcodes. In MathDL, we make use of the persistent homology barcodes by dividing them into bins and calculating the birth, death, and persistence incidents in each bin to enrich our algebraic topological representations. The protein-ligand complex is structural and also biological. The persistent homology provides a theoretical approach to encode high-dimensional spatial data of proteinligand complexes into algebraic topological representations. In this section, we address the biological considerations for biomolecular complexity. There are many kinds of interactions that exist in the proteinligand complex, such as electrostatics, hydrogen bonds, and hydrophobic effects. Although persistent homology can capture the interactions among the nearest neighbors, the long-range interactions will be hindered. This difficulty can be avoided via the deployment of the element-specific attention [19] . There are 4 commonly atom types in protein, namely C, N, O, S, and there are 11 commonly atom types in ligand, including C, N, O, S , P, F, Cl, Br, I, H, B. We include Boron in the ligand atom type consideration since it appears in more than 200 small compounds in our training data. The general framework of MathDL is depicted in Figure 7 under exemplified steps. For the details of feature descriptions as well as the deep learning architecture, interested readers are referred to our previous work [19] . MathPose, a 3D pose predictor which converts SMILES strings into 3D poses with references of target molecules, was the top performer in D3R Grand Challenge 4 (GC4) in predicting the poses of 24 betasecretase 1 (BACE) binders [4] . For one SMILES string, around 1000 3D conformations can be generated by various docking software tools such as GOLD [22] , Autodock Vina [23] , and GLIDE [24] . Moreover, a selected set of known complexes is re-docked by three aforementioned docking software packages to generate at least 100 decoy complexes per input ligand used in the machine learning training set. The machine learning labels will be the calculated root mean squared deviations (RMSDs) between the decoy and native structures for the training data of the pose selection task. Furthermore, MathDL models will be set up and applied to select the top-ranked pose for the given ligand. Besides the GC4 challenge, our models have outperformed state-of-the-art scoring functions at the docking power challenge on CASF-2007 and CASF-2013 benchmarks [20] . Those established results attest to the credibility of our MathPose on the 3D structure prediction of small molecules. In this validation task, we will testify our model against 290 complexes in the PDBbing v2016 core set. This is a prevalent test set to assert the scoring ability of a binding affinity prediction model and has attracted lots of research groups to devote the effort to improve the Pearson's correlation coefficient (R p ) and Kendall's [15, 19, [25] [26] [27] . TopBPcon., the consensus model in our published work [19] , attains the highest Rp at 0.861. The current MathDL is followed with the second highest Rp at 0.858 and RMSE = 1.56 kcal/mol. The third place in the list is another TopBP model, TopBP-DL, solely based on the deep learning architectures and its reported Rp is 0.848 [19] . It is noted that all of the machine learning based scoring functions in this comparison were trained on the PDBbind v2016 refined set of 3767 complexes except for our MathDL. Explicitly, MathDL is trained on a much larger training set consisting of 17,211 complexes picked out from the PDBbind v2019 set and SARS-CoV BA set. tau (τ ) on this core set performance [15, 28, 29] . In the current work, we merge the PDBbind v2019, SARS-CoV PDB-BA, and SARS-CoV 2D sets but removing the duplicates and excluding the PDBbing v2016 core set complexes to attain a training set of 17211 complexes. MathDL with the architecture described in Section 3.2.1 is trained on those complexes. The resulting model is utilized to predict the binding affinity of 290 structures in the PDBbing v2016 core set. With the purpose of exploring the most optimal model for this benchmark, MathDL is trained for 1000 epochs. Then, we pick the epoch based on the root-mean-squared error (RMSE) of the PDBbing v2016 core set prediction. We have found that MathDL achieves the smallest RMSE in this experiment at 140 epochs. Specifically RMSE, R p , and τ metrics on the v2016 core set are 1.56 kcal/mol, 0.858, and 0.671, respectively. Meanwhile, the training accuracy is 0.387 kcal/mol in terms of RMSE and its Pearson's correlation coefficient is R p = 0.994. These performances reveal that our MathDL converges very fast and with only 140 epochs and maintains a good balance between training and testing accuracies. This is a state-of-the-art performance since our MathDL is ranked in the second place in comparison to 33 other scoring functions (see Figure 8 ). It is noted that the top model is TopBP con. published in our previous work [19] with R p = 0.861. TopBP con. is the consensus of gradient boosted tree and deep learning-based models. If only the deep learning framework is considered, the performance of TopBP (denoted by TopBP-DL) on the core set of PDBbind v2016 is R p = 0.848. It is worth mentioning that except for our MathDL, all machine learning-based scoring functions listed in Figure 8 were trained on the PDBbind v2016 refined set of 3767 complexes. As mentioned above, the current MathDL is compiled on a much larger training set comprised of 17211 complexes selected from PDBbind v2019 and SARS-CoV BA data. Even the present MathDL has not outperformed its predecessor, i.e., TopBP con. , MathDL is still a preference model since it is trained on a diverse data set covering various protein families and different binding energy ranges. As a result, it is expected to deliver more reliable predictions on the SARS-CoV-2 inhibitor, especially when this main protease family is not included in the training data of previous TopDL models. The resulting MathDL model is labeled as MathDL-Core2016 and is utilized to predict affinities of complexes in SARS-CoV PDB-noBA in Section 2.1. In this section, we testify the performance of our MathDL against 119 inhibitors in the SARS-CoV BA set aforementioned in Table 3 . Among those ligands, there are 32 X-ray crystal structures and the rest is in 2D SMILES string. We employ MathPose to predict 3D structures of those 2D ligands. To carry out the validation, we randomly split the SARS-CoV BA set into 5 non-overlapped folds. In each fold prediction task, MathDL trains on the partial data of SARS-CoV BA in conjunction with PDBbind v2019 set. This situation results in two different ways of training our MathDL model. The first approach is a traditional MathDL architecture with the training set combining both SARS-CoV BA and PDbbind v2019 complexes. The second model makes use of multi-task learning [30] . In each epoch, the weights of the MathDL architecture are learned through the information from PDBbind v2019 set, then only the fully connected layers are trainable when learning SARS-CoV BA structures. Finally, we come up with 10 different MathDL models in which the traditional MathDL frameworks are labeled as MathDL-All-i and multi-task MatDL is named MathDL-MT-i with i running from 1 to 5. In each model, after 100 epochs, we start monitoring which epoch that helps our model achieve the smallest RMSE on the test set. Table 4 reveals that MathDL-All models are well trained with the averaged accuracy RMSE=0.208 kcal/mol, Pearson's correlation coefficient R p =0.996, and Kendall's tau τ = 0.953. Their averaged performances on test data across 5-fold of the SARS-CoV BA set are found to be R p =0.751, τ = 0.553, and RMSE=0.867 kcal/mol. These results endorse the reliability of these models in the binding affinity prediction of SARS-CoV/SARS-CoV-2 inhibitors. Table 4 also lists the training and testing performances of five multi-task learning models. The averaged training performance of the MathDL-MT model is R p =0.997, τ =0.956 and RMSE=0.191 kcal/mol. The accuracy of the multi-task architecture on the test sets is similar to MathDL-All with R p =0.747, τ =0.55, and RMSE=0.862 kcal/mol. With these promising results, it is encouraging to carry out MathDL models to predict unknown binding affinities of SARS-CoV/SARS-CoV-2 inhibitors. SARS-CoV-2 main protease (M pro ) is the most favorable target for COVID-19 drug discovery due to its conservative nature and low similarity with human genes. Structure and binding affinity of protein-drug complexes are of paramount importance for understanding the molecular mechanism in drug discovery. However, there are only two SARS-CoV-2 M pro inhibitor structures available with binding affinities, highlighting current challenges in COVID-19 drug discovery. This work presents the reliable binding affinity prediction and ranking of 92 M pro -inhibitor crystal structures that have no reported experimental binding affinity. We first curate a set of 314 M pro inhibitors with binding affinities from public resources, such as PDBbind, ChemBL and the scattered literature. Among these inhibitors, 87 are retained based on their high similarity with available M pro -inhibitor complex structures and built with three dimensional (3D) poses using our MathPose [4] . Together with 32 another SARS-CoV or SARS-CoV-2 M pro -inhibitor complexes, we compose a training set of 119 reliable SARS-CoV-2 M proinhibitor complexes. Our earlier MathDL models are reformulated to accommodate 119 new complexes and 17,382 complexes from the PDBbind v2019 general set in both single-task and multitask settings, which have never been available before. The resulting MathDL models are rigorously validated via PDDbind v2016 core set benchmark in which it outperforms state-of-the-art models in the literature. Most importantly, our MathDL achieves promising cross-validation accuracies on the SARS-CoV family inhibitors with the averaged Pearson's correlation coefficient as high as 0.75. Additionally, the present work unveils that Gly143 of M pro is the most attractive region to form a hydrogen bond, followed by Cys145, Glu166, and His163. There are 45 inhibitors interacting with SARS-CoV-2 M pro to form covalent complexes. Those covalent bonds are mostly composed between dicarbon monoxide groups in inhibitors and γ-Sulfur on Cys145. There are only two non-covalent complexes in our top 10 ranked, namely 5rg1 and 6w63. To provide a potential resource for lead molecule design, we employ the BRICS algorithm to decompose all the inhibitors of the prominent binding site on M pro and obtain 126 unique fragments. The predicted binding affinities and their ranking of 92 M pro -inhibitor crystal structures, the bonding analysis, and the fragment decomposition have significantly extended current knowledge and understanding of SARS-CoV-2 M pro and inhibitor interactions and thus offered valuable information toward COVID-19 drug discovery. • SupportingTables.xls: Spreadsheets contain information for all supporting tables from S1 to S8. • FileS1.zip: 3D structures generated by our MathPose for 87 ligands in SARS-CoV 2D set. Evolution of the novel coronavirus from the ongoing wuhan outbreak and modeling of its spike protein for risk of human transmission Chembl web services: streamlining access to drug discovery data and utilities The Protein Data Bank MathDL: Mathematical deep learning for d3r grand challenge 4 Design of wide-spectrum inhibitors targeting coronavirus main proteases Structure of main protease from human coronavirus nl63: insights for wide spectrum anti-coronavirus drug design Structure of Mpro from COVID-19 virus and discovery of its inhibitors Drug development and medicinal chemistry efforts toward sars-coronavirus and covid-19 therapeutics PanDDA analysis group deposition of SARS-CoV-2 main protease fragment screen Open-source Cheminformatics Virtual computational chemistry laboratory-design and description Reversible unfolding of the severe acute respiratory syndrome coronavirus main protease in guanidinium chloride Managing the challenge of chemically reactive metabolites in drug development On the art of compiling and using 'drug-like' chemical fragment spaces Comparative assessment of scoring functions: The CASF-2016 update Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors Discovery of baicalin and baicalein as novel, natural product inhibitors of SARS-CoV-2 3CL protease in vitro Development of broad-spectrum halomethyl ketone inhibitors against coronavirus main protease 3CL-pro Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening Development and validation of a genetic algorithm for flexible docking AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy Development of a protein-ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions DG-GL: Differential geometry-based geometric learning of molecular datasets Onionnet: a multiple-layer intermolecular-contactbased convolutional neural network for protein-ligand binding affinity prediction A review of mathematical representations of biomolecular data Protein-Ligand absolute binding affinity prediction via 3D-convolutional neural networks Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks This work was supported in part by NIH grant GM126189, NSF Grants DMS-1721024, DMS-1761320, and IIS1900473, Michigan Economic Development Corporation, Bristol-Myers Squibb, and Pfizer. The authors thank The IBM TJ Watson Research Center, The COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance.