key: cord-0435209-mnc8nt90 authors: Chenthamarakshan, Vijil; Hoffman, Samuel C.; Owen, C. David; Lukacik, Petra; Strain-Damerell, Claire; Fearon, Daren; Malla, Tika R.; Tumber, Anthony; Schofield, Christopher J.; Duyvesteyn, Helen M.E.; Dejnirattisai, Wanwisa; Carrique, Loic; Walter, Thomas S.; Screaton, Gavin R.; Matviiuk, Tetiana; Mojsilovic, Aleksandra; Crain, Jason; Walsh, Martin A.; Stuart, David I.; Das, Payel title: Accelerating Inhibitor Discovery for Multiple SARS-CoV-2 Targets with a Single, Sequence-Guided Deep Generative Framework date: 2022-04-19 journal: nan DOI: nan sha: 16216cd73e1ee6843c45fcb2be980905eb85753f doc_id: 435209 cord_uid: mnc8nt90 The COVID-19 pandemic has highlighted the urgency for developing more efficient molecular discovery pathways. As exhaustive exploration of the vast chemical space is infeasible, discovering novel inhibitor molecules for emerging drug-target proteins is challenging, particularly for targets with unknown structure or ligands. We demonstrate the broad utility of a single deep generative framework toward discovering novel drug-like inhibitor molecules against two distinct SARS-CoV-2 targets -- the main protease (Mpro) and the receptor binding domain (RBD) of the spike protein. To perform target-aware design, the framework employs a target sequence-conditioned sampling of novel molecules from a generative model. Micromolar-level in vitro inhibition was observed for two candidates (out of four synthesized) for each target. The most potent spike RBD inhibitor also emerged as a rare non-covalent antiviral with broad-spectrum activity against several SARS-CoV-2 variants in live virus neutralization assays. These results show that a broadly deployable machine intelligence framework can accelerate hit discovery across different emerging drug-targets. De novo molecular design, the proposing of novel compounds with desired properties, is a challenging problem with applications in drug discovery and materials engineering. 
For instance, a key objective in the drug discovery workflow is to identify candidate molecules, known as hits, that can interact with and inhibit a known drug-target protein with measurable activity. Searching for hit compounds that serve as the chemical starting points for further design of drug candidates typically involves high-throughput screening of libraries containing standard chemical compounds or smaller chemical fragments. Success rates for this method of hit discovery are between 0.5 and 1 percent 1 , depending upon the size of the library screened (typically on the order of 10^4 entries) and target characteristics. This low success rate is in part due to the immense search space, now estimated to span between 10^33 and 10^80 feasible molecules 2 , from which only a minute fraction typically possesses the traits sought. Exhaustive enumeration of this vast chemical space is infeasible. Consequently, the cost of developing a single new drug is high, reaching up to $2.8 billion, while the duration from concept to market typically exceeds a decade 3 .

Figure 1. Overview of our inhibitor discovery workflow driven by CogMol, a sequence-guided deep generative framework. (a-b) illustrate molecular VAE training on large-scale chemical SMILES (x) data and mapping of the existing protein-ligand affinity relations on the VAE latent space (z) by training a binding predictor, respectively. For the latter, we leverage pre-trained neural network (NN) embeddings of a large volume of protein sequences. (c) shows Controllable Latent Space Sampling or CLaSS, which samples from the model of VAE latent vectors by using the guidance from a set of molecular property predictors (e.g., protein binding), such that for a given target protein sequence, sampled z vectors corresponding to strong target binding affinity are accepted while vectors corresponding to weak target binding affinity are rejected. The accepted z vectors are then decoded into molecular SMILES. (d) Candidates are then ranked and filtered according to chemical properties, docking score to target structure, and predicted retrosynthetic feasibility and toxicity. (e) A small set of prioritized molecules are synthesised, followed by wet lab testing in specific in vitro assays to confirm target inhibition. (f) In the present case, for each target, of the four molecules tested, two showed promising levels of inhibition. We also report approximate sample sizes and timeline for each stage of our discovery workflow. Note the timeline does not include the training and testing of the generative and predictive machine learning models.

In addition to the need for thousands of screening experiments, the initial selection of the library frequently requires detailed structural information on the target protein of interest, which is often not readily available. Further, discovery is often performed using hand-crafted rules and heuristics to link existing fragments and/or to avoid impractical synthetic pathways. Therefore, a more efficient approach is urgently needed, to enable distillation of novel and promising molecules from the vast chemical space. This approach will enable experimental validation of a small selection of candidates, resulting in a higher hit discovery rate at reduced time and cost. Deep learning-based generative models have the potential to enable discovery of novel molecules with desired functionality in a "rule-free" manner, as they aim to first learn a dense, continuous representation (hereafter referred to as a latent vector) of known chemicals and then modify the latent vectors to decode into new molecules. Such models thus offer access to previously unexplored chemical space unrestricted by conscious human bias.
However, for the task of target-specific drug-like inhibitor design, an "inverse molecular design" 4 approach must be utilized, where the navigation through the learned chemical representation is guided by molecular property attributes, such as target inhibition activity and drug-likeness. In the scenario of designing inhibitors for a new target, a sufficient number of exemplar molecules is required, which is likely unavailable and requires costly and time-consuming screening experiments to obtain. As the majority of existing deep generative frameworks (see Sousa, et al. 5 for a review of generative deep learning for targeted molecule design) still rely on learning from target-specific libraries of binder compounds, they limit exploration beyond a fixed library of known and monolithic molecules, while preventing generalization of the machine learning framework toward more novel targets. As a result, while some studies 6-8 that use deep generative models for target-specific inhibitor design have been experimentally validated, rarely have such models demonstrated sufficient versatility to be broadly deployable across dissimilar protein targets, without having access to detailed target-specific prior knowledge (e.g., target structure or binder library). Our work demonstrates the real-world applicability of a single, deep-generative inhibitor design framework across different target proteins, while only requiring more readily available target sequence information to guide the design. Here, we employ a recently published generative deep learning model, CogMol 9 , to propose novel and chemically viable inhibitor designs for two important and distinct SARS-CoV-2 targets: the main protease (M pro ) and the receptor binding domain (RBD) of the spike (S) protein.
In this study, the deep generative framework, built upon large-scale data of chemical molecules, protein sequences, and protein-ligand binding, serves as a foundation for inhibitor molecule design, as it can be extrapolated to new target sequences not present in the original training data. A set of novel molecules targeting SARS-CoV-2 proteins, which was designed by CogMol, was shared under the Creative Commons license in April 2020 in the IBM COVID-19 Molecule Explorer platform 10 . Here, we provide experimental validation of the broad utility and readiness of the CogMol deep generative framework, by synthesizing and testing the inhibitory activity of a number of prioritized designs against SARS-CoV-2 M pro and RBD of the spike protein. We further demonstrate the applicability of the binding affinity predictor model used in the CogMol framework by subjecting it to virtual screening of a library of lead-like chemicals and successfully identifying three compounds that were ultimately confirmed to be bound at the active site of the M pro by crystallographic analysis, one of which showed micromolar inhibition. To our knowledge, the present study provides the first validated demonstration of a single generative machine intelligence framework that can propose novel and promising inhibitors for different drug-target proteins with a high success rate, while only using protein sequence information during design. The demonstrated broad-spectrum antiviral activity of the designed spike inhibitor against the SARS-CoV-2 variants of concern further establishes the potential of such a deep generative framework to accelerate and automate the hit discovery cycle, a process known to suffer from low yield and high attrition rates. 
The overall inhibitor discovery pipeline is described in Figure 1 and consists of three main steps: (a-c) candidate design in a target-conditioned manner using the deep generative model framework, (d) in silico screening for candidate prioritization, and (e) wet lab validation of prioritized molecules. For de novo molecule design, we used the deep generative framework CogMol as a foundation, which enables the design of inhibitor molecules for different targets, without requiring training or fine-tuning the model on target-specific data. Hereafter, we refer to machine-designed novel compounds as de novo compounds throughout the rest of the paper. CogMol works as follows: first, it uses a variational autoencoder (VAE) 11 , a popular class of generative models, as the base generative model (Figure 1a). A VAE is comprised of an encoder-decoder pair. The encoder neural network maps the simplified molecular-input line-entry system (SMILES) 12 string of a molecule into a low-dimensional representation. We will denote the encoder as q φ (z|x), where z is a latent encoding of input SMILES x and φ represents the encoder parameters. The decoder p θ (x|z), which is also a neural network, then converts the latent vector z back into the reconstructed SMILES x. The encoder in a VAE is probabilistic in nature as it outputs latent encodings that are consistent with a Gaussian distribution. The decoder is therefore stochastic: it samples from the latent distribution to produce an output x. The encoder-decoder pair is trained end-to-end to optimize two objectives simultaneously. The first objective includes minimizing a loss term to ensure accurate reconstruction of an input SMILES from the corresponding latent embedding. The second objective consists of a regularization term to constrain the latent encodings to a standard normal distribution. The resulting latent space is continuous, enabling smooth interpolation as well as random sampling of new molecules from the latent space.
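The two training objectives described above can be illustrated with a minimal NumPy sketch. It assumes a diagonal Gaussian encoder output (a mean and a log-variance per latent dimension) and a decoder that emits per-token logits over a SMILES vocabulary. The function names, array shapes, and the equal weighting of the two terms are illustrative assumptions, not details taken from the CogMol implementation.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reconstruction_nll(logits, token_ids):
    """Negative log-likelihood of the input tokens under the decoder logits.

    logits: (batch, seq_len, vocab); token_ids: (batch, seq_len) integer array.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, token_ids[..., None], axis=-1)[..., 0]
    return -picked.sum(axis=-1)

def vae_loss(logits, token_ids, mu, log_var):
    """Batch mean of reconstruction loss plus KL regularizer (assumed equal weights)."""
    return np.mean(reconstruction_nll(logits, token_ids)
                   + kl_to_standard_normal(mu, log_var))
```

The KL term vanishes exactly when the encoder outputs a standard normal, which is what pulls the latent encodings toward the prior and makes random sampling from the latent space meaningful.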
To learn meaningful latent molecular representations that have general knowledge about diverse chemicals, in CogMol the VAE is trained on more than one and a half million small molecules from public databases (see the Materials and Methods section for details). Once the chemical latent representation is learned, CogMol performs attribute-conditioned sampling on that representation to generate entirely new molecular entities with properties biased toward the design specifications. Specifically, the goal is to design novel drug-like molecules with a high binding affinity to the target protein of interest. Two z-based property predictors are used: a drug-likeness (QED) predictor and a target-molecule binding (strong/weak) predictor. Both predictors used the z encodings of molecules as input. For the binding predictor, the protein sequence embeddings from a pre-existing deep neural net 13 were concatenated with the molecular latent encodings and trained on the general protein-molecule binding affinity data available in the BindingDB database (Figure 1b). Performance of the attribute predictors is reported in the Materials and Methods section. Given a target protein sequence of interest, those two predictors are used together to sample molecules with desired properties from the latent space, by using the CLaSS sampling method proposed by Das, et al. 14 . CLaSS relies on a rejection sampling schema to accept/reject molecules, while sampling from a density model of the z embeddings. Acceptance/rejection criteria are determined by the output probabilities of property predictors. See the Materials and Methods and the Supplementary Details sections for further details on CLaSS. Note, the CogMol generative framework relies on a chemical VAE, a protein sequence encoder, and a set of molecular property predictors, all of which are pre-trained on large amounts of broad data, i.e., chemical SMILES, protein sequences, and available protein-ligand binding affinities.
The generative framework thus has important information already encoded about protein sequence homologies, chemical similarities, and protein-drug binding relations. This allows the framework to serve as a foundation, as it is instantly adaptable to different targets, without any further model retraining or fine-tuning on target-specific data. The approach further saves time and cost associated with generating target-specific binder libraries or resolving the target structure, which are typically considered as privileged information, i.e., not broadly available. The model can also extrapolate to a target that does not share high similarity with the training data. This is indeed the case for the SARS-CoV-2 targets considered (see SI Table 2 ) where the lowest Expect value, a measure of sequence homology (lower values indicate high homology), with respect to the BindingDB protein sequences is 0.51 (query coverage = 40%) and 1.9 (query coverage = 26%) for M pro and spike RBD, respectively. This analysis implies that both targets are not significantly similar to the protein sequences in the BindingDB database that was used for training the binding predictor, spike RBD being more distinct than M pro . The next stage includes in silico screening of generated candidates ( Figure 1d ) to prioritize them for synthesis and wet lab evaluation. For practical considerations, we sought to keep the number of prioritized machine-designed de novo compounds to be synthesized and tested very small -around 10 for each target, as opposed to screening thousands of chemicals in a more traditional set-up. We used a combination of physicochemical properties (estimated using cheminformatics), target-molecule binding free energy predicted by docking simulations, and retrosynthesis and toxicity predictions by using machine learning. For retrosynthesis prediction, we used the IBM RXN platform 15 that is based on a transformer neural network trained on chemical reaction data. 
For toxicity prediction, an in-house neural network-based model trained on publicly available in vitro and clinical toxicity data was used. See Materials and Methods for details of candidate filtering and prioritization criteria. At the end of the in silico screening, the number of candidates per target is around 100, which was further narrowed down to around 10 per target by using the discretion of Enamine Ltd., the chemical manufacturer. Feasibility of the predicted reaction schema, as evaluated by organic synthetic chemist experts, as well as commercial availability and cost of the predicted reactants, was used to finalize the candidate synthesis list. The final four candidates for each target were chosen based on the synthesis cost and delivery time, as provided by Enamine. Figure 2 lists the eight de novo compounds designed by the generative machine learning framework that were synthesized (see SI Tables 3-4 for the predicted molecular properties). Details of the experimental synthesis protocols are provided in Methods and SI Section C.2. We also provide a comparison between the predicted and the actual retrosynthetic pathways for those eight machine-designed compounds in SI Table 5 . Five were synthesized using the top predicted pathway of IBM RXN. For two compounds, GEN626 and GEN777, predictions were found to be unsuccessful, so alternative pathways as designed by Enamine were employed (see Methods for details). For GXA104, reactants included in the RXN prediction were not available, so an alternative route was employed. Overall, these results show the usefulness of machine learning-based retrosynthesis predictions for reliably identifying plausible candidates and recommending viable synthesis routes. Enzymatic inhibition by the de novo M pro -specific molecules was measured by solid phase extraction purification linked to mass spectrometry (RapidFire MS) 16 . The results are presented in Figure 3a .
Out of the four de novo compounds tested for this target, GXA70 and GXA112 both showed M pro inhibition in the micromolar range, with IC 50 values of 43 µM and 34.2 µM, respectively. This implies a 50% success rate of hit discovery for M pro . We further tested the generalizability of the pIC 50 predictor (trained directly on the molecular SMILES and protein sequences) by validating predictions on selected commercially available lead-like compounds from the Enamine Advanced Collection 17 . For this purpose, we selected the top three Enamine compounds based on their predicted pIC 50 . One of these Enamine compounds showed inhibition (IC 50 = 35.5 µM). Based on these results, we co-crystallised M pro in the presence of this compound (ID Z68337194) and successfully obtained crystals (see SI Table 6 ). The crystal structure determined revealed Z68337194 bound in the active site pocket. Structures of the other two commercially available compounds selected based on the pIC 50 predictions were also found bound to the active site of M pro , although these compounds showed no detectable inhibition of M pro using the RapidFire mass spectrometry-based assay. Detailed analysis of the structure obtained for the complex of M pro with Z68337194 (see Figure 3b -d) reveals that the sulphonamide group sits in the P4 subsite 18 Glu166. This interaction mimics that made by the P4 site amide of nirmatrelvir (PF-07321332) 19 . Z68337194 occupancy refines to approximately 50%. In the active site, shifts are observed in the positions of Pro168, Leu167, Glu166, and Met165 to accommodate ligand binding. The compound does not sit deeply in the active site and does not interact with the catalytic machinery, providing opportunities to elaborate upon the compound in order to take advantage of further subsites. 
In the captured crystal form, the active site sits at the interface between symmetry related protein monomers and as a result a symmetry related molecule provides additional interactions, primarily a stacking interaction between the ligand phenylamine ring and Pro252. Additionally, a hydrophobic pocket in the symmetry mate formed primarily by Gln256 and Val297 accommodates the chlorinated ring. For the CogMol-designed compounds targeting the spike RBD, we measured their neutralization ability using a spike-containing pseudotyped lentivirus and a live viral isolate. These results are summarized in Figure 4 . Out of the four candidates, GEN725 and GEN727 showed IC 50 values less than 50 µM (18.7 µM and 2.8 µM, respectively), indicating discovery of novel hits with reasonable inhibition of the pseudovirus at a 50% success rate (Figure 4a). Importantly, GEN727 exhibited live virus neutralization ability as well (Figure 4b). We further checked if GEN727 is effective across different SARS-CoV-2 variants. We compared the neutralization of viral variants of concern (VOCs), namely Alpha, Beta, Delta and Omicron, with neutralization of Victoria (SARS-CoV-2/human/AUS/VIC01/2020), a Wuhan-related strain isolated early in the pandemic from Australia, in both pseudovirus and live virus. Figure 4c shows that GEN727 neutralizes spike-containing pseudovirus across all VOCs with an IC 50 value between 0.7-2.8 µM. Live virus data also shows inhibition with an IC 50

Figure 4 (caption, partial). The most effective compound, GEN727, was selected for a pseudoviral neutralization assay against Victoria, Alpha, Beta, Gamma, Delta and Omicron variants of concern (VOCs), as well as (d) the live-virus neutralization assay. Error bars show the standard error of each measurement over two trials.

performed thermofluor measurements to determine if GEN727 affected the stability of the spike.
The presence of the compound appeared to reduce the speed of the transition of the spike to a less stable form; after overnight incubation at pH 7.5, very little of the spike population remained in the more stable form with the higher T m of 65°C (see SI Figure 9 ). In order to characterize the novelty of the de novo bioactive hits, we identified the nearest compound from the PubChem database, in terms of their Tanimoto similarity 25 estimated using Morgan fingerprints 26 . Figure 5 reveals that none of the de novo molecules shares ≥ 0.7 Tanimoto similarity with PubChem molecules. We further computed the Tanimoto similarity of the de novo compounds to known SARS-CoV-2 M pro inhibitors in literature. These results are shown in Table 1 . In this category, specifically, we considered the following: an aminopyridine hit identified in the COVID-19 Moonshot initiative 20 , X77 identified using ultralarge docking 21 , the oral inhibitor S-217622 from reference 22 , Nirmatrelvir in PAXLOVID 19 , an α-ketoamide inhibitor (Compound 21 from Zhang, et al. 23 ), and Molnupiravir 24 . Consistently, the CogMol-designed inhibitors show high dissimilarity (as indicated by a low Tanimoto similarity around 0.1) to existing SARS-CoV-2 M pro inhibitors. As experimental determinations of the structure of either M pro or the spike protein in complex with the validated de novo inhibitors were not fruitful, we used docking simulations to provide insight into the plausible binding modes. Docking simulations on the generated molecules were performed in the presence of their respective target structure: PDB ID: 6LU7 18 for M pro and PDB ID: 7Z3Z for spike RBD (see Methods for details). As shown in Figure 6 , both machine-designed M pro inhibitors, GXA112 and GXA70, revealed mainly hydrophobic contacts to the residues from the P1 and P2 subsites which are the hotspots of interactions 18 .
The hydrogen bonding pattern revealed by the two molecules is, however, starkly different: GXA112 forms hydrogen bonding mainly with P1' site (T25), whereas GXA70 interacts with the P2 residues (D187 and Y54). The non-extensive and diverse interaction pattern of the de novo and commercially sourced M pro inhibitors reported in this study is consistent with reported observations for non-covalent inhibitors 27 . For the validated de novo spike inhibitor, docking simulation (see Figure 7 ) reveals that GEN727 contacts with several hydrophobic residues, such as Tyr365, Tyr369, and Phe374, from RBD. Those residues that constitute the lipid binding pocket of the spike RBD are conserved across seven coronaviruses that infect humans 28 . Also, the docking strikingly recapitulates the binding of the natural lipid (see Figure 7d ), suggesting that the lipid binding function maintains the conserved site targeted by GEN727. Therefore, binding of GEN727 might stabilize the closed form of the spike, reducing receptor interactions. In line with that, the thermofluor results (SI Figure 9 ) showed an (albeit weak) indication that incubation of spike with GEN727 somewhat destabilized the spike, suggestive of a direct interaction underlying its broad-spectrum neutralization ability. We also perform docking on the other de novo-designed spike inhibitor that showed pseudoviral neutralization, GEN725, which can be found in SI Figure 10 , revealing an interaction pattern similar to that of GEN727. The discovery of therapeutic candidates for diseases, including COVID-19, has been greatly advanced by the combined power of numerous in silico approaches. Nevertheless, even the most effective methods face broad challenges that are at the same time inherent to general inverse molecular design tasks and specific to biological target-ligand binding chemistry. 
The first of these pertains to the vastness of the chemical space being explored and its impact on the throughput and practical utility of the prevailing methods. For example, the use of docking or molecular simulation methods to screen on the order of 10^8 to 10^9 commercially available compounds would incur a prohibitively high computational cost, estimated to reach 10 CPU years 21 per target (as opposed to screening of less than a thousand machine-designed de novo candidates via docking in the present study). The second challenge is availability of critical information: while methods such as pharmacophore modeling and molecular docking have been used successfully in virtual screening or design of molecules 21, 23, 29-31 , such approaches generally rely upon initial design constructs obtained from available crystal structure(s) of a target protein bound to a candidate compound or fragment hits. Such information is not guaranteed to be available for all drug targets of interest and may take months to derive experimentally, and consequently these approaches are not broadly applicable to the case where such structures are unknown. Recently, the field of structural biology has been revolutionized by deep-learning based methods, i.e., AlphaFold 32 and RoseTTAFold 33 , for predicting three-dimensional structure of a protein from its sequence. Whilst predicting structures with often astonishing accuracy, the structural models derived from neural networks are still relatively limited in aiding the understanding of natural protein function, in particular understanding the interactions with protein partners or small ligands. Therefore, the deduction of functional ligand and drug interaction still remains predominantly reliant on resource-intensive

In general, reliance on privileged information (the target protein structure and/or known hits) confines the discovery space to the neighborhood of known chemical entities.
This dependency therefore presents a practical challenge to expand the accessible chemical exploration space and to devise more readily generalizable approaches to inhibitor design for multiple targets, the structure and binders of which may not be known. This work establishes the basis for an alternative discovery paradigm, wherein a generative model is used to discover novel inhibitor hits for different protein targets efficiently. To our knowledge, this is the first validated demonstration of a single generative model enabling successful and efficient discovery of inhibitor molecules for two different target proteins, based only upon the protein sequence and without the prior knowledge of target-specific ligand binding data or target structure. Previous generative machine learning models that have been subject to experimental validation of de novo-designed molecules were primarily either trained or fine-tuned on a target-specific ligand library 6, 7, 34-38 . The sequence information of new drug targets typically emerges at a much faster (days vs. months) pace than their detailed structural information, thanks to the latest advances in sequencing. The structural deduction of target-ligand interaction takes even longer. In contrast, as shown in Figure 1 , it took us less than a week to design and prioritize the set of candidate molecules to be synthesized and tested in wet lab for the two SARS-CoV-2 targets, as our approach does not rely on target structure or binder information. The information on SARS-CoV-2 sequences was made publicly available starting around January of 2020 and CogMol-designed candidates were open-sourced in the IBM COVID-19 Molecule Explorer platform in April 2020. The prioritized de novo compounds were ordered in August 2020, and the first round of wet lab validation was completed in October 2020.
This rapid pace of novel drug-like inhibitor discovery across two distinct drug targets, when the world was experiencing a pandemic, shows the potential of a sequence-guided generative machine learning-based framework to help with better pandemic preparedness and other global urgency. The overall success rate of hit discovery found is 50% for both targets, which required synthesizing and screening only four compounds per target, as opposed to <10% obtained typically using high-throughput screening 1 . Additionally, the validated hits reported in this study appear to be novel, based on molecular similarity analyses with existing chemicals and SARS-CoV-2 inhibitors, indicating impressive creative ability of the generative framework, which is not possible when screening known compounds. The efficiency of hit discovery realized here and the demonstrated generalizability to different targets advocate for pre-training on a large volume of general data, e.g., molecular SMILES, protein sequences, and protein-ligand binding affinities. Conceptually this is a key feature of so-called foundation AI models 39 , which are trained on broad data at scale and can be easily adapted to newer tasks. This perspective is also consistent with recent work, establishing the informative nature of a deep language model trained on large number of protein sequences, in terms of capturing fundamental properties 14, 40 . The broad-spectrum efficacy across SARS-CoV-2 VOCs of the most potent spike hit observed is a further example: the VOC sequences were never made available to the generative framework during training or inference. Moreover, to our knowledge, this is the first report of a novel spike-based non-covalent inhibitor that exhibits broad-spectrum antiviral activity. This contrasts with therapeutic monoclonal antibodies, the only drugs currently in use that target the spike protein, where rather few are effective across VOCs 41 . 
Taken together, the results presented here establish the efficiency, generality, scalability, and readiness of a generative machine intelligence framework for rapid inhibitor discovery against existing and emerging targets. Such a framework, particularly when combined with autonomous synthesis planning and robotic synthesis and testing 8 , can further enhance preparedness for novel pandemics by enabling more efficient therapeutic design. The generality and efficiency of the mechanisms employed in CogMol for precisely controlling the attributes of generated molecules, by plugging in property predictors post-hoc to a learned chemical representation, make it suitable for broader applications in advancing molecular and material discoveries. For example, the framework has already enabled novel photoacid generator molecule design in a data-efficient manner for performant and sustainable semiconductor manufacturing, which has been validated by subject matter experts 42 . There remains significant scope for improving the discovery power of the framework: incorporation of the 3D structural information (when available) 43, 44 and further constraining the generations (e.g., solubility, number of hydrogen bonding donor/acceptor sites, structural diversity) are potential directions for further work. Iterative optimization methods 45 can be adopted to improve initial hits by querying a set of molecular property evaluators along with a retrosynthesis predictor. Active learning paradigms can also be explored for improving process efficiency.

CogMol overview

SMILES VAE as a molecule generator: CogMol leverages a variational autoencoder 11, 46 paradigm as the base generative model for molecules. The encoder in the VAE encodes molecules to a latent vector representation. The decoder maps latent vectors back to molecules. New molecules are generated by sampling from the latent space. Here, molecular SMILES is used as the input and output to the encoder and the decoder, respectively.
A bidirectional Gated Recurrent Unit (GRU) with a linear output layer was used as the encoder. The decoder contained a 3-layer GRU with a hidden dimension of 512 units and dropout layers with a dropout probability of 0.2. The parameters of the encoder-decoder pair are learned by optimizing a variational lower bound on the log-likelihood of the training data. The loss objective comprises a reconstruction loss and a Kullback-Leibler (KL) divergence term (a measure of divergence between the fixed prior distribution p(z), standard normal in this case, and the learned distribution q φ (z|x)):

$$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{\mathrm{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right),$$

where p θ (x|z) denotes the decoder distribution. This implies that new samples can be generated from random points in the latent space, while points close in the latent space will be decoded into chemically similar molecules. The VAE was first trained for 40 epochs on 1.6M chemical molecules from the MOSES benchmarking dataset 47 , which was chosen from the larger ZINC Clean Leads 48 collection. Then, along with the KL and reconstruction loss, the VAE was jointly trained for another 15 epochs to predict the molecular attributes QED and synthetic accessibility (SA) from the latent vectors z. Two separate linear regression models were trained, such that the VAE latent space becomes organized based on those physical properties and thus serves as an approximation of the joint probability distribution of molecular structure and the chemical properties 49 . The training was then continued for 50 epochs on around 211k ligand molecules from the BindingDB database 50 . This paradigm therefore served as a molecule generator that is unbiased toward any particular target. The final VAE generates SMILES strings by sampling from q φ (z|x) that are 99% unique and exhibit greater than 90% chemical validity, while the root-mean-square errors (RMSE) on the QED and SA predictions are 0.0262 and 0.0175, respectively.
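To make the two terms of the training objective concrete, the following is a minimal sketch of a per-molecule VAE loss. It assumes a diagonal Gaussian posterior q φ (z|x), for which the KL divergence to a standard normal prior has a closed form; the function names and the `kl_weight` parameter are illustrative and are not taken from the CogMol code.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumes a diagonal Gaussian posterior (an assumption of this sketch).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_nll(true_token_probs):
    """Negative log-likelihood of the true SMILES tokens.

    `true_token_probs` holds the probability the decoder assigned to each
    correct token of the input SMILES string.
    """
    return -sum(math.log(p) for p in true_token_probs)


def vae_loss(true_token_probs, mu, log_var, kl_weight=1.0):
    """Reconstruction loss plus (optionally weighted) KL term."""
    return reconstruction_nll(true_token_probs) + kl_weight * kl_to_standard_normal(mu, log_var)
```

A posterior that matches the prior exactly (zero mean, unit variance) contributes no KL penalty, so the loss reduces to the reconstruction term alone.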
Molecular attribute predictors for conditional generation: Two predictors trained on the latent z vectors were used to design target-specific inhibitor molecules that are also drug-like. The QED regressor comprised 4 hidden layers with 50 units each and ReLU nonlinearity. Further, a target-chemical binder (strong/weak) predictor was trained on the latent z vectors of chemicals and the pretrained protein sequence embeddings 13 , using the data released as part of DeepAffinity 51 . A pIC 50 value greater than 6 was used as the threshold to decide whether a compound was a strong binder. The protein embeddings and the molecular embeddings were concatenated and passed through a single hidden layer with 2048 units and ReLU nonlinearity. The z-based QED and pIC 50 predictors yield RMSEs of 0.0281 and 1.282, respectively. This set of predictors was used for controlled sampling from the VAE model to design molecules with desired attributes.

CLaSS sampling used for conditional generation in CogMol: We briefly describe Conditional Latent (attribute) Space Sampling (CLaSS) 14 here. CLaSS uses (i) a density model of the VAE latent representation, and (ii) a set of molecular attribute predictors trained on the VAE latent vectors, to generate molecules in an attribute-controlled manner. For this purpose, a rejection sampling approach utilizing Bayes' theorem is used. To elaborate further, first an explicit density model is learned on the latent embeddings of the training data to ensure sampling is uniformly random. A Gaussian mixture model with 100 components and diagonal covariance matrices was used for this purpose. Assuming the attributes are all independent of each other and can be conditioned on the latent embeddings (i.e., the latent space encompasses all combinations of attributes), Bayes' rule was then used to define the conditional probability of a sample, given certain properties, in terms of the predictor models above.
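The attribute-controlled sampling described above can be illustrated with a small rejection sampler. This is a sketch under stated assumptions, not the CogMol implementation: the mixture parameters are passed in by hand rather than fitted, the two toy predictors are hypothetical stand-ins for the binding and QED models, and the `max_draws` safeguard is added for illustration.

```python
import math
import random


def sample_diag_gmm(weights, means, stds, rng):
    """Draw one latent vector z from a diagonal-covariance Gaussian mixture."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]


def class_rejection_sample(weights, means, stds, predictors, n_accept, rng,
                           max_draws=100_000):
    """Accept z with probability equal to the product of predictor scores.

    Each predictor maps z to a score in [0, 1]. Attributes are treated as
    conditionally independent given z, so the scores are multiplied.
    """
    accepted = []
    for _ in range(max_draws):
        if len(accepted) >= n_accept:
            break
        z = sample_diag_gmm(weights, means, stds, rng)
        accept_prob = math.prod(p(z) for p in predictors)
        if rng.random() < accept_prob:
            accepted.append(z)
    return accepted


def toy_binding_score(z):
    """Hypothetical strong-binder probability; favors z[0] > 0."""
    return 1.0 / (1.0 + math.exp(-4.0 * z[0]))


def toy_qed_score(z):
    """Hypothetical drug-likeness probability; favors z[1] > 0."""
    return 1.0 / (1.0 + math.exp(-4.0 * z[1]))
```

In the workflow described in the text, the accepted z vectors would then be passed to the VAE decoder to obtain SMILES strings; that step is omitted here.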
Finally, we employ this definition in a rejection sampling scheme, such that samples drawn from the density model are accepted according to the product of the attribute predictor scores. For more details on the algorithm, see SI Section C.1. Generating the 875k samples for each target took around two days using an NVIDIA Tesla K80 GPU.

The filtering criteria included molecular weight (MW) less than 500 Da, QED greater than 0.5, SA less than 5, and octanol-water partition coefficient (logP) less than 3.5. MW, SA, logP, and QED were calculated using the RDKit toolkit 52 . A pIC 50 predictor trained on DeepAffinity 51 data was also used for ranking the designed molecules based on predicted affinity (AFF). A SMILES-based binding affinity (pIC 50 ) predictor was used for this purpose. SMILES sequences were first embedded using long short-term memory units (LSTMs). Those SMILES embeddings were then concatenated with pre-trained protein embeddings 13 , resulting in an RMSE of 0.8426 on the test data. A threshold for predicted pIC 50 affinity with the respective target sequence was set: greater than 8 for molecules targeting M pro and greater than 7 for molecules targeting the spike RBD. This affinity predictor was also used to estimate target selectivity (SEL) 9 , defined as the excess affinity to the target compared to a random set of proteins; lack of selectivity is a known cause of drug candidate failure.

The molecules were also evaluated for predicted toxicity 53 across a total of 12 in vitro endpoints 54 and one clinical endpoint 55 . Morgan fingerprints were used as the input features for the toxicity prediction model. A multitask deep neural network containing a total of four hidden layers was used 53 : two layers were shared across all toxicity endpoints and two were specific to each of the endpoints. A ReLU activation was used for all layers except for the last, for which a sigmoid activation was used.
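The property and affinity thresholds above amount to a simple conjunctive filter. The sketch below applies them to precomputed property values; the `Candidate` record, its field names, and the use of a mean over off-target affinities for selectivity are assumptions made for illustration (the text defines selectivity only as excess affinity relative to a random set of proteins). Requiring zero flagged toxicity endpoints reflects the workflow's rule that only molecules with no predicted toxicity are progressed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Precomputed properties of one generated molecule (hypothetical record)."""
    smiles: str
    mw: float               # molecular weight, Da
    qed: float              # drug-likeness
    sa: float               # synthetic accessibility
    logp: float             # octanol-water partition coefficient
    pic50: float            # predicted affinity for the target sequence
    toxic_endpoints: int    # number of endpoints predicted toxic


# Predicted pIC50 thresholds stated in the text.
AFFINITY_THRESHOLD = {"Mpro": 8.0, "spike_RBD": 7.0}


def passes_filters(c, target):
    """True if the candidate meets every threshold for the given target."""
    return (
        c.mw < 500.0
        and c.qed > 0.5
        and c.sa < 5.0
        and c.logp < 3.5
        and c.pic50 > AFFINITY_THRESHOLD[target]
        and c.toxic_endpoints == 0
    )


def selectivity(target_pic50, off_target_pic50s):
    """Excess predicted affinity over the mean of a random protein set.

    Using the mean as the reference is an assumption of this sketch.
    """
    return target_pic50 - sum(off_target_pic50s) / len(off_target_pic50s)
```

Because every criterion is cheap to evaluate, this filter can be run over all generated samples before any docking is attempted.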
Molecules that were predicted to have no toxicity at any of the toxicity endpoints were progressed in the workflow. We then ran docking simulations on a prioritized set of fewer than 1000 designed molecules with their respective target structures, as the docking energy can provide an indication of actual inhibition. For M pro , we used a monomer from the first structure determined and deposited with the Protein Data Bank for SARS-CoV-2 M pro complexed with the covalent inhibitor N3 (PDB ID: 6LU7 18 ) and set the search space to fully encompass the receptor. For spike, we used a lipid-bound conformation (PDB ID: 7Z3Z) and kept the protomer frozen during docking, as the goal is to find molecules that dock to the lipid-bound spike RBD. Our intent was to exploit the lipid binding pocket for developing inhibitors that can trap the spike protein in the closed conformation, as this is known to have reduced interaction with the host ACE2 receptor 28, 56 .

Docking was performed using AutoDock Vina 57 , run blindly over the entire protein structure with an exhaustiveness of 8 and repeated 5 times to find the optimal conformation. Compounds with a docking binding free energy of less than −8.4 kcal/mol with M pro were selected. For the generated spike compounds, we prioritized those that exhibited a binding free energy of less than −7.5 kcal/mol. Further, we only considered compounds that were docked less than 3.9 Å from the lipid binding pocket in the final docked configurations. The surface and ribbon representations of ligands docked (or bound) to the target structure were produced with PyMol 58 and the protein-ligand interaction plots were produced with LigPlot+ 59 . In contrast with large-scale screening techniques, docking is only used to provide additional validation of the binding affinity predictor model and therefore can be run after filtering candidates based on the easily computed properties described above.
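The docking selection rules above can be written as a short post-processing step over docking results. In this sketch, taking the lowest energy across the five repeated runs is one reading of "repeated 5 times to find the optimal conformation" and should be treated as an assumption; the function and dictionary names are illustrative, and the docking itself is assumed to have been run separately.

```python
# Energy cutoffs (kcal/mol) and pocket distance cutoff (angstrom) from the text.
DOCKING_CUTOFF = {"Mpro": -8.4, "spike_RBD": -7.5}
POCKET_DISTANCE_CUTOFF = 3.9


def select_docked(target, run_energies, pocket_distance=None):
    """Decide whether a docked compound is prioritized.

    run_energies: best binding free energy from each repeated docking run.
    pocket_distance: distance of the final pose from the lipid binding
        pocket; only used for the spike target.
    """
    best_energy = min(run_energies)  # assumption: keep the lowest-energy run
    if best_energy >= DOCKING_CUTOFF[target]:
        return False
    if target == "spike_RBD":
        return pocket_distance is not None and pocket_distance < POCKET_DISTANCE_CUTOFF
    return True
```

For example, a spike compound with a best energy of −7.6 kcal/mol is kept only if its final pose also lies within 3.9 Å of the lipid binding pocket.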
After this filtering, we were left with fewer than 1000 molecules combined between the two targets on which to run docking. Each simulation takes only a few minutes and can be run independently in parallel, which means the entire in silico screening can be performed in less than a day when run on a compute cluster consisting of Intel Xeon E5-2600 v2 processors.

We assessed synthesis plausibility for the novel compounds, as a major challenge in driving successes in molecular discovery is to devise plausible and efficient synthesis-planning protocols. Here we applied the recent advances made by machine learning-based approaches to predict retrosynthetic routes from large reaction databases. To estimate the ease of synthesizability and facilitate synthesis planning of the selected compounds, we predicted the retrosynthesis pathways for each candidate using the IBM RXN platform 15 . RXN combines a transformer neural network for forward reaction prediction and graph exploration techniques to evaluate retrosynthesis paths, scoring them according to probability. A path is terminated when all reagents are found to be commercially available. Candidates for which RXN was unable to determine a feasible retrosynthesis route, or whose routes terminated with non-commercially available compounds, were removed from consideration. For each prediction we used the following parameters: maximum single step reactions (depth), 6; minimum acceptance probability for a single step, 0.6; maximum number of pathways (beams), 10; number of steps between removal of low probability steps (pruning), 2; and maximum execution time, 1 hour. Commercial availability was determined by searching the eMolecules database 60 with a restriction on lead time of 4 weeks or less but no restriction on price. In the next section, we provide a detailed comparison between predicted retrosynthesis and actual synthesis routes, which is also summarized in SI Table 5 .
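The acceptance rule for a predicted route can be sketched as a check over a route summary. This does not call the RXN platform or reproduce its API: the settings keys, the route dictionary layout, and the `purchasable` set are hypothetical, and only the numeric values come from the parameters listed above.

```python
# Numeric values from the text; the key names are hypothetical.
RXN_SETTINGS = {
    "max_depth": 6,                # maximum single step reactions (depth)
    "min_step_probability": 0.6,   # minimum acceptance probability per step
    "max_pathways": 10,            # beams
    "pruning_interval": 2,         # steps between removal of low probability steps
    "max_time_hours": 1,
}


def route_is_acceptable(route, purchasable, settings=RXN_SETTINGS):
    """Check one predicted retrosynthesis route.

    route: dict with 'depth' (int), 'step_probabilities' (list of floats)
        and 'leaf_reagents' (list of identifiers for terminal reagents).
    purchasable: set of reagent identifiers considered commercially
        available (for example, within the stated 4-week lead time).
    """
    if not route["step_probabilities"]:
        return False
    if route["depth"] > settings["max_depth"]:
        return False
    if min(route["step_probabilities"]) < settings["min_step_probability"]:
        return False
    return all(reagent in purchasable for reagent in route["leaf_reagents"])


def candidate_is_synthesizable(routes, purchasable):
    """A candidate is kept if at least one predicted route is acceptable."""
    return any(route_is_acceptable(r, purchasable) for r in routes)
```

A candidate with no acceptable route would be dropped, mirroring the removal of compounds whose routes end in reagents that cannot be purchased.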
We considered three main aspects in the comparison: number of reaction steps leading to the final product, overlap of the products in the intermediate reaction steps, and overlap of reactants used in the reactions. We chose the best path from the top six predicted for comparison by optimizing first for product overlap and then for reactant overlap. Overall, the total number of actual reaction steps showed good agreement with predictions, generally only off by one or two steps. This was confirmed by the overlap of intermediate products, which showed that retrosynthesis often predicted the correct high-level path. Product overlap is highly variable, though, since there are relatively few per route (often only two or three). The actual synthesis routes even used many of the same reactants as predicted, although occasionally alternatives had to be found due to stock limitations. In general, the retrosynthesis prediction was used as a starting point and any "major" deviations required were considered a failure. In this section, we compare the retrosynthesis predictions to the actual routes used to synthesize the molecules: GEN727 was synthesized according to the best RXN-predicted method (Figure 8a ). The synthesis of GEN725 was carried out by analogy to the best RXN strategy (Figure 8b ). SNAr ester synthesis in DMF, gave intermediate compound 13 with high yield. Cross-coupling of 13 with sulfonamide-pinacolborane led to the final product with a moderate yield (see SI Section C.2 steps K-L for full details of the synthesis procedure). Several unsuccessful attempts were made to carry out the first step according to the retrosynthetic strategy for GEN626, which led to obtaining the desired intermediate with very low yield. As a result, the synthetic pathway was changed (Figure 8c ). SNAr reaction was carried out with cyanide 8, which was followed by hydrolysis of intermediate compound 10 (obtained with a moderate yield). 
Reduction of nitro-group of 11 led to GEN626 (see SI Section C.2 steps H-J). Unfortunately, following the pathway suggested by retrosynthesis for GEN777 didn't give good results and the synthetic strategy needed to be changed (Figure 8d ). We synthesized acyl chloride 5, which reacted with methyl amine on the next step. Thereafter, amide 6 was treated by PCl 5 and the resulting intermediate was reacted in situ with azide-anion (see SI Section C.2 steps D-G). Enamine did not have boc-amino pinacolborane 20 in stock and could not follow the proposed retrosynthetic strategy for GXA104 (Figure 8e ). Unprotected amino-pinacolborane was available and so the strategy was changed, which made it possible to obtain GXA104 in fewer steps. At first, 20 was reacted with carboxylic acid 19, which led to amide 21. Cross-coupling of 21 with 3-iodo-1H-indazole led to GXA104 (see SI Section C.2 steps P-Q). GXA56 was synthesized according to the top RXN-predicted method (Figure 8f ). GXA70 was synthesized by analogy to the best RXN-predicted method (Figure 8g ). Minor modifications were made to the synthetic steps, such as use of other bases and organic solvents (not significant for a whole scheme). The RXN strategy was chosen due to high reactivity of trichlorotriazine with amines and the need to substitute only one chlorine at the first stage (it is easier to be controlled with less nucleophilic aniline compared to more nucleophilic aliphatic secondary amines). The RXN-predicted strategy for GXA112 was followed as closely as possible. The last synthetic step (reaction with SO 2 (NH 2 ) 2 ) led to the final product with very low yield. To improve it, mono-Boc-protected SO 2 (NH 2 ) 2 was synthesized and reacted with 26. Boc-protected final product 30 was obtained and readily deprotected via TFA cocktail (see SI Section C.2 steps V-X). Spectroscopic characterization of synthesized de novo compounds can be found in SI Table 7 . 
Cloning, protein production, and crystallization

M pro production: The M pro coding sequence was codon optimised for expression in E. coli and synthesised by Integrated DNA Technologies (IDT). The M pro expression construct used for crystallization comprises an N-terminal GST region, an M pro autocleavage site, the M pro coding sequence, a hybrid cleavage site recognizable by 3C HRV protease and a C-terminal 6-Histidine tag 61 . The overall construct was flanked by In-Fusion compatible ends for insertion into BamHI-XhoI cleaved pGEX-6P-1 (Sigma). Protein expression, purification and crystallisation were carried out in conditions similar to those previously described in Douangamath, et al. 62 . Specifically, crystals were obtained from 0.1 M MES pH 6.5, 15 PEG4K, 5% DMSO using drop ratios of 0.15 µl protein, 0.3 µl reservoir solution and 0.05 µl seed stock.

Genetic constructs of spike ectodomain: The construct encodes amino acids 1-1208 of the SARS-CoV-2 spike glycoprotein ectodomain, with mutations of RRAR > GSAS at residues 682-685 (the furin cleavage site) and KV > PP at residues 986-987, as well as a T4 fibritin trimerisation domain, a HRV 3C cleavage site, a His-8 tag and a Twin-Strep-tag at the C-terminus, as reported by Wrapp, et al. 63 . All vectors were sequenced to confirm clones were correct.

Spike protein production: Recombinant spike ectodomain was expressed by transient transfection in HEK293S GnTI- cells (ATCC CRL-3022) for 9 days at 30°C. Conditioned media was dialysed against 2x phosphate buffered saline pH 7.4 buffer. The spike ectodomain was purified by immobilized metal affinity chromatography using Talon resin (Takara Bio) charged with cobalt, followed by size exclusion chromatography using a HiLoad 16/60 Superdex 200 column in 150 mM NaCl, 10 mM HEPES pH 8.0, 0.02% NaN 3 at 4°C.

Comparison between actual and predicted synthesis routes. For each subfigure, the top reaction (enclosed in a box) is the actual synthesis procedure used in this study while the bottom reaction is the predicted retrosynthetic pathway.

Compounds were dissolved in DMSO and directly added to the crystallization drops, giving a final compound concentration of 10 mM and a DMSO concentration of 10%. The crystals were left to soak in the presence of the compounds for 1-2 hours before being harvested and flash cooled in liquid nitrogen without the addition of further cryoprotectant. X-ray diffraction data were collected on beamline I04-1 at Diamond Light Source and automatically processed using the Diamond automated processing pipelines 64 . Analysis was performed as outlined previously 62 . Briefly, XChemExplorer 65 was used to analyse each processed dataset that was automatically selected, and electron density maps were generated with Dimple 66 . Ligand-binding events were identified using PanDDA 67 , and ligands were modelled into PanDDA-calculated event maps using Coot 68 annotations cross-reviewed.

The solid phase extraction C4-cartridge coupled RapidFire 365 Mass Spectrometry (SPE RFMS) based high throughput dose response assay has been described 16 . In brief, M pro inhibitors were dry dispensed in an 11-point 3-fold dilution series using an acoustic liquid transfer robot (Labcyte 550) in 384-well polypropylene plates (Greiner Bio-One). M pro (0.3 µM) was dispensed across the wells (25 µL/well) using a Multidrop™ Combi (Thermo Scientific™) and the reaction incubated at ambient temperature. Compounds were incubated with the protein for 15 minutes, following which an 11-mer substrate peptide TSAVLQ/SGFRK-NH 2 (4 µM) was dispensed (25 µL/well) for probing inhibition activity.
The reaction was quenched by addition of 10% aqueous formic acid (5 µL/well) after 10 min incubation with the substrate at ambient temperature. After addition of each reagent, the plates were centrifuged for 30 s (Axygen Plate Spinner Centrifuge). Samples were analysed by a RapidFire (RF) 365 high-throughput sampling robot (Agilent) connected to an iFunnel Agilent 6550 accurate mass quadrupole time-of-flight (Q-TOF) mass spectrometer (operating parameters: capillary voltage (4000 V), nozzle voltage (1000 V), fragmentor voltage (365 V), drying gas temperature (280°C), gas flow (13 L/min), sheath gas temperature (350°C), sheath gas flow (12 L/min)). The peptide/protein sample was loaded onto a solid-phase extraction (SPE) C4-cartridge and washed with 0.1% (v/v) aqueous formic acid to remove non-volatile buffer salts (5.5 s, 1.5 mL/min) prior to elution with aqueous 85% (v/v) acetonitrile containing 0.1% (v/v) formic acid (5.5 s, 1.25 mL/min). The cartridge was re-equilibrated with 0.1% (v/v) aqueous formic acid (0.5 s, 1.25 mL/min) and the sample aspirator was washed with an aqueous, organic and aqueous wash before the injection of the next protein:peptide mixture sample onto the SPE cartridge. Data were extracted with RapidFire Integrator software (Agilent), and m/z (+1) was used for both the N-terminal fragment TSAVLQ (681.34 Da) and the 11-mer substrate peptide (1191.68 Da). The percentage M pro activity (N-terminal product peak integral / (N-terminal product peak integral + substrate peak integral) × 100) was calculated in Microsoft Excel and the normalised data were transferred to Prism 9 for non-linear regression curve analysis. IC 50 values are reported as the mean of technical duplicates (n = 2; mean ± SD). Signal to noise (S/N) and Z'-factor were calculated in Microsoft Excel (Z' > 0.8) 16 .
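The two plate-level quantities above are simple to compute. The sketch below implements the percentage activity exactly as written in the text, together with the commonly used Z'-factor definition, Z' = 1 − 3(σ_pos + σ_neg)/|μ_pos − μ_neg|. The text does not spell out the Z' formula, so using this definition with sample standard deviations is an assumption, and the function names are illustrative.

```python
from statistics import mean, stdev


def percent_activity(product_integral, substrate_integral):
    """Percentage Mpro activity from product and substrate peak integrals."""
    return 100.0 * product_integral / (product_integral + substrate_integral)


def z_prime(positive_controls, negative_controls):
    """Z'-factor from control wells (standard definition; an assumption here).

    Each argument needs at least two values so that a sample standard
    deviation can be computed.
    """
    spread = stdev(positive_controls) + stdev(negative_controls)
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1.0 - 3.0 * spread / window
```

With uninhibited control wells reading 98, 100 and 102 and fully inhibited wells reading 1, 2 and 3, `z_prime` gives 1 − 9/98, which is above the 0.8 quality threshold quoted in the text.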
Thermofluor (differential scanning fluorimetry, DSF) experiments were performed in triplicate in 96-well white PCR plates using a 1300-fold excess of small molecule (in DMSO) to 1.5 µg spike monomer in 50 µL buffer per well. An Agilent MX3005p RT-PCR instrument (λ ex 492 nm/λ em 585 nm) was used to monitor the fluorescence change of a 3x final concentration of SYPRO Orange dye (Thermo) in an "increasing-sawtooth" temperature profile where the temperature was increased in 1°C increments from 25°C to 98°C with the fluorescence recorded at 25°C. Four of the synthesised compounds were investigated using Thermofluor assay to assess effect upon stability. Several conditions were tested: in 20 mM sodium acetate pH 4.6 150 mM NaCl, a storage buffer at which long term stability was observed to be much improved 73 ; in 50 mM HEPES pH 7.5, 200 mM NaCl immediately after buffer exchange from the storage buffer; after incubation overnight at pH 7.5; and after incubation overnight at pH 7.5 in the presence of the compound. Raw fluorescence data were analysed using Microsoft Excel and the JTSA software 74 using a 5-parameter model to produce melting temperature (T m ) values. Note that fresh spike protein exhibits a single melting transition which can be characterised as a melting point, T m , of 65°C in neutral pH buffer. At a reduced pH 4.6 the single melting transition is at 62°C. As spike is incubated in pH 7.5 a second transition appears at a lower temperature with a T m of 50°C. This transition increases as a proportion of the total melt until it is the only transition observed and correlates with a presumed conformational change of the spike trimer to a less stable form. Vero-CCL-81 cells (100,000 cells per well) were seeded in a 96-well, cell culture-treated, flat-bottom microplates for 48 hrs. Compounds were serially diluted and incubated with approximately 100 foci of SARS-CoV-2 for 1 hr at 37°C. 
The mixtures were added to the cells and incubated for a further 2 hrs at 37°C, followed by the addition of 1.5% semi-solid carboxymethyl cellulose (CMC) overlay medium to each well to limit virus diffusion. Twenty hours after infection, cells were fixed and permeabilized with 4% paraformaldehyde and 2% Triton-X 100, respectively. The virus foci were stained with human anti-NP mAb (mAb206) and peroxidase-conjugated goat anti-human IgG (A0170; Sigma), and visualized by adding TrueBlue Peroxidase Substrate. Virus-infected cell foci were counted on the classic AID EliSpot reader using AID ELISpot software. The percentage of focus reduction was calculated by comparing the number of foci in treated wells with the number in untreated control wells, and IC 50 was determined using the probit program from the SPSS package.

Pseudotyped lentiviral particles expressing SARS-CoV-2 S protein were incubated with serial dilutions of compounds in white opaque 96-well plates for 1 hr at 37°C. Stable HEK293T/17 cells expressing human ACE2 were then added to the mixture at 15000 cells per well. Plates were spun at 500 RCF for 1 min and further incubated for 48 hrs. Finally, culture supernatants were removed, followed by the addition of the Bright-Glo™ Luciferase assay system (Promega, USA). The reaction was incubated at room temperature for 5 mins and the firefly luciferase activity was measured using a CLARIOstar ® (BMG Labtech). The percentage of neutralization of compounds towards pseudotyped lentiviruses was calculated relative to the untreated control, and IC 50 was determined using the probit program from the SPSS package.

Details of the generated molecules used in this study are available from CogMol Molecule Explorer 10 . Crystal structures of machine-identified small molecules bound to M pro derived in this study have been deposited in the Protein Data Bank (PDB ID: 5SML for M pro -Z68337194, 5SMM for M pro -Z1633315555, and 5SMN for M pro -Z1365651030).

Table 5 .
Consolidated results comparing predicted and actual synthesis paths. The top 6 predicted retrosynthesis paths (by confidence) are considered and the path with the best agreement is shown. "Steps" is the number of actual reaction steps / the predicted number of reaction steps. "Products" shows the overlap of intermediate reaction products (not including the final molecule) in terms of recall (with respect to the predicted path), while "reactants" similarly shows the overlap of reactants from all steps in terms of recall. The "success" column shows whether the given predicted path was successfully synthesized as is or with minor changes, or failed (but was still synthesized via an alternative method devised by Enamine).

Figure 9 . Thermofluor assay results. Thermofluor raw fluorescence data for experiments with AI-designed compound GEN727 (black) and a DMSO control (grey). Data were recorded using protein that was used immediately after dilution into neutral buffer (solid lines), incubated overnight in neutral buffer (long-dashed lines), or incubated overnight with the compound in neutral buffer (short-dashed lines). For comparison, data from protein in pH 4.6 buffer are also shown (dotted lines).

Require: Trained latent variable model (e.g. VAE), samples $z_j$ drawn from domain of interest, labeled samples for each attribute $a_i$.
1: Encode training data $x_j$ in latent space: $z_{j,k} \sim q_\phi(z|x_j)$ for $k = 1, \ldots, K$
2: Use $z_{j,k}$ to fit explicit density model $Q_\xi(z)$ to approximate marginal posterior $q_\phi(z)$
3: Train classifier models $q_\xi(a_i|z)$ using labeled samples for each attribute $a_i$ to approximate probability $p(a_i|x)$
4: Assuming attributes $a_i$ are conditionally independent given $z$, then via Bayes' rule.
Sample from $Q_\xi(z)$
8: Accept with probability $f(z) / (M g(z)) = \prod_i q_\xi(a_i|z) \le 1$
9: if Accepted then

Step M: To the solution of compound 14 (2.0 g, 10.8 mmol, 1 eq) in 30 mL of dichloromethane cooled to 0°C, 1.2 equivalents of DIPEA were added dropwise under continuous stirring. Thereafter, 1 eq of 2,3-dihydro-1H-inden-5-amine dissolved in 10 mL of dichloromethane was added. The resulting mixture was stirred at ambient temperature overnight. Thereafter, the resulting solution was washed with water, 3 × 20 mL. Then the organic layer was dried over anhydrous sodium sulfite and evaporated in vacuo. The resulting compound 15 with 90% purity was used in the next step without additional purification. Yield 92%, 2.8 g.

Step N: To the solution of compound 15 (2.8 g, 9.7 mmol, 1 eq) in 40 mL of dichloromethane, 2.2 equivalents of DIPEA were added dropwise at 0°C under continuous stirring. The resulting solution was stirred for an additional 30 min and then 4,4-difluoropiperidine hydrochloride was added portionwise (1.1 eq). The resulting mixture was left to stir at ambient temperature overnight. The next day, the reaction solution was washed with water, 3 × 20 mL. The resulting organic layer was dried with anhydrous sodium disulfite and evaporated under reduced pressure. The resulting product 16 with 90%+ purity was used in the next step without any additional purification. Yield 91%, 3.3 g.

Step O: To the solution of compound 16 (3.3 g, 9.1 mmol, 1 eq) in 40 mL of DMF cooled to 0°C, 1.2 eq of DIPEA was added dropwise under stirring. Then the mixture was stirred for an additional 30 min and 1.05 eq of the corresponding amine in 10 mL of DMF was added. The resulting reaction mixture was stirred at 80°C overnight. Thereafter, all volatiles were evaporated in vacuo and the residue was washed with water twice. The resulting precipitate was dissolved in 50 mL of dichloromethane, dried with anhydrous sodium sulfate and filtered through a Celite pad.
Resulting filtrate was evaporated under reduced pressure to give GXA70 with 95% purity. Yield 70%, 2.7 g. Step P: To a solution of compound 19 (0.975 g, 5.70 mmol), compound 20 (1.21 g, 5.18 mmol) and HOBt (0.775 g, 5.70 mmol) in dry DMA (10 mL), cooled to 0°C, was added dropwise EDC (0.964 g, 6.31 mmol) and the reaction mixture was stirred overnight at r.t., diluted with water and extracted with ethyl acetate. The combined organic layers were washed with High-throughput screening for the discovery of enzyme inhibitors Estimation of the size of drug-like chemical space based on GDB-17 data Innovation in the pharmaceutical industry: new estimates of R&D costs Inverse design in search of materials with target functionalities Generative deep learning for targeted compound design Deep learning enables rapid identification of potent DDR1 kinase inhibitors De novo design of bioactive small molecules by artificial intelligence Combining generative artificial intelligence and on-chip synthesis for de novo drug design Cogmol: Target-specific and selective drug design for covid-19 using deep generative models Smiles, a chemical language and information system. 1. 
introduction to methodology and encoding rules Unified rational protein engineering with sequence-based deep representation learning Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy Structure of mpro from SARS-CoV-2 and discovery of its inhibitors The oral protease inhibitor (pf-07321332) protects syrian hamsters against infection with sars-cov-2 variants of concern Open science discovery of oral non-covalent sars-cov-2 main protease inhibitor therapeutics Ultralarge virtual screening identifies sars-cov-2 main protease inhibitors with broad-spectrum activity against coronaviruses Oral administration of s-217622, a sars-cov-2 main protease inhibitor, decreases viral load and accelerates recovery from clinical aspects of covid-19 Potent noncovalent inhibitors of the main protease of sars-cov-2 from molecular sculpting of the drug perampanel guided by free energy perturbation calculations Molnupiravir, an oral antiviral treatment for covid-19 Elementary mathematical theory of classification and prediction Extended-connectivity fingerprints Near-physiological-temperature serial crystallography reveals conformations of sars-cov-2 main protease active site for improved drug repurposing Free fatty acid binding pocket in the locked structure of sars-cov-2 spike protein Covid moonshot: open science discovery of sars-cov-2 main protease inhibitors by combining crowdsourcing, high-throughput experiments, computational simulations, and machine learning Discovery of sars-cov-2 main protease inhibitors using a synthesis-directed de novo design model Discovery of s-217622, a non-covalent oral sars-cov-2 3cl protease inhibitor clinical candidate for treating covid-19 Alphafold accelerates artificial intelligence powered drug discovery: Efficient discovery of a novel cyclindependent kinase 20 (cdk20) small molecule inhibitor 
Accurate prediction of protein structures and interactions using a three-track neural network Tuning artificial intelligence on the de novo design of natural-productinspired retinoid x receptor modulators Entangled conditional adversarial autoencoder for de novo drug discovery Adversarial threshold neural computer for molecular de novo design Automated design and optimization of multitarget schizophrenia drug candidates by deep learning A novel machine learning approach uncovers new and distinctive inhibitors for cyclin-dependent kinase 9 On the opportunities and risks of foundation models Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences Sars-cov-2 omicron-b. 1.1. 529 leads to widespread escape from neutralizing antibody responses Sampleefficient generation of novel photo-acid generator molecules using a deep generative model Inverse design of 3d molecular structures with conditional generative neural networks Augmenting molecular deep generative models with topological data analysis representations Optimizing molecules using efficient queries from property evaluations Generating sentences from a continuous space Molecular sets (MOSES): A benchmarking platform for molecular generation models ZINC-a free database of commercially available compounds for virtual screening Automatic chemical design using a data-driven continuous representation of molecules BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology Deepaffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks RDKit: Open-source cheminformatics Explaining chemical toxicity using missing features Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs Moleculenet: a benchmark for molecular machine learning The sars-cov-2 spike 
harbours a lipid binding pocket which modulates stability of the prefusion trimer AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading The PyMOL molecular graphics system Ligplot+: multiple ligand-protein interaction diagrams for drug discovery Production of authentic sars-cov mpro with enhanced activity: application as a novel tag-cleavage endopeptidase for protein overproduction Crystallographic and electrophilic fragment screening of the SARS-CoV-2 main protease Cryo-em structure of the 2019-ncov spike in the prefusion conformation Achieving efficient fragment screening at XChem facility at diamond light source The xchemexplorer graphical workflow tool for routine or large-scale protein-ligand structure determination Overview of the ccp4 suite and current developments A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density Features and development of coot AceDRG: a stereochemical description generator for ligands Refinement of macromolecular structures by the maximum-likelihood method United Kingdom: Global Phasing Ltd Cryo-EM structures of SARS-CoV-2 spike without and with ACE2 reveal a pH-dependent switch to mediate endosomal positioning of receptor-binding domains JavaScript Thermal Shift Analysis Software A.5 Spectroscopic data GEN727 1 H NMR (400 MHz, dmso) δ 8.37 (d, J = 5.3 Hz, 1H), 8.21 (d, J = 8.4 Hz, 1H), 7.76 (d, J = 8.4 Hz, 1H), 7.59 (t, J = 7.6, 7.6 Hz, 1H) 400 MHz, dmso) δ 7.41 (m, 1H), 7.37 (m, 2H), 7.19 (d, J = 7.5 Hz, 1H) 400 MHz, dmso) δ 7.63 (d, J = 8.6 Hz, 1H), 7.28 (br s, 1H) Hz, 2H), 2.83 (m, 2H), 2.60 (m, 2H), 1.98 (m, 2H), 1.74 (m, 2H) 400 MHz, dmso) δ 7.97 (dd, J = 7.9, 1.8 Hz, 1H), 7.90 (m, 4H) GXA104 1 H NMR (400 MHz, dmso) δ 13.34 (s, 1H), 8.08 (d, J = 8.3 Hz, 1H), 8.01 (d, J = 8.3 Hz, 1H), 7.87 (s, 1H), 7.61 (d, J = 8.3 Hz, 2H), 7.42 (m, 1H), 7.36 (d, J = 7.7 Hz, 1H), 7.23 (m, 1H) The 
solids were removed via filtration and the solvent was removed in vacuo. The residue was diluted with an aqueous NaHSO4 solution (50 mL) and washed with dichloromethane (2 × 20 mL); the aqueous layer was basified with NaOH to pH 14 and extracted with dichloromethane (3 × 30 mL). The organic extracts were combined, dried over Na2SO4 and concentrated in vacuo to obtain crude 2 (0.4 g), which was used in the next step without purification.

Step B: Crude compound 2 (0.4 g) was dissolved in methanol (10 mL) and a hydrogen chloride solution in dioxane (20 mL) was added. The reaction mixture was stirred

330 g, 2 mmol) and DIPEA (0.65 g, 5 mmol) were added to the solution. The reaction mixture was stirred at 100°C for 48 h and purified via preparative HPLC to obtain GEN727 (2 fractions: 0.0257 g and 0.0278 g, overall yield 9%) as a brown solid

2 mmol) was added to a solution of compound 4 (1.7 g, 6.6 mmol) in dichloromethane (10 mL) and the mixture was refluxed for 1 h and evaporated under reduced pressure to give compound 5.

Step E: To a saturated solution of aqueous methylamine (5 g)

After the completion of the reaction was confirmed, the resulting mixture was extracted with MTBE. The combined organic layers were washed with brine, dried over anhydrous Na2SO4 and evaporated under reduced pressure to obtain 1 g of compound 6, which was used in the next step without further purification

Step G: To the solution of compound 7 in dichloromethane (from Step F) was added TMSN3 (2.5 g, 21.7 mmol). The reaction mixture was stirred overnight at r.t. and evaporated under reduced pressure

MeOH 10:1 as eluent) to afford 10 (0.18 g, 0.55 mmol, 18% yield) as a yellow oil.

Step I: Compound 10 (0.18 g, 0.55 mmol) was suspended in conc. H2SO4 (5 mL) and the reaction mixture was heated to 60°C for 2 h, cooled with ice and diluted with an aqueous Na2CO3 solution to basic pH.
The resulting mixture was extracted with ethyl acetate (3 × 30 mL); the organic layer was dried over Na2SO4 and concentrated in vacuo to obtain 11 (0.16 g, 0.46 mmol, 84% yield) as a yellow solid.

Step J: To a solution of compound 11 (0.16 g, 0.46 mmol) in methanol (10 mL), Pd/C (10%w, 0.100 g) was added. The reaction mixture was evacuated and backfilled with hydrogen and then stirred for 18 h

60% dispersion in mineral oil) in DMF (5 mL) was added dropwise a solution of 4-bromophenol (1.09 g, 6.31 mmol) in DMF (5 mL). The mixture was stirred for 1 h and compound 12 (1 g, 5.74 mmol) was added. The reaction mixture was stirred at 100°C overnight, cooled to r.t. and poured into ice (100 mL). The precipitate was filtered and washed with water (3 × 10 mL) and with hexanes. The solid was dried

Step L: To a mixture of compound 13 (1 g, 3.06 mmol 67 mmol) and sodium carbonate (0.81 g, 7.65 mmol) in a mixture of dioxane and water (9:1, 10 mL) was added 36 mmol) under an inert atmosphere. The reaction mixture was stirred for 16 h at 95°C (oil bath), cooled to r.t., diluted with water (10 mL) and extracted with EtOAc (2 × 10 mL). The combined organic layers were dried over Na2SO4 and evaporated under reduced pressure

dried over anhydrous Na2SO4 and evaporated under reduced pressure. The residue was crystallized from the minimum amount of ethyl acetate to obtain 1

Pd(PPh3)4 (0.061 g, 0.05 mmol) and Na2CO3 (0.225 g, 2.13 mmol) in a mixture of dioxane/water (4:1) (5 mL) was stirred overnight at 90°C under an argon atmosphere. The cooled mixture was diluted with water and extracted with dichloromethane. The combined organic layers were washed with water, dried over anhydrous Na2SO4 and evaporated under reduced pressure. The residue was purified by column chromatography to obtained by HPLC to afford

0 mL) and the resulting mixture was stirred at r.t. for 16 h.
After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na2SO4 and evaporated to obtain crude product 23 (1.1 g), which was used in the next step without further purification.

Step S: To a stirred solution of compound 23 (1.1 g, 4 mmol) in dichloromethane (40 mL) at 0°C were added DIPEA (0.86 mL, 4.94 mmol) and tert-butyl N-[4-(methylamino)cyclohexyl]carbamate (0.94 g) and the resulting mixture was stirred at r.t. for 16 h. After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na2SO4 and evaporated under reduced pressure to obtain crude product 24 (1.5 g), which was used in the next step without further purification

25 mmol) and the resulting mixture was stirred at r.t. for 16 h. After that an additional amount of DIPEA (0.68 mL, 3.90 mmol) and morpholine (0.28 mL, 3.25 mmol) was added and the resulting mixture was stirred at r.t. for another 16 h. Then the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na2SO4 and evaporated under reduced pressure to obtain crude product

3 mmol) in dichloromethane (25 mL) was added 4 M HCl solution in dioxane and the resulting mixture was stirred at r.t. for 8 h. After that the reaction mixture was evaporated under reduced pressure to obtain crude product 26 (1.2 g), which was used in the next step without further purification

Step W: To a stirred suspension of compound 26 (0.8 g, 1.7 mmol) in dichloromethane (10 mL) at 0°C was added Et3N (0.76 mL, 5.45 mmol) followed by a solution of compound 29 in dichloromethane (3 mL) and the resulting mixture was stirred at r.t. for 16 h. After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na2SO4 and evaporated under reduced pressure to obtain crude product 30 (0.8 g), which was used in the next step without further purification.
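The reagent quantities quoted in the steps above (a liquid volume paired with a millimole amount, and an isolated amount paired with a percent yield) can be cross-checked with simple arithmetic. The following is a minimal sketch and not part of the original procedures: the function names are hypothetical, and the densities and molar masses of DIPEA and Et3N are approximate literature values assumed here, since the text does not state them.

```python
# Illustrative consistency check for reagent quantities; not from the source.
# ASSUMPTION: densities (g/mL) and molar masses (g/mol) below are approximate
# literature values for the neat liquids; the text itself does not give them.
ASSUMED_CONSTANTS = {
    "DIPEA": {"density": 0.742, "molar_mass": 129.25},
    "Et3N": {"density": 0.726, "molar_mass": 101.19},
}


def volume_to_mmol(volume_ml, density_g_per_ml, molar_mass_g_per_mol):
    """Convert a volume of neat liquid reagent (mL) to an amount in mmol."""
    mass_g = volume_ml * density_g_per_ml
    return 1000.0 * mass_g / molar_mass_g_per_mol


def percent_yield(product_mmol, limiting_reagent_mmol):
    """Percent yield from product and limiting-reagent amounts (1:1 stoichiometry assumed)."""
    return 100.0 * product_mmol / limiting_reagent_mmol


def reagent_mmol(name, volume_ml):
    """Look up the assumed constants for a named reagent and convert mL to mmol."""
    c = ASSUMED_CONSTANTS[name]
    return volume_to_mmol(volume_ml, c["density"], c["molar_mass"])


if __name__ == "__main__":
    # Step S quotes DIPEA as 0.86 mL, 4.94 mmol.
    print(f"DIPEA 0.86 mL -> {reagent_mmol('DIPEA', 0.86):.2f} mmol")
    # Step W quotes Et3N as 0.76 mL, 5.45 mmol.
    print(f"Et3N 0.76 mL -> {reagent_mmol('Et3N', 0.76):.2f} mmol")
    # Step I: 0.55 mmol of compound 10 gave 0.46 mmol of compound 11.
    print(f"Step I yield -> {percent_yield(0.46, 0.55):.0f}%")
```

With the assumed constants, the computed amounts agree with the quoted values to two decimal places, and the Step I yield rounds to the reported 84%. The yield helper assumes 1:1 stoichiometry between the limiting reagent and the product.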
Step X: To a stirred solution of compound 30 (0.8 g, 1.4 mmol) in dichloromethane (5 mL) was added 4 M HCl solution in dioxane (1 mL) and the resulting mixture was stirred at r.t. for 8 h. Then the reaction mixture was evaporated under reduced pressure, the obtained residue was diluted with water, basified with a NaHCO3 solution and extracted with dichloromethane. The combined organic phase was washed with water

The resulting mixture was stirred at 60°C overnight. The precipitate that formed was filtered off, dissolved in water and acidified with sodium hydrosulphate to pH 2, then stirred for 20 min and filtered to obtain compound 32 as a yellow solid. Yield 66%, 1.8 g.

Step Z: To compound 32 (1.8 g, 1 eq) in 15 mL of POCl3 was added 0.15 mL of DIPEA and the resulting mixture was stirred at reflux for 3 h. The resulting mixture was evaporated and quenched with ice and a saturated solution of anhydrous potassium carbonate up to pH 12. Then the solution was left to stir at ambient temperature for 20 min. The resulting precipitate was filtered off and washed with water several times to obtain compound 33. Yield 26%, 0.53 g.

Step AA: 1-Methyl-1H-pyrazol-3-amine (0.175 g, 1 eq), sodium iodide (0.27 g, 1 eq) and DIPEA (0.46 g, 2 eq) were added successively to a solution of compound 33 (0.5 g, 1 eq) in 10 mL of dry DMF. The resulting mixture was stirred at 80°C overnight. After the mixture was cooled to r.t. and diluted with water, the precipitate that formed was filtered and washed with water to give compound 34. Yield 58%, 0.35 g.

Step BB: Compound 34 (0.35 g, 1 eq) together with piperazine (0.17 g, 2 eq) and anhydrous potassium carbonate (0.27 g, 2 eq) was mixed in 15 mL of dry DMF and heated up to 120°C overnight

This work was supported by the IBM Science for Social Good program. D.I.S. is supported by the UKRI MRC (MR/N00065X/1) and is a Jenner Investigator. This is a contribution from the UK Instruct-ERIC Centre.
The authors thank the IBM COVID-19 Molecular Explorer team for help with open sourcing machine-designed inhibitor candidates, the IBM RXN team for help with retrosynthesis predictions, Diamond for beamtime through the COVID-19 dedicated call, and the Diamond MX group and the IBM-Oxford partnership for valuable support.