key: cord-0674530-w3bl1d22 authors: Joshi, Rajendra P.; Kumar, Neeraj title: Artificial Intelligence based Autonomous Molecular Design for Medical Therapeutic: A Perspective date: 2021-02-10 journal: nan DOI: nan sha: bb6a7f1dedacda7d64710c17f27ec1ecce4a0a89 doc_id: 674530 cord_uid: w3bl1d22
Domain-aware machine learning (ML) models have been increasingly adopted for accelerating small molecule therapeutic design in recent years. These models have been enabled by significant advances in state-of-the-art artificial intelligence (AI) and computing infrastructure. Several ML architectures are predominantly and independently used either for predicting the properties of small molecules or for generating lead therapeutic candidates. Using these individual components synergistically and autonomously in closed loops, together with robust representation and data generation techniques, holds enormous promise for accelerating drug design, which is otherwise a time-consuming and expensive task. In this perspective, we present the most recent breakthroughs achieved by each of the components and discuss how such an autonomous AI and ML workflow can be realized to radically accelerate hit identification and lead optimization. Taken together, this could shorten the timeline for end-to-end antiviral discovery and optimization to weeks upon the arrival of a novel zoonotic transmission event. Our perspective serves as a guide for researchers to practice autonomous molecular design in therapeutic discovery.
lot of resources [4, 5, 6, 7]. In this scenario, complete automation of laboratories is long overdue and has been attempted with only limited success in the past [8, 9, 10, 11]. Automating the computational design of molecules by integrating physics-based simulations and optimization with ML approaches is a feasible and efficient alternative; it can contribute significantly to accelerating autonomous molecular design.
High-throughput quantum mechanical calculations, such as efficient density functional theory (DFT) based simulations, are the first step towards this goal of providing insight into a larger chemical space and have shown some promise in accelerating novel molecule discovery. However, these calculations still require human intelligence for different decision-making processes and cannot autonomously guide the steps of small molecule therapeutic discovery, which slows down the entire process. Additionally, inverse design of molecules is notoriously difficult with DFT alone. The amount of data produced by these high-throughput methods is so large that it cannot be analyzed in real time with conventional methods. Autonomous computational design and characterization of molecules is even more important in scenarios where existing experimental or computational approaches are inefficient [12, 13]. One particular example is the challenge of identifying new metabolites in a biological sample from mass spectrometry data, which requires mapping the fragmented spectra of novel molecules to an existing spectral library, making it slow and tedious. In many cases, such reference libraries do not exist, and a machine learning integrated, automated workflow could be an ideal choice for rapid identification of metabolites as well as for expanding the existing libraries for future reference. Such a workflow has shown an early ability to quickly screen molecules and accurately predict their properties for different applications. The synergistic use of high-throughput methods in a closed loop with machine learning based methods capable of inverse design is considered vital for autonomous and accelerated discovery of molecules [11].
In this perspective, we discuss how computational workflows for autonomous molecular design can guide the goal of laboratory automation, and we review the current state of the art in artificial intelligence (AI) guided autonomous molecular design, focusing mainly on small molecule therapeutic discovery. The workflow for computational autonomous molecular design (CAMD) must be an integrated, closed-loop system with (i) efficient data generation and extraction tools, (ii) robust data representation techniques, (iii) physics-based predictive machine learning models, and (iv) tools to generate new molecules using the knowledge learned from steps i-iii. Ideally, an autonomous computational workflow for molecule discovery would learn from its own experience and adjust its functionality as the chemical environment or the targeted functionality changes. This can be achieved when all the components work in collaboration with each other, providing feedback and improving model performance as we move from one step to the next.
For data generation in CAMD, high-throughput density functional theory (DFT) [14, 15] is a common choice, mainly because of its reasonable accuracy and efficiency [16, 17]. DFT takes 3D structures as input and predicts the properties of interest. Data generated from DFT simulations is processed to extract the more [...]
[...] attempted by more robustly encoding rings and branches of molecules to find more concrete representations with high semantic and syntactic validity using canonical SMILES [48, 49], InChI [41, 42], SMARTS [50], DeepSMILES [51], DESMILES [52], etc. More recently, Krenn et al. proposed a 100% syntactically correct and robust string-based representation of molecules known as SELFIES [46], which has been increasingly adopted for predictive and generative modelling [53].
Recently, molecular representations that can be iteratively learned directly from molecules have increasingly gained adoption, mainly for predictive molecular modeling, achieving chemical accuracy for a range of properties [31, 54, 55]. Such representations are more robust and outperform expert-designed representations in drug discovery [56]. For representation learning, different variants of graph neural networks are a popular choice [34, 57]. The process starts with generating atom (node) and bond (edge) features for all the atoms and bonds within a molecule. These features are then iteratively updated using graph traversal algorithms that take the chemical environment into account, yielding a robust molecular representation. The starting atom and bond features may simply be one-hot encoded vectors that include only the atom type and bond type, or a list of atom and bond properties derived from SMILES strings. Yang et al. achieved chemical accuracy for predicting a number of properties with their ML models by combining the atom and bond features of molecules with global state features before they are updated during the iterative process [58].
Molecules are 3-dimensional, multiconformational objects, and hence it is natural to assume that they can be well represented by their nuclear coordinates, as is the case in quantum mechanics based molecular simulations [59]. However, a coordinate-based representation of molecules is non-invariant, non-invertible and non-unique [32] and is hence not commonly used in conventional machine learning. In addition, coordinates by themselves do not carry information about key attributes of a molecule such as bond types, symmetry, spin states, and charge.
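The non-invariance of raw coordinates noted above can be illustrated with a small sketch. It rotates a molecule and compares the raw coordinates with the matrix of interatomic distances. The water geometry is approximate and purely illustrative, and the helper names `rotate_z` and `distance_matrix` are assumptions made for this example, not part of any cited method.

```python
import math

def rotate_z(coords, angle):
    """Rotate every (x, y, z) point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def distance_matrix(coords):
    """Pairwise Euclidean distances between all atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]

# Approximate water geometry in angstroms: O, H, H (illustrative values only).
water = [(0.000, 0.000, 0.117), (0.000, 0.757, -0.469), (0.000, -0.757, -0.469)]
rotated = rotate_z(water, math.pi / 3)

# The raw coordinates differ after the rotation ...
coords_changed = any(
    abs(p - q) > 1e-6 for a, b in zip(water, rotated) for p, q in zip(a, b)
)
# ... while the interatomic distances are unchanged.
distances_same = all(
    abs(p - q) < 1e-9
    for row_a, row_b in zip(distance_matrix(water), distance_matrix(rotated))
    for p, q in zip(row_a, row_b)
)
```

The same molecule therefore yields different coordinate vectors but an identical distance matrix. The distance matrix still depends on the order in which atoms are listed, which is one reason learned, permutation-invariant representations are preferred.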
Several approaches and architectures have been proposed to create robust, unique, and invariant representations from nuclear coordinates using atom-centered Gaussian functions, tensor field networks, and, more robustly, representation learning techniques [60, 61, 62, 55, 31, 63]. Chen et al. [31] achieved chemical accuracy for predicting a number of properties with their ML models by combining the atom and bond features of molecules with global state features of the molecules before they are updated during the iterative process. A robust representation of molecules can also be learned from only the nuclear charges and coordinates of molecules, as demonstrated by Schütt et al. [62, 55, 60]. Different variants (see Table 1) of message passing neural networks for representation learning have been proposed, with the main differences being how messages are passed between the nodes and edges and how they are updated during the iterative process using hidden states h_v^t. Hidden states at each node during the message passing phase are updated using [...] models, information is propagated back and forth in the molecule in the form of waves, making it possible to pass the information locally while simultaneously traversing the entire molecule in a single pass. With the unprecedented success of learned molecular representations for predictive modelling, they have also been adopted with success for generative models [54, 66].
Predictive modeling is the most widely studied area of applied machine learning in molecular modeling, drug discovery and medicine [67, 68, 69, 70, 71, 72, 62, 60, 55, 73]. Depending upon whether the ML architecture requires pre-defined representations as input features or can learn its own input representation, predictive modeling can be broadly classified into two sub-categories. The former is well covered in several recent review articles [67, 68, 69, 70, 71, 72].
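The message passing scheme described above can be sketched in a few lines. In the commonly used formulation, each node first aggregates messages from its neighbors and then updates its hidden state from that aggregate. The sketch below is a deliberately simplified illustration: the message and update functions are fixed sums rather than learned networks, and the function names, the four-element vocabulary, and the bond-order weighting are all assumptions made for this example.

```python
def one_hot(symbol, vocabulary):
    """One-hot encode an atom type as the initial hidden state."""
    return [1.0 if symbol == s else 0.0 for s in vocabulary]

def message_passing(atoms, bonds, steps=2, vocabulary=("C", "N", "O", "H")):
    """atoms: list of element symbols; bonds: list of (i, j, bond_order)."""
    h = [one_hot(a, vocabulary) for a in atoms]
    neighbors = {v: [] for v in range(len(atoms))}
    for v, w, order in bonds:
        neighbors[v].append((w, order))
        neighbors[w].append((v, order))
    for _ in range(steps):
        new_h = []
        for v, h_v in enumerate(h):
            # Message: sum of neighbor states weighted by bond order
            # (a stand-in for a learned message function).
            m_v = [0.0] * len(vocabulary)
            for w, order in neighbors[v]:
                for k, x in enumerate(h[w]):
                    m_v[k] += order * x
            # Update: add the message to the current state
            # (a stand-in for a learned update function).
            new_h.append([a + b for a, b in zip(h_v, m_v)])
        h = new_h
    return h

def readout(h):
    """Sum over nodes, giving a vector independent of atom ordering."""
    return [sum(column) for column in zip(*h)]

# Heavy-atom skeleton of ethanol (C-C-O) with single bonds.
states = message_passing(["C", "C", "O"], [(0, 1, 1.0), (1, 2, 1.0)], steps=1)
molecule_vector = readout(states)
```

After one step each atom's state already encodes its immediate chemical environment, and the summed readout is the same regardless of the order in which the atoms are listed. Real message passing networks replace the fixed sums with trainable functions.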
We will focus only on the latter, which has recently been increasingly adopted in predictive machine learning, with unprecedented accuracy for a range of properties and data sets. A number of related approaches for predictive feature/property learning have been [...]
• Learns molecular representation centered on bonds instead of atoms
Table 1. The comparison of mean absolute errors obtained from some of the benchmark models with their target chemical accuracy is reported in Table 2. This shows that when appropriate ML models are used with a proper representation of molecules and a well-curated, accurate data set, the much sought-after chemical accuracy can be achieved with machine learning.
To achieve the long overdue goals of exploring a large chemical space, accelerating molecular design, and generating molecules with desired properties, inverse design is unavoidable. It is generally known that a molecule should have specific functionalities for it to be an effective therapeutic candidate against a particular disease, but in many cases new molecules that host such functionalities cannot easily be found with a direct approach. Furthermore, the pool where such molecules may exist is astronomically large [78, 79, 80] (approx. [...]). In such scenarios, inverse design is of significant interest: the focus is on quickly identifying novel molecules with desired properties, in contrast to the conventional, so-called direct approach, where known molecules are explored for different properties. In inverse design, we start with an initial data set for which we know the structures and properties, map it to a probability distribution, and then use that distribution to efficiently generate new, previously unknown candidate molecules with desired properties. Inverse design uses optimization and search algorithms [81, 82] for this purpose and by itself can accelerate the lead molecule discovery process, which is the first step in any drug development.
This paradigm holds even more promise when used in a closed loop with synthesis, characterization and different test tools in such a way that each of these steps receives and transmits feedback concurrently, so that the steps improve each other over time. This has recently shown some promise by substantially reducing the timeline for commercialization of molecules, from their discovery, to days, a process that is otherwise known to span over a decade in most cases. In one recent work, [...] Some of these issues can be eliminated by using the reinforced adversarial neural computer method [95], which extends their work. Similar to VAEs, GANs have also been used for molecular graph generation, which is considered more robust than SMILES string generation. Cao et al. [91] non-sequentially and efficiently generated the molecular graphs of small molecules with high validity and novelty from jointly trained GAN and reinforcement learning architectures. Maziarka et al. [89] proposed a method for graph-to-graph translation, where they generated 100% valid molecules structurally similar to the input molecules but with different desired properties. Their approach relies on the latent space trained for JT-VAE, and the degree of similarity of the generated molecules to the starting ones can be tuned. Mendez-Lucio et al. [96] proposed conditional generative adversarial networks to generate molecules that produce a desired biological effect at a cellular level, thus bridging systems biology and molecular design. A deep convolutional NN based GAN [90] was used for de novo drug design targeting types of cannabinoid receptors. Generative models such as GANs, RNN [...] [99] proposed a fragment-based RL approach employing an actor-critic model for generating more than 90% valid molecules while optimizing multiple properties. Genetic algorithms (GA) have also been used for generating molecules while optimizing their properties [100, 101, 102, 103].
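To make the genetic algorithm idea concrete, the following minimal sketch evolves short strings over a toy alphabet. It is an illustration only: the alphabet, the `fitness` function (which simply counts N and O characters as a stand-in for a property predictor), and all parameter values are hypothetical and do not correspond to any of the cited methods or to valid chemistry.

```python
import random

ALPHABET = "CNOF"  # hypothetical building blocks, not real chemistry

def fitness(candidate):
    """Toy objective: count N and O characters (stand-in for a property model)."""
    return sum(1 for ch in candidate if ch in "NO")

def mutate(candidate, rng, rate=0.1):
    """Replace each character with a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else ch for ch in candidate
    )

def crossover(a, b, rng):
    """Single-point crossover between two parents of equal length."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=20, length=12, generations=30, seed=0):
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(length))
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children  # parents survive unchanged (elitism)
    return max(population, key=fitness), history

best, history = evolve()
```

Because the parents are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next. In a molecular setting the strings would be replaced by a valid molecular encoding and the fitness by a property model or simulation.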
GA-based models suffer from stagnation when they become trapped in regions of local optima [104]. One notable work alleviating these problems is by Nigam et al. [53], who hybridize a GA with a deep neural network to generate diverse molecules while outperforming related models in optimization. All of the generative models discussed above generate molecules in the form of 2D graphs or SMILES strings. Models that generate molecules directly in the form of 3D coordinates have also recently gained attention [105, 106, 107]. Such generated 3D coordinates can be used directly for further simulation with quantum mechanics or docking methods. One of the first such models was proposed by Gebauer et al. [107], who generate 3D coordinates of small molecules with light atoms (H, C, N, O, F). They use the 3D coordinates of the molecules to learn a representation that maps them to a space, which is then used to generate 3D coordinates of novel molecules. Building on this for a drug discovery application, we recently proposed a model [66] that generates 3D coordinates of molecules while always preserving desired scaffolds. This approach has generated synthesizable drug-like molecules that show high docking scores against the target protein. Other scaffold-based models that generate molecules in the form of 2D graphs or SMILES strings have also been published in the literature [108, 109, 110, 111, 112]. Recently, with the huge interest in the development of the architectures and algorithms required for quantum computing, quantum versions of generative models, such as the quantum auto-encoder [113] and quantum GANs [114], have been proposed, which carry huge potential for drug discovery, among other applications. The preliminary proof of concept work of Romero et al.
[114, 113] shows that it is possible to encode and decode molecular information using a quantum encoder, demonstrating that generative modeling is possible with quantum VAEs; more work is required in this direction, especially in the development of supporting hardware architecture.
The success of current ML approaches depends upon how accurately we can represent a chemical structure for a given model. Finding a robust, transferable, interpretable, and easy-to-obtain representation that obeys the fundamental physics and chemistry of molecules and works for all kinds of applications is a critical task. If available, this would save a lot of resources while increasing the accuracy and flexibility of molecular representations. Efficiently using such representations with robust and reproducible ML architectures will provide a predictive modeling engine. Once a desired accuracy is achieved for a given property prediction across diverse molecular systems, the engine can be used routinely as an alternative to expensive QM-based simulations or experiments. In the chemical and biological sciences, a main bottleneck for deploying ML models is the lack of sufficient data, curated under similar conditions, for training the models. Finding architectures that work consistently well with relatively small amounts of data is equally important. Strategies such as active learning (AL) and transfer learning (TL) are ideal for tackling problems in such scenarios [115, 116, 117, 118, 119]. Graph-based methods for end-to-end feature learning and predictive modeling have so far been used successfully on small molecules consisting of lighter atoms. For larger molecules, robust representation learning and molecule generation must include non-local interactions such as van der Waals interactions and hydrogen bonding when building predictive and generative models.
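One common way to realize the active learning strategy mentioned above is to label next the candidates on which an ensemble of models disagrees most. The sketch below illustrates this idea only: the three "models" are hypothetical toy functions of a single numeric descriptor, and `select_for_labeling` is a name introduced for this example rather than an API from any cited work.

```python
from statistics import pstdev

def select_for_labeling(pool, ensemble, batch_size=2):
    """Return the candidates with the largest ensemble disagreement."""
    scored = []
    for candidate in pool:
        predictions = [model(candidate) for model in ensemble]
        # Standard deviation across models serves as an uncertainty proxy.
        scored.append((pstdev(predictions), candidate))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [candidate for _, candidate in scored[:batch_size]]

# Hypothetical ensemble: three toy regressors that agree near x = 0
# and diverge as the descriptor x grows.
ensemble = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.1 * x * x,
    lambda x: 2.0 * x - 0.1 * x * x,
]
pool = [0.5, 1.0, 3.0, 2.0]  # unlabeled candidates (toy descriptors)
to_label = select_for_labeling(pool, ensemble)
```

Here the candidates with descriptors 3.0 and 2.0 are selected because the toy models diverge most there. In practice the selected molecules would be sent to simulation or experiment, and the models retrained on the new labels.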
Equally important is to develop a robust, transferable, and scalable state-of-the-art platform for inverse molecular design and to tie it in a closed loop with the predictive modeling engine, in order to accelerate therapeutic design and ultimately reduce the cost and time required. Many of the ML models used for inverse design use a single biochemical activity as the criterion for measuring the success of a generated candidate therapeutic. This is in contrast to real clinical trials, where small molecule therapeutics are optimized for several bioactivities simultaneously. A CAMD workflow should be designed to optimize multiple objective functions while generating and validating therapeutic molecules. Validating every newly generated lead molecule for a given drug usage by experiments or quantum mechanical simulations is expensive. Ways to auto-validate molecules (using an inbuilt, robust predictive model) would be ideal for saving resources. In addition, CAMD workflows should be able to quantify the uncertainty associated with their predictions using statistical measures. Ideally, such uncertainty should decrease over time as the workflow learns from its own experience in a series of closed loops. CAMD workflows are generally built and trained with a specific goal in mind. Such workflows need to be re-configured and re-trained to work for a different objective in therapeutic design and discovery. Designing and building a single automated CAMD setup for multiple experiments (multi-parameter optimization) in a transfer learning fashion is a challenge. It would be particularly helpful for domains where only a relatively small amount of data exists. Having such a CAMD infrastructure, algorithm and software stack would speed up end-to-end antiviral lead design and optimization for any future pandemics like COVID-19.
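As a minimal illustration of the multi-objective optimization called for above, candidates can be compared by Pareto dominance instead of a single score. The sketch assumes that every objective is to be maximized; the candidate names and score values are invented for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: dict mapping a name to a tuple of objective scores."""
    return sorted(
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    )

# Hypothetical (potency, solubility) scores, both to be maximized.
candidates = {
    "mol_a": (0.9, 0.2),
    "mol_b": (0.6, 0.7),
    "mol_c": (0.5, 0.5),
    "mol_d": (0.3, 0.9),
}
front = pareto_front(candidates)
```

In this toy example `mol_c` is dropped because `mol_b` is better on both objectives, while the other three candidates represent different trade-offs and remain on the front.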
Deep learning enables rapid identification of potent DDR1 kinase inhibitors Off-line quality control, parameter design, and the taguchi method Machine-learned and codified synthesis parameters of oxide materials Computational methods in drug discovery Innovation in the pharmaceutical industry: new estimates of r&d costs Envisioning the future: medicine in the year 2050 How to improve r&d productivity: the pharmaceutical industry's grand challenge Idea2data: toward a new paradigm for drug discovery Creating a virtual assistant for medicinal chemistry Current and future roles of artificial intelligence in medicinal chemistry synthesis A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st century Collision cross sections for structural proteomics Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features Inhomogeneous electron gas Self-consistent equations including exchange and correlation effects A high-throughput infrastructure for density functional theory calculations The electrolyte genome project: A big data approach in battery materials discovery Orbnet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features Analytical gradients for molecular-orbital-based machine learning Quantum chemistry in the age of machine learning Quantum chemical accuracy from density functional approximations via machine learning Quantum chemistry structures and properties of 134 kilo molecules Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17 Nist standard reference simulation website, nist standard reference database number 173, national institute of standards and technology The ModelSEED Biochemistry Database for the Integration of Metabolic Annotations and the Reconstruction, Comparison and Analysis of Metabolic Models for Plants, Fungi and Microbes Text-mined dataset of inorganic materials synthesis 
recipes Text Mining for Drug Discovery Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature Information retrieval and text mining technologies for chemistry Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity Graph networks as a universal machine learning framework for molecules and crystals Deep learning for molecular design-a review of the state of the art Smiles enumeration as data augmentation for neural network modeling of molecules Neural message passing for quantum chemistry Representation learning on graphs Molecular graph convolutions: moving beyond fingerprints Moleculenet: A benchmark for molecular machine learning Fast and accurate modeling of molecular atomization energies with machine learning Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules Inchi -the worldwide chemical structure identifier standard. 
journal of cheminformatics International chemical identifier for chemical reactions Applying machine learning techniques to predict the properties of energetic materials Selfies: a robust representation of semantically constrained graphs with an example application in chemistry Algorithm for advanced canonical coding of planar chemical structures that considers stereochemical and symmetric information Towards a Universal SMILES representation -A standard method to generate canonical SMILES based on the InChI DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures A deep-learning view of chemical space designed to facilitate drug discovery Augmenting genetic algorithms with deep neural networks for exploring the chemical space Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules Schnetpack: A deep learning toolbox for atomistic systems Ampl: A data-driven modeling pipeline for drug discovery Message-passing neural networks for high-throughput polymer screening Analyzing learned molecular representations for property prediction Bayer's in silico admet platform: a journey of machine learning over the past two decades Quantum-chemical insights from deep tensor neural networks Schnet -a deep learning architecture for molecules and materials Schnet: A continuous-filter convolutional neural network for modeling quantum interactions Advances in Neural Information Processing Systems Geom: Energy-annotated molecular conformations for property prediction and molecular generation When do short-range atomistic machine-learning models fall short? 
Learning a local-variable model of aromatic and conjugated systems 3D-scaffold: Deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds Machine learning techniques and drug design Machine learning in drug discovery and development part 1: A primer Machine learning in chemoinformatics and drug discovery Ranking chemical structures for drug discovery: A new machine learning approach Machine learning for target discovery in drug development Applications of machine learning in drug target discovery Argumentative comparative analysis of machine learning on coronary artery disease Convolutional networks on graphs for learning molecular fingerprints Prediction errors of molecular machine learning models lower than hybrid dft error Neural message passing with edge updates for predicting properties of molecules and materials Estimation of the size of drug-like chemical space based on GDB-17 data PubChem Substance and Compound databases Defining and exploring chemical spaces Inverse design in search of materials with target functionalities Inverse strategies for molecular design Automatic chemical design using a data-driven continuous representation of molecules Recurrent neural network regularization Deep learning in neural networks: An overview Proceedings of the 34th International Conference on Machine Learning Constrained graph variational autoencoders for molecule design Learning multimodal graph-to-graph translation for molecular optimization Multi-resolution autoregressive graph-to-graph translation for moleculesdoi Deep convolutional generative adversarial network (dcgan) models for screening and design of small molecules targeting cannabinoid receptors Molgan: An implicit generative model for small molecular graphs drugan: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico Objectivereinforced generative adversarial networks (organ) for 
sequence generation models Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Reinforced adversarial neural computer for de novo molecular design De Novo Generation of Hitlike Molecules from Gene Expression Signatures Using Artificial Intelligence Deep reinforcement learning for de novo drug design Molecular de novo design through deep reinforcement Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design Computational design and selection of optimal organic photovoltaic materials Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds Strategy to discover diverse optimal molecules in the small molecule universe A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space Properties of a genetic algorithm equipped with a dynamic penalty function Symmetry-aware actor-critic for 3d molecular design (2020) Hernández-Lobato, Reinforcement learning for molecular design guided by quantum mechanics Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules Deepscaffold: A comprehensive tool for scaffold-based de novo drug discovery using deep learning Scaffold-based molecular design with a graph generative model SMILESbased deep generative scaffold decorator for de-novo drug design Scaffold-Based Drug Discovery ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees Quantum autoencoders for efficient compression of quantum data Quantum machine learning Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration Active learning in the drug discovery process Active learning strategies with combine analysis: new tricks for an old dog Bradshaw: a system for automated molecular design Deep model based transfer and 
multi-task learning for biological image analysis