key: cord-0674530-w3bl1d22
authors: Joshi, Rajendra P.; Kumar, Neeraj
title: Artificial Intelligence based Autonomous Molecular Design for Medical Therapeutic: A Perspective
date: 2021-02-10
journal: nan
DOI: nan
sha: bb6a7f1dedacda7d64710c17f27ec1ecce4a0a89
doc_id: 674530
cord_uid: w3bl1d22

Domain-aware machine learning (ML) models have been increasingly adopted for accelerating small molecule therapeutic design in recent years. These models have been enabled by significant advances in state-of-the-art artificial intelligence (AI) and computing infrastructure. Several ML architectures are predominantly and independently used either for predicting the properties of small molecules or for generating lead therapeutic candidates. Using these individual components synergistically and autonomously in closed loops, along with robust representation and data generation techniques, holds enormous promise for accelerating drug design, which is otherwise a time-consuming and expensive task. In this perspective, we present the most recent breakthroughs achieved by each of the components and discuss how such an autonomous AI and ML workflow can be realized to radically accelerate hit identification and lead optimization. Taken together, this could significantly shorten the timeline for end-to-end antiviral discovery and optimization to weeks upon the arrival of a novel zoonotic transmission event. Our perspective serves as a guide for researchers to practice autonomous molecular design in therapeutic discovery.

lot of resources [4, 5, 6, 7]. In this scenario, complete automation of laboratories is long overdue and has been used with limited success in the past [8, 9, 10, 11].

Automating the computational design of molecules by integrating physics-based simulations and optimization with ML approaches is a feasible and efficient alternative; it can contribute significantly to accelerating autonomous molecular design. High-throughput quantum mechanical calculations, such as efficient density functional theory (DFT) based simulations, are the first step towards this goal: they provide insight into a larger chemical space and have shown some promise in accelerating novel molecule discovery. However, this approach still requires human intelligence for different decision-making processes and cannot autonomously guide small molecule therapeutic discovery steps, thus slowing down the entire process. Additionally, inverse design of molecules is notoriously difficult with DFT alone. The amount of data produced by these high-throughput methods is so large that it cannot be analyzed in real time with conventional methods.

Autonomous computational design and characterization of molecules is more important in scenarios where existing experimental/computational approaches are inefficient [12, 13]. One particular example is the challenge of identifying new metabolites in a biological sample from mass spectrometry data, which requires mapping the fragmented spectra of novel molecules to an existing spectral library, making it slow and tedious. In many cases, such reference libraries do not exist, and a machine-learning-integrated, automated workflow could be an ideal choice for rapid identification of metabolites as well as for expanding the existing libraries for future reference. Such a workflow has shown an early ability to quickly screen molecules and accurately predict their properties for different applications. The synergistic use of high-throughput methods in a closed loop with machine-learning-based methods capable of inverse design is considered vital for autonomous and accelerated discovery of molecules [11]. In this perspective, we discuss how computational workflows for autonomous molecular design can guide the goal of laboratory automation and review the current state-of-the-art artificial intelligence (AI) guided autonomous molecular design, focusing mainly on small molecule therapeutic discovery.

The workflow for computational autonomous molecular design (CAMD) must be an integrated, closed-loop system with (i) efficient data generation and extraction tools, (ii) robust data representation techniques, (iii) physics-based predictive machine learning models, and (iv) tools to generate new molecules using the knowledge learned from steps i-iii. Ideally, an autonomous computational workflow for molecule discovery would learn from its own experience and adjust its functionality as the chemical environment or the targeted functionality changes. This can be achieved when all the components work in collaboration with each other, providing feedback while improving model performance as we move from one step to the next.
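The interplay of components (i)-(iv) can be sketched as a small loop. This is a minimal illustration under strong assumptions, not an implementation of any published workflow: a "molecule" is reduced to a single number, `oracle` is a hypothetical stand-in for an expensive data source such as a DFT run, the predictive model is a nearest-neighbour lookup, and the generator simply perturbs the best candidates found so far. All function names are invented for this example.

```python
import random

def oracle(x):
    """Hypothetical stand-in for an expensive data source (e.g. a DFT run)."""
    return -(x - 3.0) ** 2  # toy property, largest at x = 3

def predict(x, data):
    """Toy surrogate model: return the label of the nearest known point."""
    nearest = min(data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def generate(data, rng, n=20, step=0.5):
    """Toy generator: perturb the three best candidates seen so far."""
    best = sorted(data, key=lambda pair: pair[1], reverse=True)[:3]
    return [x + rng.uniform(-step, step) for x, _y in best for _ in range(n)]

def closed_loop(n_rounds=10, seed=0):
    rng = random.Random(seed)
    # (i) initial data generation
    data = [(x, oracle(x)) for x in (rng.uniform(-5, 5) for _ in range(5))]
    for _ in range(n_rounds):
        candidates = generate(data, rng)                  # (iv) propose new candidates
        ranked = sorted(candidates,
                        key=lambda x: predict(x, data),   # (iii) cheap screening
                        reverse=True)
        for x in ranked[:3]:
            data.append((x, oracle(x)))                   # validate and feed back into (i)
    return max(data, key=lambda pair: pair[1])

best_x, best_y = closed_loop()
```

Each round enlarges the data set with validated candidates, so the surrogate and the generator both work from more information in the next round; in a real workflow each toy function would be replaced by the corresponding component.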

For data generation in CAMD, high-throughput density functional theory (DFT) [14, 15] is a common choice, mainly because of its reasonable accuracy and efficiency [16, 17]. In DFT, we feed in 3D structures to predict the properties of interest. Data generated from DFT simulations is processed to extract the more attempted by more robustly encoding rings and branches of molecules to find more concrete representations with high semantic and syntactic validity using canonical SMILES [48, 49], InChI [41, 42], SMARTS [50], DeepSMILES [51], DESMILES [52], etc. More recently, Krenn et al. proposed a 100 % syntactically correct and robust string-based representation of molecules known as SELFIES [46], which has been increasingly adopted for predictive and generative modelling [53].
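To make the notion of syntactic validity concrete, the sketch below checks two rules that a SMILES string must satisfy: branch parentheses must be balanced and every ring-closure digit must appear in pairs. It is a deliberately simplified illustration, not a SMILES parser: two-digit (`%nn`) ring closures and all semantic checks such as valence are ignored, and the function name `smiles_syntax_ok` is hypothetical.

```python
def smiles_syntax_ok(smiles):
    """Check balanced branches and paired ring-closure digits in a SMILES string.

    Simplified on purpose: ignores %nn ring closures and does not test valence.
    """
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits that are currently unmatched
    in_bracket = False   # True while inside a [...] bracket atom
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue     # digits inside brackets are isotopes, H counts or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}   # first occurrence opens a ring, second closes it
    return depth == 0 and not open_rings

examples = ["c1ccccc1", "CC(=O)O", "c1ccccc", "CC(=O"]
results = {s: smiles_syntax_ok(s) for s in examples}
```

A generative model that emits characters one at a time can easily produce strings like the last two examples, which is the failure mode that more robust encodings such as SELFIES are designed to avoid.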

Recently, molecular representations that can be iteratively learned directly from molecules have increasingly gained adoption, mainly for predictive molecular modeling, achieving chemical accuracy for a range of properties [31, 54, 55]. Such representations are more robust and outperform expert-designed representations in drug discovery [56]. For representation learning, different variants of graph neural networks are a popular choice [34, 57]. The process starts with generating atom (node) and bond (edge) features for all the atoms and bonds within a molecule; these are iteratively updated using graph traversal algorithms, taking into account the chemical environment information, to learn a robust molecular representation. The starting atom and bond features of the molecule may simply be one-hot encoded vectors that include only the atom type and bond type, or a list of properties of the atoms and bonds derived from SMILES strings. Yang et al. achieved chemical accuracy for predicting a number of properties with their ML models by combining the atom and bond features of molecules with global state features before they are updated during the iterative process [58].
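The iterative update described above can be sketched in a few lines. This is a toy illustration, assuming one-hot atom-type features, a message defined as the bond-order-weighted sum of neighbour states, and an update that averages the old state with the incoming message; in the cited models these message and update functions are learned neural networks. The function names and the three-element vocabulary are hypothetical.

```python
def one_hot(symbol, vocabulary=("C", "N", "O")):
    """Initial atom feature: one-hot vector over a small element vocabulary."""
    return [1.0 if symbol == s else 0.0 for s in vocabulary]

def message_passing(atoms, bonds, n_steps=2):
    """atoms: list of element symbols; bonds: list of (i, j, bond_order)."""
    h = [one_hot(a) for a in atoms]
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))
    for _ in range(n_steps):
        new_h = []
        for v in range(len(atoms)):
            # Message: neighbour states weighted by bond order (toy message function)
            m = [0.0] * len(h[v])
            for w, order in neighbours[v]:
                for k in range(len(m)):
                    m[k] += order * h[w][k]
            # Update: average of old state and message (toy update function)
            new_h.append([0.5 * (a + b) for a, b in zip(h[v], m)])
        h = new_h
    return h

def readout(h):
    """Sum atom states into one molecule-level vector (order independent)."""
    return [sum(column) for column in zip(*h)]

# Heavy-atom graph of ethanol, C-C-O, with single bonds
states = message_passing(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)], n_steps=1)
```

After one step the central carbon already carries information about the oxygen next to it, and because the readout is a sum, relabelling the atoms leaves the molecule-level vector unchanged.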

Molecules are 3-dimensional, multiconformational objects, and hence it is natural to assume that they can be well represented by their nuclear coordinates, as is the case in quantum mechanics based molecular simulations [59]. However, a coordinate-based representation of molecules is non-invariant, non-invertible and non-unique in nature [32] and hence is not commonly used in conventional machine learning. In addition, the coordinates by themselves do not carry information about key attributes of molecules such as bond types, symmetry, spin states, charge, etc. Approaches and architectures have been proposed to create robust, unique, invariant representations from nuclear coordinates using atom-centered Gaussian functions, tensor field networks, and, more robustly, representation learning techniques [60, 61, 62, 55, 31, 63].
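The invariance problem can be demonstrated with a short sketch. Rotating a molecule changes every Cartesian coordinate although the molecule itself is unchanged, whereas the sorted list of interatomic distances stays the same. The geometry below is an approximate, illustrative water-like arrangement in ångström, and the sorted distance list is used only as the simplest invariant descriptor; it discards atom types and is not guaranteed to be unique, so it is not one of the learned representations cited above.

```python
import math

def rotate_z(coords, angle):
    """Rotate a list of (x, y, z) points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def sorted_distances(coords):
    """Sorted list of all pairwise interatomic distances."""
    distances = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            distances.append(math.dist(coords[i], coords[j]))
    return sorted(distances)

# Approximate, illustrative geometry (O, H, H); values are assumptions
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
rotated = rotate_z(water, math.pi / 3)

coordinates_changed = rotated != water
distances_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(sorted_distances(water), sorted_distances(rotated))
)
```

A model fed raw coordinates would see `water` and `rotated` as two different inputs, which is why invariant or learned representations are preferred.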

Chen et al. [31] achieved chemical accuracy for predicting a number of properties with their ML models by combining the atom and bond features of molecules with global state features of the molecules before they are updated during the iterative process. A robust representation of molecules can also be learned from only the nuclear charges and coordinates of molecules, as demonstrated by Schütt et al. [62, 55, 60]. Different variants (see Table 1) of message passing neural networks for representation learning have been proposed, with the main differences being how messages are passed between the nodes and edges and how they are updated during the iterative process using hidden states h_v^t. Hidden states at each node during the message passing phase are updated using

models, information is propagated back and forth in the molecules in the form of waves, making it possible to pass the information locally while simultaneously traversing the entire molecule in a single pass. With the unprecedented success of learned molecular representations for predictive modelling, they have also been adopted with success for generative models [54, 66].

Predictive modeling is the most widely studied area of applied machine learning in molecular modeling, drug discovery and medicine [67, 68, 69, 70, 71, 72, 62, 60, 55, 73]. Predictive modeling can be broadly classified into two sub-categories, depending upon whether the ML architecture requires pre-defined representations as input features or can learn its own input representation. The former is well covered in several recent review articles [67, 68, 69, 70, 71, 72]. We will focus only on the latter, which has recently been increasingly adopted in predictive machine learning with unprecedented accuracy for a range of properties and data sets. A number of related approaches for predictive feature/property learning have been proposed; one of them, for example, learns a molecular representation centered on bonds instead of atoms (Table 1). The comparison of mean absolute errors obtained from some of the benchmark models with their target chemical accuracy is reported in Table 2. This shows that when appropriate ML models are used with a proper representation of molecules and a well-curated, accurate data set, the much sought-after state-of-the-art chemical accuracy can be achieved with machine learning.
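The comparison behind such a table amounts to computing a mean absolute error (MAE) and checking it against a target. The sketch below assumes a target of 1 kcal/mol, a commonly quoted chemical-accuracy threshold for energies, converted to eV; the per-property targets of Table 2 may differ. The reference and predicted values are invented toy numbers, not results from any model discussed here.

```python
KCAL_PER_MOL_IN_EV = 0.0433641  # 1 kcal/mol expressed in eV

def mean_absolute_error(predicted, reference):
    """Average absolute deviation between predictions and reference values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def reaches_target(predicted, reference, target):
    """True if the MAE is at or below the target accuracy."""
    return mean_absolute_error(predicted, reference) <= target

# Toy energies in eV (illustrative values only)
reference = [-1.20, -0.85, -2.10, -1.55]
predicted = [-1.18, -0.88, -2.07, -1.57]

mae = mean_absolute_error(predicted, reference)
within_chemical_accuracy = reaches_target(predicted, reference, KCAL_PER_MOL_IN_EV)
```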

To achieve the long overdue goal of exploring a large chemical space, accelerated molecular design, and generation of molecules with desired properties, inverse design is unavoidable. It is generally known that a molecule should have specific functionalities for it to be an effective therapeutic candidate against a particular disease, but in many cases new molecules that host such functionalities are not easily found with a direct approach. Furthermore, the pool where such molecules may exist is astronomically large [78, 79, 80] (approx. In such scenarios, inverse design is of significant interest, where the focus is on quickly identifying novel molecules with desired properties, in contrast to the conventional, so-called direct approach, where known molecules are explored for different properties. In inverse design, we start with an initial data set for which we know the structures and properties, map this to a probability distribution, and then use it to efficiently generate new, previously unknown candidate molecules with desired properties. Inverse design uses optimization and search algorithms [81, 82] for this purpose and by itself can accelerate the lead molecule discovery process, which is the first step of any drug development. This paradigm holds even more promise when used in a closed loop with synthesis, characterization and different test tools in such a way that each of these steps receives and transmits feedback concurrently, thus improving each other over time. This has shown some promise recently by substantially reducing the timeline for commercialization of molecules from their discovery to days, a process that is otherwise known to span over a decade in most cases. In one recent work, Some of these issues can be eliminated by using the reinforced adversarial neural computer method [95], which extends their work.
Similar to VAEs, GANs have also been used for molecular graph generation, which is considered more robust than SMILES string generation. Cao et al. [91] non-sequentially and efficiently generated the molecular graphs of small molecules with high validity and novelty from jointly trained GAN and reinforcement learning architectures. Maziarka et al. [89] proposed a method for graph-to-graph translation, where they generated 100 % valid molecules structurally similar to the input molecules but with different desired properties. Their approach relies on the latent space trained for JT-VAE, and the degree of similarity of the generated molecules to the starting ones can be tuned. Mendez-Lucio et al. [96] proposed conditional generative adversarial networks to generate molecules that produce a desired biological effect at a cellular level, thus bridging systems biology and molecular design. A deep convolutional NN based GAN [90] was used for de novo drug design targeting types of cannabinoid receptors.

Generative models such as GANs, RNN [99] proposed a fragment-based RL approach employing an actor-critic model for generating more than 90 % valid molecules while optimizing multiple properties. Genetic algorithms (GA) have also been used for generating molecules while optimizing their properties [100, 101, 102, 103]. GA-based models suffer from stagnation when they become trapped in regions of local optima [104]. One notable work alleviating these problems is by Nigam et al. [53], where they hybridize a GA with a deep neural network to generate diverse molecules while outperforming related models in optimization.
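The genetic-algorithm idea can be illustrated with a toy example. The sketch below assumes that candidates are short strings over a made-up alphabet, that the "property" is the number of positions matching a hidden target string, and that selection keeps the better half of the population each generation. None of this corresponds to a real chemical encoding or to the cited methods; the names `fitness`, `mutate`, `crossover` and `evolve` are hypothetical.

```python
import random

ALPHABET = "CNOF"  # toy building blocks, not a real chemical encoding

def fitness(candidate, target="CCNOCC"):
    """Hypothetical property score: number of positions matching a target."""
    return sum(a == b for a, b in zip(candidate, target))

def mutate(candidate, rng, rate=0.2):
    """Replace each character with a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else ch
                   for ch in candidate)

def crossover(a, b, rng):
    """Single-point crossover of two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=30, length=6, generations=40, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # truncation selection
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children         # parents survive unchanged
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history

best, history = evolve()
```

A long flat stretch in `history` below the maximum score is the kind of stagnation described above: the population has converged on similar candidates and mutation alone rarely escapes.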

All of the generative models discussed above generate molecules in the form of 2D graphs or SMILES strings. Models that generate molecules directly in the form of 3D coordinates have also recently gained attention [105, 106, 107]. Such generated 3D coordinates can be used directly for further simulation using quantum mechanics or docking methods. One of the first such models was proposed by Gebauer et al. [107], who generate 3D coordinates of small molecules with light atoms (H, C, N, O, F). They use the 3D coordinates of the molecules to learn a representation and map it to a space, which is then used to generate 3D coordinates of novel molecules. Building on this for a drug discovery application, we recently proposed a model [66] to generate 3D coordinates of molecules while always preserving desired scaffolds. This approach has generated synthesizable drug-like molecules that show high docking scores against the target protein.

Other scaffold-based models that generate molecules in the form of 2D graphs or SMILES strings have also been published in the literature [108, 109, 110, 111, 112].

Recently, with the huge interest in the development of the architectures and algorithms required for quantum computing, quantum versions of generative models such as the quantum auto-encoder [113] and quantum GANs [114] have been proposed, which carry huge potential for drug discovery, among other applications. The preliminary proof-of-concept work of Romero et al. [114, 113] shows that it is possible to encode and decode molecular information using a quantum encoder, demonstrating that generative modeling is possible with quantum VAEs; more work is required in this direction, especially in the development of supporting hardware architecture.

The success of current ML approaches depends upon how accurately we can represent a chemical structure for a given model. Finding a robust, transferable, interpretable, and easy-to-obtain representation that obeys the fundamental physics and chemistry of molecules and works for all kinds of applications is a critical task. If available, this would save a lot of resources while increasing the accuracy and flexibility of molecular representations. Efficiently using such representations with robust and reproducible ML architectures will provide a predictive modeling engine. Once the desired accuracy for a given property prediction is achieved across diverse molecular systems, the engine can routinely be used as an alternative to expensive QM-based simulations or experiments. In the chemical and biological sciences, a main bottleneck for deploying ML models is the lack of sufficient data, curated under similar conditions, for training the models. Finding architectures that work consistently well with relatively small amounts of data is equally important.

Strategies such as active learning (AL) and transfer learning (TL) are ideal for tackling problems in such low-data scenarios [115, 116, 117, 118, 119]. Graph-based methods for end-to-end feature learning and predictive modeling have so far been successfully used on small molecules consisting of lighter atoms. For larger molecules, the representation learning and molecule generation parts must include non-local interactions such as van der Waals interactions and H-bonding when building predictive and generative models.

Equally important is to develop a robust, transferable, and scalable state-of-the-art platform for inverse molecular design and to tie it in a closed loop with the predictive modeling engine to accelerate therapeutic design, ultimately reducing the cost and time required. Many of the ML models used for inverse design use a single biochemical activity as the criterion to measure the success of a generated candidate therapeutic, which is in contrast to real clinical trials, where small molecule therapeutics are optimized for several bioactivities simultaneously. A CAMD workflow should be designed to optimize multiple objective functions while generating and validating therapeutic molecules. Validating all newly generated lead molecules for a given drug usage by experiments or quantum mechanical simulations is an expensive task. Ways to auto-validate molecules (using an inbuilt robust predictive model) would be ideal to save resources. In addition, CAMD workflows should be able to quantify the uncertainty associated with their predictions using statistical measures. In the ideal case, such uncertainty should decrease over time as the workflow learns from its own experience over a series of closed loops.
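One simple way to attach an uncertainty to an auto-validation step is to use the disagreement within an ensemble of predictive models. The sketch below assumes this particular measure (the standard deviation of ensemble predictions), which is only one of several possible statistical measures; the three "models" are hypothetical linear functions that agree near zero and diverge away from it, and the threshold `max_std` is an arbitrary illustrative choice.

```python
import statistics

def ensemble_predict(models, x):
    """Return (mean, standard deviation) of the ensemble predictions for x."""
    predictions = [model(x) for model in models]
    return statistics.fmean(predictions), statistics.pstdev(predictions)

def needs_validation(models, x, max_std):
    """Flag candidates whose prediction is too uncertain to auto-validate."""
    _, std = ensemble_predict(models, x)
    return std > max_std

# Three hypothetical surrogate models; they diverge as x moves away from 0
models = [lambda x: 1.0 * x, lambda x: 1.1 * x, lambda x: 0.9 * x]

mean_near, std_near = ensemble_predict(models, 1.0)
mean_far, std_far = ensemble_predict(models, 10.0)
```

Candidates flagged by `needs_validation` would be sent to experiment or simulation, and adding their results to the training data is what should shrink the ensemble spread over successive loops.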

CAMD workflows are generally built and trained with a specific goal in mind. Such workflows need to be re-configured and re-trained to work for a different objective in therapeutic design and discovery. Designing and building a single automated CAMD setup for multiple experiments (multi-parameter optimization) in a transfer-learning fashion is a challenge. Such a setup would be particularly helpful for domains where a relatively small amount of data exists. Having such a CAMD infrastructure, algorithm and software stack would speed up end-to-end antiviral lead design and optimization for any future pandemics like COVID-19.

Deep learning enables rapid identification of potent DDR1 kinase inhibitors

Off-line quality control, parameter design, and the taguchi method

Machine-learned and codified synthesis parameters of oxide materials

Computational methods in drug discovery

Innovation in the pharmaceutical industry: new estimates of r&d costs

Envisioning the future: medicine in the year 2050

How to improve r&d productivity: the pharmaceutical industry's grand challenge

Idea2data: toward a new paradigm for drug discovery

Creating a virtual assistant for medicinal chemistry

Current and future roles of artificial intelligence in medicinal chemistry synthesis

A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st century

Collision cross sections for structural proteomics

Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features

Inhomogeneous electron gas

Self-consistent equations including exchange and correlation effects

A high-throughput infrastructure for density functional theory calculations

The electrolyte genome project: A big data approach in battery materials discovery

Orbnet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features

Analytical gradients for molecular-orbital-based machine learning

Quantum chemistry in the age of machine learning

Quantum chemical accuracy from density functional approximations via machine learning

Quantum chemistry structures and properties of 134 kilo molecules

Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17

Nist standard reference simulation website, nist standard reference database number 173, national institute of standards and technology

The ModelSEED Biochemistry Database for the Integration of Metabolic Annotations and the Reconstruction, Comparison and Analysis of Metabolic Models for Plants, Fungi and Microbes

Text-mined dataset of inorganic materials synthesis recipes

Text Mining for Drug Discovery

Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature

Information retrieval and text mining technologies for chemistry

Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity

Graph networks as a universal machine learning framework for molecules and crystals

Deep learning for molecular design-a review of the state of the art

Smiles enumeration as data augmentation for neural network modeling of molecules

Neural message passing for quantum chemistry

Representation learning on graphs

Molecular graph convolutions: moving beyond fingerprints

Moleculenet: A benchmark for molecular machine learning

Fast and accurate modeling of molecular atomization energies with machine learning

Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space

Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules

InChI - the worldwide chemical structure identifier standard

International chemical identifier for chemical reactions

Applying machine learning techniques to predict the properties of energetic materials

Selfies: a robust representation of semantically constrained graphs with an example application in chemistry

Algorithm for advanced canonical coding of planar chemical structures that considers stereochemical and symmetric information

Towards a Universal SMILES representation -A standard method to generate canonical SMILES based on the InChI

DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures

A deep-learning view of chemical space designed to facilitate drug discovery

Augmenting genetic algorithms with deep neural networks for exploring the chemical space

Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules

Schnetpack: A deep learning toolbox for atomistic systems

Ampl: A data-driven modeling pipeline for drug discovery

Message-passing neural networks for high-throughput polymer screening

Analyzing learned molecular representations for property prediction

Bayer's in silico admet platform: a journey of machine learning over the past two decades

Quantum-chemical insights from deep tensor neural networks

Schnet -a deep learning architecture for molecules and materials

Schnet: A continuous-filter convolutional neural network for modeling quantum interactions

Advances in Neural Information Processing Systems

Geom: Energy-annotated molecular conformations for property prediction and molecular generation

When do short-range atomistic machine-learning models fall short?

Learning a local-variable model of aromatic and conjugated systems

3D-scaffold: Deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds

Machine learning techniques and drug design

Machine learning in drug discovery and development part 1: A primer

Machine learning in chemoinformatics and drug discovery

Ranking chemical structures for drug discovery: A new machine learning approach

Machine learning for target discovery in drug development

Applications of machine learning in drug target discovery

Argumentative comparative analysis of machine learning on coronary artery disease

Convolutional networks on graphs for learning molecular fingerprints

Prediction errors of molecular machine learning models lower than hybrid dft error

Neural message passing with edge updates for predicting properties of molecules and materials

Estimation of the size of drug-like chemical space based on GDB-17 data

PubChem Substance and Compound databases

Defining and exploring chemical spaces

Inverse design in search of materials with target functionalities

Inverse strategies for molecular design

Automatic chemical design using a data-driven continuous representation of molecules

Recurrent neural network regularization

Deep learning in neural networks: An overview

Proceedings of the 34th International Conference on Machine Learning

Constrained graph variational autoencoders for molecule design

Learning multimodal graph-to-graph translation for molecular optimization

Multi-resolution autoregressive graph-to-graph translation for molecules

Deep convolutional generative adversarial network (dcgan) models for screening and design of small molecules targeting cannabinoid receptors

Molgan: An implicit generative model for small molecular graphs

drugan: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico

Objective-reinforced generative adversarial networks (organ) for sequence generation models

Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design

Reinforced adversarial neural computer for de novo molecular design

De Novo Generation of Hitlike Molecules from Gene Expression Signatures Using Artificial Intelligence

Deep reinforcement learning for de novo drug design

Molecular de novo design through deep reinforcement

Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design

Computational design and selection of optimal organic photovoltaic materials

Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds

Strategy to discover diverse optimal molecules in the small molecule universe

A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space

Properties of a genetic algorithm equipped with a dynamic penalty function

Symmetry-aware actor-critic for 3d molecular design

Reinforcement learning for molecular design guided by quantum mechanics

Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules

Deepscaffold: A comprehensive tool for scaffold-based de novo drug discovery using deep learning

Scaffold-based molecular design with a graph generative model

SMILES-based deep generative scaffold decorator for de-novo drug design

Scaffold-Based Drug Discovery

ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees

Quantum autoencoders for efficient compression of quantum data

Quantum machine learning

Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration

Active learning in the drug discovery process

Active learning strategies with combine analysis: new tricks for an old dog

Bradshaw: a system for automated molecular design

Deep model based transfer and multi-task learning for biological image analysis