authors: Gaudelet, Thomas; Day, Ben; Jamasb, Arian R.; Soman, Jyothish; Regep, Cristian; Liu, Gertrude; Hayter, Jeremy B. R.; Vickers, Richard; Roberts, Charles; Tang, Jian; Roblin, David; Blundell, Tom L.; Bronstein, Michael M.; Taylor-King, Jake P.
title: Utilising Graph Machine Learning within Drug Discovery and Development
date: 2020-12-09

Graph Machine Learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures, the functional relationships between them, and to integrate multi-omic datasets, amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarise work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones, including repurposed drugs entering in vivo studies, suggest graph machine learning will become a modelling framework of choice within biomedical machine learning.

The process from drug discovery to market costs, on average, well over $1 billion and can span 12 years or more [47, 178, 199]; due to high attrition rates, rarely can a drug progress to market in less than ten years [109, 139]. The high levels of attrition throughout the process not only make investments uncertain but also require market-approved drugs to pay for the earlier failures. Despite an industry-wide focus on efficiency for over a decade, spurred on by publications and annual reports highlighting revenue cliffs from ending exclusivity and falling productivity, significant improvements have proved elusive against the backdrop of scientific, technological and regulatory change [178]. For the aforementioned reasons, there is now greater interest in applying computational methodologies to expedite various parts of the drug discovery and development pipeline [153], see Figure 1.

Digital technologies have transformed the drug development process, generating enormous volumes of data. Changes range from the move to electronic lab notebooks [127] and electronic regulatory submissions, through increasing volumes of laboratory, experimental, and clinical trial data collection [183], including the use of devices [36, 108], to precision medicine and the use of "big data" [79]. The data collected about therapies extend well beyond research and development to include hospital, specialist and primary care medical professionals' patient records, including observations taken from social media, e.g. for pharmacovigilance [162, 172]. There are innumerable online databases and other sources of information, ranging from the scientific literature and clinical trials information through to databases of repurposable drugs [37, 137]. Technological advances now allow for greater -omic profiling beyond genotyping and whole genome sequencing (WGS); standardisation of microfluidics and antibody tagging has made single-cell technologies widely available to study the transcriptome, e.g. using RNA-seq [73], the proteome (targeted), e.g. via mass cytometry [177], or even multiple modalities together [111].
One of the key characteristics of the biomedical data produced and used in the drug discovery process is its inter-connected nature. Such data can be represented as a graph, a mathematical abstraction used ubiquitously across disciplines and fields in biology to model the diverse interactions between biological entities operating at different scales. At the molecular scale, proteins and other biomolecules can be represented as graphs capturing spatial and structural relationships between their amino acid residues [57, 209], and small molecule drugs as graphs relating their constituent atoms and chemical bonding structure [49, 93]. At an intermediary scale, interactomes are graphs that capture specific types of interactions between biomolecular species (e.g. metabolites, mRNA, proteins) [70], with protein-protein interaction (PPI) graphs being perhaps the most commonplace. Finally, at a higher level of abstraction, knowledge graphs can represent the complex relationships between drugs, side effects, diagnoses, associated treatments, and test results [33, 215] as found in electronic medical records (EMR).

Within the last decade, two emerging trends have reshaped the data modelling community: network analysis and deep learning. The "network medicine" paradigm has long been recognised in the biomedical field [7], with multiple approaches borrowed from graph theory and complex network science applied to biological graphs such as PPIs and gene regulatory networks (GRNs). Most approaches in this field were limited to handcrafted graph features such as centrality measures and clustering. In contrast, deep neural networks, a particular class of machine learning algorithm, are used to learn optimal task-specific features. The impact of deep learning was ground-breaking in computer vision [194] and natural language processing [208], but was limited to specific domains by its requirements on the regularity of data structures.

Figure 1: Timeline of drug development linked to potential areas of application of GML methodologies. Financial, timeline and success probability data taken from Paul et al. [139].

At the convergence of these two fields is Graph Machine Learning (GML), a new class of ML methods exploiting the structure of graphs and other irregular datasets (point clouds, meshes, manifolds, etc.). The essential idea of GML methods is to learn effective feature representations of nodes [142, 182] (e.g. users in social networks), edges (e.g. predicting future interactions in recommender systems), or entire graphs [181] (e.g. predicting properties of molecular graphs). In particular, graph neural networks (GNNs) [62, 92, 191], deep neural network architectures specifically designed for graph-structured data, are attracting growing interest. GNNs iteratively update the features of the nodes of a graph by propagating information from their neighbours. These methods have already been successfully applied to a variety of tasks and domains such as recommendation in social media and e-commerce [136, 157, 158, 203], traffic estimation in Google Maps [96], misinformation detection in social media [118], and various domains of the natural sciences, including the modelling of glass [38] and event classification in particle physics [34, 169].
In the biomedical domain, GML has now set the state of the art for mining graph-structured data, including drug-target-indication interaction and relationship prediction through knowledge graph embedding [6, 164, 182]; molecular property prediction [49, 93], including the prediction of absorption, distribution, metabolism, and excretion (ADME) profiles [53]; and applications ranging from early work in target identification [145] through to de novo molecule design [58, 211]. Most notably, Stokes et al. [179] used directed message passing graph neural networks operating on molecular structures to propose repurposing candidates for antibiotic development, validating their predictions in vivo; remarkably, the proposed candidates are structurally distinct from known antibiotics. Therefore, GML methods appear to be extremely promising in applications across the drug development pipeline.

The rest of this paper is organised as follows. Section 2 introduces notation and mathematical preliminaries to simplify the exposition. Section 3 introduces the diverse tasks addressed with machine learning approaches on graphs as well as the main classes of existing methods. Section 4 focuses on the application of machine learning on graphs within drug discovery and development before providing a closing discussion in Section 5.

We denote a graph $G = (\mathcal{V}, \mathcal{E}, X_v, X_e)$, where $\mathcal{V}$ is a set of $n = |\mathcal{V}|$ nodes, or vertices, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a set of $m$ edges. Let $u \in \mathcal{V}$ denote a node and $e = (u, v) \in \mathcal{E}$ denote an edge from node $u$ to node $v$. When multiple edges can connect the same pair of nodes, the graph is called a multigraph. Node features are represented by $X_v \in \mathbb{R}^{n \times d}$, with $\mathbf{x}_u \in \mathbb{R}^d$ the features of node $u$. Edge features, or attributes, are similarly represented by $X_e \in \mathbb{R}^{m \times c}$, where $\mathbf{x}_{u,v} = \mathbf{x}_{v,u} \in \mathbb{R}^c$ denotes the attributes of the edge $e_{u,v} = (u, v)$ from $u$ to $v$. Note that under this definition, undirected graphs are defined as directed graphs with each undirected edge represented by two directed edges.

The neighbourhood $\mathcal{N}(v)$ of node $v$, sometimes referred to as its one-hop neighbourhood, is the set of nodes that are connected to it by an edge, $\mathcal{N}(v) = \{u \in \mathcal{V} \mid (u, v) \in \mathcal{E}\}$, with the shorthand $\mathcal{N}(v) = \mathcal{N}_v$ used for compactness. The cardinality of a node's neighbourhood is called its degree, and the diagonal degree matrix $D$ has elements $D_{v,v} = |\mathcal{N}_v|$. Two nodes $u$ and $v$ in a graph $G$ are connected if there exists a path in $G$ starting at one and ending at the other, i.e. there exists a sequence of consecutive edges of $G$ connecting the two nodes. A graph is connected if there exists a path between every pair of nodes in the graph. The shortest path distance between $u$ and $v$ is defined as the number of edges in the shortest path between the two nodes and is denoted by $d(u, v)$. A graph $\tilde{G} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}}, \tilde{X}_v, \tilde{X}_e)$ is a subgraph of $G$ if and only if $\tilde{\mathcal{V}} \subseteq \mathcal{V}$ and $\tilde{\mathcal{E}} \subseteq \mathcal{E}$. If it also holds that $\tilde{\mathcal{E}} = (\tilde{\mathcal{V}} \times \tilde{\mathcal{V}}) \cap \mathcal{E}$, then $\tilde{G}$ is called an induced subgraph of $G$.

The adjacency matrix $A$ typically represents the relations between nodes, such that the entry in row $u$ and column $v$ indicates whether there is an edge from node $u$ to node $v$, with 1 representing that there is an edge and 0 that there is not, i.e. $A_{u,v} = \mathbb{1}\{(u, v) \in \mathcal{E}\}$. Most commonly, the adjacency matrix is square (from a set of nodes to itself), but the concept extends to bipartite graphs, where an $n \times m$ matrix can represent the edges from one set of $n$ nodes to another set of $m$ nodes, and is sometimes used to store scalar edge weights. The Laplacian matrix of a simple (unweighted) graph is often preferred, with a variant defined as $\tilde{L} = I - D^{-1} A$.
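To make the preceding notation concrete, the following minimal Python sketch (not from the original paper) builds the adjacency matrix $A$, the degree matrix $D$, and the variant Laplacian $\tilde{L} = I - D^{-1}A$ for a small toy undirected graph; the edge list is purely illustrative.

```python
import numpy as np

# Toy undirected graph on 4 nodes (illustrative edge list, not from the paper).
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix: A[u, v] = 1 if there is an edge between u and v.
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = 1.0
    A[v, u] = 1.0  # an undirected edge is stored as two directed edges

# Diagonal degree matrix D with D[v, v] = |N(v)|.
degrees = A.sum(axis=1)
D = np.diag(degrees)

# Variant of the graph Laplacian used in the text: L = I - D^{-1} A.
L = np.eye(n) - np.diag(1.0 / degrees) @ A

print(degrees)  # node degrees
print(L)        # random-walk normalised Laplacian
```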
The term knowledge graph is used to qualify a graph that captures types of relationships between a set of entities. In this case, $X_e$ includes relationship types as edge features. Knowledge graphs are commonly introduced as sets of triplets $(u, r, v) \in \mathcal{V} \times \mathcal{R} \times \mathcal{V}$, where $\mathcal{R}$ represents the set of relation types. Note that multiple edges of different types can connect two given nodes. As such, the standard adjacency matrix is ill-suited to capture the complexity of a knowledge graph. Instead, a knowledge graph is often represented as a collection of adjacency matrices $\{A_1, \ldots, A_{|\mathcal{R}|}\}$, forming an adjacency tensor, in which each adjacency matrix $A_r$ captures one type of relationship.

A random walk is a sequence of nodes selected at random during an iterative process. A random walk is constructed by considering a random walker that moves through the graph starting from a node $u$. At each step, the walker either moves to a neighbouring node $v \in \mathcal{N}_u$ with probability $p(v \mid u)$ or stays on node $u$ with probability $p(u \mid u)$. The sequence of nodes visited after a fixed number of steps $k$ gives a random walk of length $k$. Graph diffusion is a related notion that models the propagation of a signal on a graph. A classic example is heat diffusion [94], which studies the propagation of heat in a graph starting from some initial distribution.

Two graphs $G_1 = (\mathcal{V}_1, \mathcal{E}_1)$ and $G_2 = (\mathcal{V}_2, \mathcal{E}_2)$ are said to be isomorphic if there exists a bijective function $f : \mathcal{V}_1 \mapsto \mathcal{V}_2$ such that $\forall (u, v) \in \mathcal{E}_1$, $(f(u), f(v)) \in \mathcal{E}_2$. Determining whether two graphs are isomorphic is a recurrent problem in graph analysis that has deep ramifications for machine learning on graphs. For instance, in graph classification tasks, it is assumed that a model needs to capture the similarities between pairs of graphs to classify them accurately. The Weisfeiler-Lehman (WL) graph isomorphism test [198] is a classical polynomial-time algorithm in graph theory. It is based on iterative graph recolouring, starting with all nodes of identical "colour" (label). At each step, the algorithm aggregates the colours of nodes and their neighbourhoods and hashes the aggregated colours into unique new colours. The algorithm stops upon reaching a stable colouring. If, at that point, the colourings of the two graphs differ, the graphs are deemed non-isomorphic. However, if the colourings are the same, the graphs are possibly (but not necessarily) isomorphic. In other words, the WL test is a necessary but insufficient condition for graph isomorphism. There exist non-isomorphic graphs for which the WL test produces identical colourings and thus considers them possibly isomorphic; the test is said to fail in this case [10].

Most machine learning methods that operate on graphs can be decomposed into two parts: a general-purpose encoder and a task-specific decoder [27]. The encoder embeds a graph's nodes, or the graph itself, in a low-dimensional feature space. To embed entire graphs, it is common to first embed nodes and then apply a permutation-invariant pooling function to produce a graph-level representation (e.g. sum, max or mean over node embeddings), analogous to global pooling in convolutional neural networks. The decoder computes an output for the associated task. The components can either be combined in two-step frameworks, with the encoder pre-trained in an unsupervised setting, or trained in an end-to-end fashion. The end tasks can be classified following multiple dichotomies: supervised/unsupervised, inductive/transductive, and node-level/graph-level.

Supervised/unsupervised task.
This is the classical dichotomy found in machine learning [120]. Supervised tasks aim to learn a mapping function from labelled data such that the function maps each data point to its label and generalises to unseen data points. In contrast, unsupervised tasks highlight unknown patterns and uncover structures in unlabelled datasets.

Inductive/transductive task. Inductive tasks correspond to the supervised learning discussed above. Transductive tasks expect that all data points are available when learning a mapping function, including unlabelled data points [92]. Hence, in the transductive setting, the model learns from both unlabelled and labelled data. In this respect, inductive learning is more general than transductive learning, as it extends to unseen data points.

Node-level/graph-level task. This dichotomy is based on the object of interest. A task can either focus on the nodes within a graph, e.g. classifying nodes within the context set by the graph, or focus on whole graphs, i.e. each data point corresponds to an entire graph [181]. Note that node-level tasks can be further decomposed into node attribute prediction tasks [92] and link inference tasks [182]. The former focus on predicting properties of nodes, while the latter infer missing links in the graph.

As an illustration, consider the task of predicting the chemical properties of small molecules based on their chemical structures. This is a graph-level task in a supervised (inductive) setting, whereby labelled data is used to learn a mapping from chemical structure to chemical properties. Alternatively, the task of identifying groups of proteins that are tightly associated in a PPI graph is an unsupervised node-level task. However, predicting proteins' biological functions using their interactions in a PPI graph corresponds to a node-level transductive task. Further types of tasks can be identified, e.g. based on whether graphs are static or varying. Biological graphs can vary and evolve along a temporal dimension, resulting in changes to composition, structure and attributes [134, 146]. However, the classifications detailed above are the most commonly found in the literature. We review below the existing classes of GML methods.

3.1.1 Graph statistics. Over the past decades, a wealth of heuristics and statistics has been developed to characterise graphs and their nodes. For instance, the diverse centrality measures capture different aspects of graph connectivity: the closeness centrality quantifies how closely a node is connected to all other nodes, and the betweenness centrality counts how many of the shortest paths between pairs of other nodes in the graph pass through a given node. Furthermore, graph sub-structures can be used to derive topological descriptors of the wiring patterns around each node in a graph. For instance, motifs [114] and graphlets [147] correspond to sets of small graphs used to characterise the local wiring patterns of nodes. Specifically, for each node, one can derive a feature vector whose length corresponds to the number of motifs (graphlets) considered and whose $i$th entry gives the number of occurrences of motif $i$ in the graph that contain the node. These handcrafted features can provide node, or graph, representations that can be used as input to machine learning algorithms. A popular approach has been the definition of kernels based on graph statistics that can be used as input to Support Vector Machines (SVMs).
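As a hedged illustration of the handcrafted-statistics approach above, the sketch below uses NetworkX to compute simple node centralities and pools them into fixed-length graph descriptors fed to an SVM; the toy graphs and labels are placeholders, and plain summary features are used here rather than a full graphlet or WL kernel.

```python
import networkx as nx
import numpy as np
from sklearn.svm import SVC

def graph_descriptor(G):
    """Pool simple node-level statistics into a fixed-length graph feature vector."""
    closeness = list(nx.closeness_centrality(G).values())
    betweenness = list(nx.betweenness_centrality(G).values())
    degrees = [d for _, d in G.degree()]
    stats = []
    for values in (degrees, closeness, betweenness):
        stats += [np.mean(values), np.max(values)]
    return np.array(stats)

# Placeholder dataset: small random graphs with arbitrary binary labels.
graphs = [nx.gnp_random_graph(10, 0.3, seed=i) for i in range(20)]
labels = [i % 2 for i in range(20)]

X = np.stack([graph_descriptor(G) for G in graphs])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:4]))
```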
For instance, the graphlet kernel [167] captures node wiring-pattern similarity, and the WL kernel [166] captures graph similarity based on the Weisfeiler-Lehman algorithm discussed in Section 2.

3.1.2 Random walks. Random-walk based methods have been a popular, and successful, approach to embed a graph's nodes in a low-dimensional space such that node proximities are preserved. The underlying idea is that the distance between node representations in the embedding space should correspond to a measure of distance on the graph, measured here by how often a given node is visited in random walks starting from another node. DeepWalk [142] and node2vec [66] are arguably the most famous methods in this category. In practice, DeepWalk simulates multiple random walks for each node in the graph. Then, given the embedding $\mathbf{x}_u$ of a node $u$, the objective is to maximise the log probability $\log p(v \mid \mathbf{x}_u)$ for all nodes $v$ that appear in a random walk within a fixed window of $u$. The method draws its inspiration from the SkipGram model developed for natural language processing [113]. DeepWalk uses uniformly random walks, but several follow-up works analyse how to bias these walks to improve the learned representations. For example, node2vec biases the walks to behave more or less like certain search algorithms over the graph. The authors report a higher quality of embeddings, with respect to information content, when compared to DeepWalk.

Geometric models for knowledge graph embedding posit each relation type as a geometric transformation from source to target in the embedding space. Consider a triplet $(u, r, v)$, with $u$ denoting the source node and $v$ denoting the target node. A geometric model learns a transformation $\tau(\cdot, \cdot)$ such that $\delta(\tau(\mathbf{h}_u, \mathbf{h}_r), \mathbf{h}_v)$ is small, with $\delta(\cdot, \cdot)$ being some notion of distance (e.g. Euclidean distance) and $\mathbf{h}_x$ denoting the embedding of entity $x$. The key differentiating choice between these approaches is the form of the geometric transformation $\tau$. TransE [16] is a purely translational approach, where $\tau$ corresponds to the sum of the source node and relation embeddings. In essence, the model enforces that the motion from the embedding $\mathbf{h}_u$ of the source node in the direction given by the relation embedding $\mathbf{h}_r$ terminates close to the target node's embedding $\mathbf{h}_v$, as quantified by the chosen distance metric. Due to its formulation, TransE is not able to account effectively for symmetric relationships or one-to-many interactions. Alternatively, RotatE [182] represents relations as rotations in a complex latent space. Thus, $\tau$ applies a rotation $\mathbf{h}_r$, corresponding to the relation, to the embedding vector of the source node $\mathbf{h}_u$ such that the rotated vector $\tau(\mathbf{h}_u, \mathbf{h}_r)$ lies close to the embedding vector $\mathbf{h}_v$ of the target node in terms of Manhattan distance. The authors demonstrate that rotations can correctly capture diverse relation classes, including symmetry/anti-symmetry.

Matrix factorisation is a common problem in mathematics that aims to approximate a matrix $X$ by a product of low-dimensional latent factors $F_i$, $i \in \{1, \ldots, k\}$. The general problem can be written as

$$\min_{F_1, \ldots, F_k} \Delta\Big(X, \prod_{i=1}^{k} F_i\Big),$$

where $\Delta(\cdot, \cdot)$ represents a measure of the distance between its two inputs, such as the Euclidean distance or the Kullback-Leibler divergence. In machine learning, matrix factorisation has been extensively used for unsupervised applications such as dimensionality reduction, missing data imputation, and clustering. These approaches are especially relevant to the knowledge graph embedding problem and have set state-of-the-art (SOTA) results on standard benchmarks [156].
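A minimal sketch, under illustrative assumptions, of the general factorisation objective just described: a matrix (here a random binary stand-in for, e.g., an adjacency matrix) is approximated by two latent factors fitted by gradient descent on a Euclidean loss. The dimensions, learning rate and number of steps are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix to approximate: a random binary matrix standing in for an adjacency matrix.
X = (rng.random((30, 30)) < 0.1).astype(float)

k = 8       # latent dimension (a user-defined constraint)
lr = 0.05   # learning rate
U = 0.1 * rng.standard_normal((30, k))  # latent factor for "source" entities
V = 0.1 * rng.standard_normal((30, k))  # latent factor for "target" entities

for step in range(500):
    # Residual of the low-rank reconstruction U V^T against X.
    residual = U @ V.T - X
    # Gradient descent on 0.5 * ||U V^T - X||_F^2.
    grad_U = residual @ V
    grad_V = residual.T @ U
    U -= lr * grad_U
    V -= lr * grad_V

print(np.linalg.norm(U @ V.T - X))  # reconstruction error after fitting
```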
For graphs, the objective is to factorise the adjacency matrix $A$, or a derivative of the adjacency matrix (e.g. the Laplacian matrix). This can effectively be seen as finding embeddings for all entities in the graph on a low-dimensional latent manifold, under user-defined constraints (e.g. the latent space dimension), such that the adjacency relationships are preserved under dot products.

Laplacian eigenmaps, introduced by Belkin et al. [9], is a fundamental approach designed to embed entities based on a similarity-derived graph. Laplacian eigenmaps uses the eigendecomposition of the Laplacian matrix of a graph to embed each of the nodes of a graph in a low-dimensional latent manifold. The spectral decomposition of the Laplacian is given by $L = Q \Lambda Q^\top$, where $\Lambda$ is a diagonal matrix with entries corresponding to the eigenvalues of $L$, and column $\mathbf{q}_i$ of $Q$ gives the eigenvector associated with the $i$th eigenvalue $\lambda_i$ (i.e. $L \mathbf{q}_i = \lambda_i \mathbf{q}_i$). Given a user-defined dimension $k \le n$, the embedding of node $v$ is given by the vector of entries corresponding to $v$ in the $k$ eigenvectors associated with the smallest non-trivial eigenvalues.

Nickel et al. [125] introduced RESCAL to address the knowledge graph embedding problem. RESCAL's objective function is defined as

$$\min_{U, \{R_r\}} \sum_{r \in \mathcal{R}} \big\lVert A_r - U R_r U^\top \big\rVert_F^2 + g\left(U, R_1, \ldots, R_{|\mathcal{R}|}\right),$$

where $U \in \mathbb{R}^{n \times k}$ and, $\forall r$, $R_r \in \mathbb{R}^{k \times k}$, with $k$ denoting the latent space dimension. The function $g(\cdot)$ denotes a regulariser, i.e. a function applying constraints on the free parameters of the model, on the factors $U$ and $\{R_1, \ldots, R_{|\mathcal{R}|}\}$. Intuitively, the factor $U$ learns the embedding of each entity in the graph, and the factor $R_r$ specifies the interactions between entities under relation $r$. Yang et al. [202] proposed DistMult, a variation of RESCAL that considers each factor $R_r$ as a diagonal matrix. Trouillon et al. [189] proposed ComplEx, a method extending DistMult to the complex space, taking advantage of the Hermitian product to represent asymmetric relationships.

Alternatively, some existing frameworks leverage both a graph's structural information and the nodes' semantic information to embed each entity in a way that preserves both sources of information. One such approach is to use a graph's structural information to regularise embeddings derived from the factorisation of the node feature matrix [24, 29]. The idea is to penalise adjacent entities in the graph so that they have closer embeddings in the latent space, according to some notion of distance. Another approach is to jointly factorise both data sources, for instance by introducing a kernel defined on the feature matrix [76, 77].

Graph Neural Networks (GNNs) were first introduced in the late 1990s [65, 112, 176] but have attracted considerable attention in recent years, with the number of variants rising steadily [28, 62, 69, 92, 107, 164, 191, 200, 201]. From a high-level perspective, GNNs are a class of models that learn how to diffuse information on graph-structured datasets for representation learning. A typical GNN layer comprises three functions: 1) a message passing function Msg that permits information exchange between nodes over edges; 2) an aggregation function Agg that combines the collection of received messages into a single, fixed-length representation; and 3) an update function Update that produces node-level representations given the previous representation and the aggregated messages. Common choices are a simple linear transformation for Msg, summation or simple or weighted averages for Agg, and multilayer perceptrons (MLPs) with activation functions for the Update function, although it is not uncommon for the Msg or Update function to be absent or reduced to an activation function only.
Where the node representations after layer $t$ are $\mathbf{h}_v^{(t)}$, we have

$$\mathbf{h}_v^{(t)} = \mathrm{Update}\Big(\mathbf{h}_v^{(t-1)},\, \mathrm{Agg}\big(\{\mathrm{Msg}(\mathbf{h}_v^{(t-1)}, \mathbf{h}_u^{(t-1)}, \mathbf{x}_{u,v}) \mid u \in \mathcal{N}_v\}\big)\Big)$$

or, more compactly,

$$\mathbf{h}_v^{(t)} = \gamma\Big(\mathbf{h}_v^{(t-1)},\, \square_{u \in \mathcal{N}_v}\, \phi\big(\mathbf{h}_v^{(t-1)}, \mathbf{h}_u^{(t-1)}, \mathbf{x}_{u,v}\big)\Big),$$

where $\gamma$, $\square$ and $\phi$ are the update, aggregation and message passing functions, respectively, and $(t)$ indicates the layer index [55]. The initial node representations, $\mathbf{h}_v^{(0)}$, are typically set to the node features, $\mathbf{x}_v$. Figure 2 gives a schematic representation of this operation.

Figure 2: Illustration of a general aggregation step performed by a graph neural network for the central node (green) based on its direct neighbours (orange). Messages may be weighted depending on their content, the source or target node features, or the attributes of the edge they are passed along, as indicated by the thickness of incoming arrows.

The graph convolutional network (GCN) [92] can be decomposed in this framework as a linear message function, a degree-normalised sum aggregation and an activation-only update,

$$\mathbf{h}_v^{(t)} = \sigma\Big(\sum_{u \in \mathcal{N}_v \cup \{v\}} \frac{1}{\sqrt{d_u d_v}}\, W^{(t)} \mathbf{h}_u^{(t-1)}\Big),$$

where $\sigma$ is some activation function, usually a rectified linear unit (ReLU). The scheme is simplified further by the addition of self-loops, that is, an edge from a node to itself, commonly expressed through the modified adjacency matrix $\tilde{A} = A + I$: the aggregation then includes the self-message (as written above) and the update reduces to $\mathrm{Update}(\ldots) = \sigma(\mathrm{agg}_v)$. With respect to the notation in Figure 2, for GCN we have $\alpha_{u,v} = 1/\sqrt{d_u d_v}$.

As the update depends only on a node's local neighbourhood, these schemes are also commonly referred to as neighbourhood aggregation. Indeed, taking a broader perspective, a single-layer GNN updates a node's features based on its immediate, or one-hop, neighbourhood. Adding a second GNN layer allows information from the two-hop neighbourhood to propagate via intermediary neighbours. By further stacking GNN layers, node features can come to depend on the initial values of more distant nodes, analogous to the broadening of the receptive field in later layers of convolutional neural networks: the deeper the network, the broader the receptive field (see Figure 3). However, this process is diffusive and leads to features washing out as the graph thermalises. This problem is solved in convolutional networks with pooling layers, but an equivalent canonical coarsening does not exist for irregular graphs.

Graph attention networks (GATs) [191] weight incoming messages with an attention mechanism, using multi-headed attention for training stability. Including self-loops, the message, aggregation and update functions are otherwise unchanged. Although the authors suggest the attention mechanism is decoupled from the architecture and should be task specific, in practice their original formulation is most widely used. The attention weights $\alpha_{u,v}$ are softmax normalised, that is,

$$\alpha_{u,v} = \frac{\exp(e_{u,v})}{\sum_{w \in \mathcal{N}_v} \exp(e_{w,v})},$$

where $e_{u,v}$ is the output of a single-layer feed-forward neural network without a bias (a projection) with LeakyReLU activations, which takes the concatenation of the transformed source- and target-node features as input, and where $\mathrm{LeakyReLU}(x) = \max(x, \beta x)$ with $0 \le \beta \le 1$.

Relational graph convolutional networks. At many scales of systems biology, the relationships between entities have a type, a direction, or both. For instance, the types of bonds between atoms, the binding of two proteins, and gene regulatory interactions are essential to understanding the systems in which they exist. This idea is expressed in the message passing framework with messages that depend on edge attributes. Relational Graph Convolutional Networks (R-GCNs) [164] learn separate linear transforms for each edge type, which can be viewed as casting the graph as a multiplex graph and operating GCN-like models independently on each layer, as shown in Figure 4.
The R-GCN model decomposes to

$$\mathbf{h}_v^{(t)} = \sigma\Big(W_0^{(t)} \mathbf{h}_v^{(t-1)} + \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_v^r} \frac{1}{c_{v,r}}\, W_r^{(t)} \mathbf{h}_u^{(t-1)}\Big)$$

for edge types $r \in \mathcal{R}$, with a separate transform $W_0^{(t)}$ for self-loops and a problem-specific normalisation constant $c_{v,r}$.

Pooling. Geometric deep learning approaches machine learning with graph-structured data as the generalisation of methods designed for learning with grid and sequence data (images, time-series; Euclidean data) to non-Euclidean domains, i.e. graphs and manifolds [21]. This is also reflected in the derivation and naming conventions of popular graph neural network layers as generalised convolutions [92, 164]. Modern convolutional neural networks have settled on the combination of layers of 3 × 3 kernels interspersed with 2 × 2 max-pooling. Developing a corresponding pooling workhorse for GNNs is an active area of research. The difficulty is that, unlike Euclidean data structures, there are no canonical up- and down-sampling operations for graphs. As a result, there are many proposed methods that centre around learning to pool or prune based on features [14, 25, 60, 98, 206], and learned or non-parametric structural pooling [11, 17, 85, 104]. However, the distinction between featural and structural methods is blurred when topological information is included in the node features.

The most successful feature-based methods extract representations from graphs directly, either for cluster assignments [14, 206] or top-k pruning [25, 60, 98]. DiffPool uses GNNs both to produce a hierarchy of representations for overall graph classification and to learn intermediate representations for soft cluster assignments to a fixed number of pools [206]. Figure 5 presents an example of this kind of pooling. Top-k pooling takes a similar approach, but instead of using an auxiliary learning process to pool nodes, it is used to prune nodes [25, 60]. In many settings, this simpler method is competitive with DiffPool at a fraction of the memory cost.

Structure-based pooling methods aggregate nodes based on the graph topology and are often inspired by the processes developed by chemists and biochemists for understanding molecules through their parts. For example, describing a protein in terms of its secondary structure (α-helices, β-sheets) and the connectivity between these elements can be seen as a pooling operation over the protein's molecular graph. Figure 6 shows how a small molecule can be converted to a junction tree representation, with the carbon ring (in pink) being aggregated into a single node. Work on decomposing molecules into motifs bridges the gap between handcrafted secondary structures and unconstrained learning methods [85]. Motifs are extracted based on a combined statistical and chemical analysis, where motif templates (i.e. graph substructures) are selected based on how frequently they occur in the training corpus, and molecules are then decomposed into motifs according to some chemical rules. More general methods look to concepts from graph theory, such as minimum cuts [11, 17] and maximal cliques [104], on which to base pooling. Minimum cuts are graph partitions that minimise some objective and have obvious connections to graph clustering, whilst cliques (subsets of nodes that are fully connected) are in some sense at the limit of node community density.

Figure 6: Illustration of a. the molecule aspirin, b. its basic graph representation, and c. the associated junction tree representation. Colours on the nodes correspond to atom types.
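To summarise the message-passing machinery of this section, below is a minimal NumPy sketch (not the original authors' code) of a GCN-style layer with self-loops and symmetric normalisation, stacked twice and followed by a permutation-invariant global sum-pooling readout; the toy graph and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, H, W):
    """One GCN-style layer: add self-loops, symmetrically normalise, aggregate, transform."""
    A_tilde = A + np.eye(A.shape[0])            # self-loops so each node keeps its own message
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # weights alpha_{u,v} = 1 / sqrt(d_u d_v)
    return relu(A_hat @ H @ W)                  # aggregate neighbours, then update

# Toy graph: 5 nodes with 3-dimensional input features.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((5, 3))

W1 = rng.standard_normal((3, 8))
W2 = rng.standard_normal((8, 8))

H = gcn_layer(A, X, W1)          # one-hop information
H = gcn_layer(A, H, W2)          # two-hop information after stacking a second layer
graph_embedding = H.sum(axis=0)  # permutation-invariant global sum pooling (readout)
print(graph_embedding.shape)     # (8,)
```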
The process of discovering a drug and making it available to patients can take up to 10 years and is characterised by failures, or attrition, see Figure 1. The early discovery stage involves target identification and validation, hit discovery, lead molecule identification and then optimisation to achieve the desired characteristics of a drug candidate [78]. Pre-clinical research typically comprises both in vitro and in vivo assessments of the toxicity, pharmacokinetics (PK), pharmacodynamics (PD), and efficacy of the drug. Provided good pre-clinical evidence is presented, the drug then progresses to human clinical trials, normally through three different phases. In the following subsections, we explore how GML can be applied to distinct stages within the drug discovery and development process.

Target identification is the search for a molecular target with a significant functional role(s) in the pathophysiology of a disease, such that a hypothetical drug could modulate said target, culminating in a beneficial effect [163, 187]. Early targets included G-protein coupled receptors (GPCRs), kinases, and proteases, which formed the major target protein families among first-in-class drugs [50]; targeting other classes of biomolecules, e.g. nucleic acids, is also possible. For an organ-specific therapy, an ideal target should be strongly and preferentially expressed in the tissue of interest, and preferably a three-dimensional structure should be obtainable for biophysical simulations.

There is a range of complementary lines of experimental evidence that could support target identification. For example, a phenomenological approach to target identification could consider the imaging, histological, or -omic presentation of diseased tissue when compared to matched healthy samples. Typical differential presentation includes chromosomal aberrations (e.g. from WGS), differential expression (e.g. via RNA-seq) and protein translocation (e.g. from histopathological analysis) [135]. As the availability of -omic technologies increases, computational and statistical advances must be made to integrate and interpret large quantities of high-dimensional, high-resolution data on a comparatively small number of samples, occasionally referred to as panomics [110, 161]. In contrast to a static picture, in vitro and in vivo models are built to examine the dynamics of the disease phenotype and to study mechanism. In particular, genes are manipulated in disease models to understand key drivers of a disease phenotype. For example, random mutagenesis could be induced by chemicals or transposons in cell clones or mice to observe the phenotypic effect of perturbing certain cellular pathways at the putative target protein [187]. As a targeted approach, bioengineering techniques have been developed to either silence mRNA or remove the gene entirely through genetic editing. In modern times, CRISPR is being used to knock out genes in a cleaner manner than prior technologies, e.g. siRNA, shRNA, and TALENs [15, 141, 173]. Furthermore, innovations have led to CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa), which allow for suppression or overexpression of target genes [97]. To complete the picture, biochemical experiments observe chemical and protein interactions to inform on possible drug mechanisms of action [163]; examples include affinity chromatography, a range of mass spectrometry techniques for proteomics, and drug affinity responsive target stability (DARTS) assays [40, 103, 132].
X-ray crystallography and cryogenic electron microscopy (cryo-EM) can be used to detail the structures of proteins and identify druggable pockets [170]; computational approaches can be used to assess the impacts of mutations in cancer resulting in perturbed crystal structures [106]. Yeast two-hybrid or three-hybrid systems can be employed to detail genomic protein-protein or RNA-protein interactions [68, 101]. Systems biology aims to unify phenomenological observations on disease biology (the "-omics view") and genetic drivers of phenotypes (identified through bioengineering) through a network view of interacting biomolecules [23]. The ultimate goal is to pin down a "druggable" point of intervention that could hopefully reverse the disease condition. One of the outcomes of this endeavour is the construction of signalling pathways; for example, the characterisation of the TGF-β, PI3K/AKT and Wnt-dependent signalling pathways has had profound impacts on oncology drug discovery [3, 74, 83].

In contrast to complex diseases, target identification for infectious disease requires a different philosophy. After eliminating pathogenic targets structurally similar to those within the human proteome, one aims to assess the druggability of the remaining targets. This may be achieved using knowledge of the genome to model the constituent proteins when 3D structures are not already available experimentally. The Blundell group has shown that 70-80% of the proteins from Mycobacterium tuberculosis and a related organism, Mycobacterium abscessus (infecting cystic fibrosis patients), can be modelled via homology [130, 171]. By examining the potential binding sites, such as the catalytic site of an enzyme or an allosteric regulatory site, the binding hotspot can be identified and the potential value as a target estimated [12]. Of course, target identification is also dependent on the accessibility of the target to a drug, as well as the presence of efflux pumps and the metabolism of any potential drug by the infectious agent.

4.1.1 From systems biology to machine learning on graphs. Organisms, or biological systems, consist of complex and dynamic interactions between entities at multiple scales. At the submolecular level, proteins are chains of amino acid residues which fold to adopt highly specific conformational structures. At the molecular scale, proteins and other biomolecules physically interact through transient and long-timescale binding events to carry out regulatory processes and perform signal amplification through cascades of chemical reactions. By starting with a low-resolution understanding of these biomolecular interactions, canonical sequences of interactions associated with specific processes become labelled as pathways that ultimately control cellular functions and phenotypes. Within multicellular organisms, cells interact with each other, forming diverse tissues and organs. A reductionist perspective of disease is to view it as the result of perturbations of the cellular machinery at the molecular scale that manifest through aberrant phenotypes at the cellular and organismal scales. Within target identification, one is aiming to find nodes that, upon manipulation, lead to a causal sequence of events resulting in the reversion from a diseased to a healthy state. It seems plausible that target identification will be the greatest area of opportunity for machine learning on graphs.
From a genetics perspective, drug targets whose coding variants are linked to Mendelian traits or genome-wide association study (GWAS) hits have a greater chance of success in the clinic [91, 123]. However, when examining protein-protein interaction networks, Fang et al. [52] found that various protein targets were not themselves "genetically associated", but interacted with other proteins with genetic associations to the disease in question. For example, in the case of rheumatoid arthritis (RA), tumour necrosis factor (TNF) inhibition is a popular drug mechanism of action with no genetic association, but the interacting proteins of TNF, including CD40, NFKBIA, REL, BIRC3, and CD8A, are known to exhibit a genetic predisposition to RA.

Oftentimes, systems biology has focused on networks with static nodes and edges, ignoring a faithful characterisation of the underlying biomolecules that the nodes represent. With GML, we can account for much richer representations of biology across multiple relevant scales, for example, graph representations of molecular structures (discussed in Sections 4.2 and 4.3), functional relationships within a knowledge graph (discussed in Section 4.4), and the expression of biomolecules. Furthermore, GML can learn graphs from data as opposed to relying on pre-existing incomplete knowledge [88, 196].

Figure: Green nodes indicate proteins; orange nodes correspond to drugs. Green links represent protein-protein interactions, while gray edges indicate drug-target interactions.

Early work utilising GML for target identification includes Pittala et al. [145], whereby a knowledge graph link prediction approach was used to beat the in-house algorithms of Open Targets [26] and rediscover drug targets in clinical trials for Parkinson's disease. The utilisation of multi-omic expression data capturing instantaneous multimodal snapshots of cellular states will play a significant role in target identification as costs decrease [126, 160], particularly in a precision medicine framework [161]. Currently, however, only a few panomic datasets are publicly available. A small number of early adopters have spotted the clear utility in employing GML [195], occasionally in a multimodal learning [105, 124] or causal inference [143] setting. These approaches have helped us move away from the classical Mendelian "one gene, one disease" philosophy and appreciate the true complexity of biological systems.

Drug design broadly falls into two categories: phenotypic drug discovery and target-based drug discovery. Phenotypic drug discovery (PDD) begins with a disease's phenotype without having to know the drug target. Without the bias from having a known target, PDD has yielded many first-in-class drugs with novel mechanisms of action [185]. It has been suggested that PDD could provide the new paradigm of drug design, saving substantial costs and increasing productivity [116]. However, drugs found by PDD are often pleiotropic and impose greater safety risks when compared to target-oriented drugs. In contrast, best-in-class drugs are usually discovered by a target-based approach. For target-based drug discovery, after target identification and validation, "hit" molecules are identified via high-throughput screening of compound libraries against the target [78], typically resulting in a large number of possible hits. These are grouped into "hit series" and further refined in functional in vitro assays.
Ultimately, only those selected via secondary in vitro assays and in vivo models become the drug "leads". With each layer of screening and assays, the remaining compounds should be more potent and selective against the therapeutic target. Finally, lead compounds are optimised by structural modifications to improve properties such as PK/PD, typically using heuristics, e.g. Lipinski's rule of five [102]. In addition to such structure-based approaches, fragment-based (FBDD) [13, 121] and ligand-based drug discovery (LBDD) [1, 51] have also been popular. FBDD enhances ligand efficiency and binding affinity with fragment-like leads of approximately 150 Da, whilst LBDD does not require 3D structures of the therapeutic targets. Both phenotypic and target-based drug discovery come with their own risks and merits. While the operational costs of target identification can be avoided in PDD, developing suitable phenotypic screening assays for the disease could be more time-consuming and costly [116, 212]. Hence, the overall timeline and capital costs are roughly the same [212]. In this review, we make no distinction between new chemical entities (NCE), new molecular entities (NME), and new active substances (NAS) [19].

For a drug, the base graph representation is obtained from the molecule's SMILES signature and captures bonds between atoms, i.e. each node of the graph corresponds to an atom and each edge stands for a bond between two atoms [49, 93]. The features associated with an atom typically include its element, valence, and degree. Edge features include the associated bond's type (single, double, triple), its aromaticity, and whether or not it is part of a ring. Additionally, Klicpera et al. [93] consider the geometric length of a bond and the geometric angles between bonds as additional features. This representation is used in most applications, sometimes complemented or augmented with heuristic approaches.

To model a graph structure, Jin et al. [84] used the base graph representation in combination with a junction tree derived from it. To construct the junction tree, the authors first define a set of molecular substructures, such as rings. The graph is then decomposed into overlapping components, each corresponding to a specific substructure. Finally, the junction tree is defined with each node corresponding to an identified component and each edge associating overlapping components. Jin et al. then extended their previous work by using a hierarchical representation with varying levels of coarseness of the small molecule [85]. The proposed representation has three levels: 1) an atom layer, 2) an attachment layer, and 3) a motif layer. The first level is simply the basic graph representation. The following levels provide the coarse and fine-grained connection patterns between a molecule's motifs. Specifically, the attachment layer describes at which atoms two motifs connect, while the motif layer only captures whether two motifs are linked. Considering a molecule's base graph representation $G = (\mathcal{V}, \mathcal{E})$, a motif is defined as a subgraph of $G$ induced on atoms in $\mathcal{V}$ and bonds in $\mathcal{E}$. Motifs are extracted from a molecule's graph by breaking bridge bonds. Kajino [87] opted for a hypergraph representation of small molecules. A hypergraph is a generalisation of graphs in which an edge, called a hyperedge, can connect any number of nodes. In this setting, a node of the hypergraph corresponds to a bond between two atoms of the small molecule. In contrast, a hyperedge then represents an atom and connects all its bonds (i.e. nodes).
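As a hedged illustration of the base graph representation described above, the following sketch uses RDKit (assumed available) to turn a SMILES string into node and edge lists carrying the element/valence/degree and bond-type/aromaticity/ring features mentioned in the text; the exact feature set is a simplification rather than any specific paper's featurisation.

```python
from rdkit import Chem

def molecule_to_graph(smiles):
    """Convert a SMILES string to node and edge lists with simple atom and bond features."""
    mol = Chem.MolFromSmiles(smiles)
    # One node per atom, with element, valence and degree as features.
    node_features = [(atom.GetSymbol(), atom.GetTotalValence(), atom.GetDegree())
                     for atom in mol.GetAtoms()]
    # One edge per bond, with bond type, aromaticity and ring membership as features.
    edges, edge_features = [], []
    for bond in mol.GetBonds():
        edges.append((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        edge_features.append((str(bond.GetBondType()), bond.GetIsAromatic(), bond.IsInRing()))
    return node_features, edges, edge_features

# Aspirin, as in Figure 6.
nodes, edges, bonds = molecule_to_graph("CC(=O)OC1=CC=CC=C1C(=O)O")
print(len(nodes), "atoms,", len(edges), "bonds")
```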
Pharmaceutical companies may screen millions of small molecules against a specific target, e.g. see GlaxoSmithKline's DNA-encoded small molecule library of 800 million entries [35]. However, as the end result will be optimised by a skilled medicinal chemist, one should aim to substantially cut down the search space by screening only a representative selection of molecules for optimisation later. One route towards this is to select molecules with heterogeneous chemical properties using GML approaches. This is a popular task with well-established benchmarks such as QM9 [151] and MD17 [32], and the top-performing methods are based on graph neural networks.

For instance, using a graph representation of drugs, Duvenaud et al. [49] have shown substantial improvements over non-graph-based approaches for molecular property prediction tasks. Specifically, the authors used GNNs to embed each drug and tested the predictive capabilities of the model on diverse benchmark datasets. They demonstrated improved interpretability of the model and predictive superiority over previous approaches, which relied on circular fingerprints [63]. The authors use a simple GNN layer with a read-out function on the output of each GNN layer that updates the global drug embedding. Alternatively, Schütt et al. [165] introduced SchNet, a model that characterises molecules based on their representation as a list of atoms with interatomic distances, which can be viewed as a fully connected graph. SchNet learns embeddings for each atom using two modules: 1) an atom-wise module, and 2) an interaction module. The former applies a simple MLP transformation to each atom representation input, while the latter updates the atom representation based on the representations of the other atoms of the molecule, using relative distances to modulate contributions. The final molecule representation is obtained with a global sum-pooling layer over all atoms' embeddings. With the same objective in mind, Klicpera et al. [93] recently introduced DimeNet, a novel GNN architecture that diverges from the standard message passing framework presented in Section 3. DimeNet defines a message coefficient between atoms based on their relative positioning in 3D space. Specifically, the message from node $u$ to node $v$ is iteratively updated based on $u$'s incoming messages, as well as the distances between atoms and the angles between atomic bonds. DimeNet thus relies on richer geometric features, considering both the angles between different bonds and the distances between atoms. The authors report substantial improvements over SOTA models for the prediction of molecular properties on two benchmark datasets.

Most relevant to the later stages of preclinical work, Feinberg et al. extended previous work on molecular property prediction [54] to include ADME properties [53]. In this scenario, using only the structures of drugs, predictions were made across a diverse range of experimental observables, including half-lives in in vivo models (rat, dog), hERG protein interactions, and IC50 values for common liver enzymes predictive of drug toxicity.

In the approaches above, the chemical properties of interest were defined a priori. In contrast, Stokes et al. [179] leveraged results from a small phenotypic growth inhibition assay of 2,335 molecules against Escherichia coli to infer antibiotic properties of the ZINC15 collection of >107 million molecules. After ranking and curating hits, only 23 compounds were experimentally tested, leading to halicin being identified.
Of particular note, the Tanimoto similarity of halicin when compared to its nearest-neighbour antibiotic, metronidazole, was only ~0.21, demonstrating the ability of the underlying ML to generalise to diverse structures. Testing halicin against a range of bacterial infections, including Mycobacterium tuberculosis, demonstrated broad-spectrum activity through selective dissipation of the ΔpH component of the proton motive force. In a world first, Stokes et al. showed the efficacy of an AI-identified molecule in vivo (Acinetobacter baumannii-infected neutropenic BALB/c mice), beating the standard-of-care treatment (metronidazole) [179].

A more challenging task than those previously discussed is the de novo design of small molecules from scratch; that is, for a fixed target (typically represented via a 3D structure), can one design a suitable and selective drug-like entity? In a landmark paper, Zhavoronkov et al. [211] created novel chemical matter against discoidin domain receptor 1 (DDR1) using a variational autoencoder-style architecture. Notably, they penalise the model for selecting structures similar to drugs disclosed in the patent literature. As the approach was designed to find small molecules for a well-known target, crystal structures were available and subsequently utilised. Additionally, the ZINC dataset containing hundreds of millions of structures was used (unlabelled data), along with confirmed positive and negative hits for DDR1. In total, 6 compounds were synthesised, with 4 attaining <1 µM IC50 values. Whilst selectivity for DDR1 over DDR2 was shown for 2 molecules, selectivity against a larger panel of off-targets was not shown. Whilst further development (e.g. PK or toxicology testing) was not shown, Zhavoronkov et al. demonstrated de novo small molecule design in an experimental setting [211]. Arguably, the recommendation of an existing molecule is a simpler task than designing one from scratch.

New biological entities (NBEs) refer to biological products, or biologics, that are produced in living systems [168]. Biologics are highly diverse, ranging from proteins (>40 amino acids), peptides, and antibodies, to cell and gene therapies. Therapeutic proteins tend to be large, have complex structures, and are unstable in contrast to small molecules [138]. Biologic therapies typically use cell-based production systems that are prone to post-translational modification and are thus sensitive to environmental conditions, requiring mass spectrometry to characterise the resulting heterogeneous collection of molecules [115]. In general, the target-to-hit-to-lead pathway also applies to NBE discovery, with similar procedures such as high-throughput screening assays. Typically, an affinity-based high-throughput screening method is used to select from a large library of candidates using one target. One must then separately study off-target binding from similar proteins, peptides, and immune surveillance [95].

Focusing on proteins, the consensus for deriving the protein graph representation is to use pairwise spatial distances between amino acid residues, i.e. the protein's contact map, and to apply an arbitrary cut-off or Gaussian filter to derive adjacency matrices [57, 64, 209], see Figure 8. However, protein structures are substantially more complex than small molecules and, as such, there are several resulting graph construction schemes.
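A minimal sketch of the distance-cut-off construction just described, assuming an array of residue coordinates (e.g. Cα positions parsed from a PDB file) is already available; the 10 Å threshold mirrors the value used in Figure 8, and the random coordinates below are placeholders only.

```python
import numpy as np

def residue_graph(ca_coords, cutoff=10.0):
    """Adjacency matrix of a residue-level graph: connect residues whose
    C-alpha atoms lie within `cutoff` angstroms of each other."""
    coords = np.asarray(ca_coords)
    # Pairwise Euclidean distances between residues (the contact-map distances).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    A = (dist < cutoff).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops at construction time
    return A

# Placeholder coordinates standing in for parsed C-alpha positions of a structure.
rng = np.random.default_rng(0)
fake_ca = rng.uniform(0.0, 30.0, size=(50, 3))
A = residue_graph(fake_ca, cutoff=10.0)
print(A.shape, int(A.sum() // 2), "contacts")
```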
For instance, residue-level graphs can be constructed by representing the intramolecular interactions, such as hydrogen bonds, that compose the structure as edges joining their respective residues. This representation has the advantage of explicitly encoding the internal chemistry of the biomolecule, which determines structural aspects such as dynamics and conformational rearrangements. Other edge constructions can be distance-based, such as K-NN (where a node is joined to its K most proximal neighbours) [57, 209], or based on Delaunay triangulation. Node features can include structural descriptors, such as solvent accessibility metrics, encodings of the secondary structure, the distance from the centre or surface of the structure, and low-dimensional embeddings of the physicochemical properties of amino acids. It is also possible to represent proteins as large molecular graphs at the atomic level, in a similar manner to small molecules. Due to the plethora of graph construction and featurisation schemes available, tools are being made available to facilitate the pre-processing of protein structures [82].

Figure 8: Illustration of a. a protein (PDB accession: 3EIY) and b. its graph representation derived based on intramolecular distances with the cut-off threshold set at 10 Å.

One should note that sequences can be considered as special cases of graphs and are compatible with graph-based methods. However, in practice, language models are preferred to derive protein embeddings from amino acid sequences [152, 155]. Recent works suggest that combining the two can increase the information content of the learnt representations [64, 80]. Several recurrent challenges in the scientific community aim to push the limits of current methods. For instance, the CAFA [148] and CAPRI [99] challenges aim to improve protein functional classification and protein-protein interaction prediction, respectively.

Display technologies have driven the modern development of NBEs; in particular, phage display and yeast display are widely used for the generation of therapeutic antibodies. In general, a peptide or protein library with diverse sequence variety is generated by PCR or other recombination techniques [59]. The library is "displayed" for genotype-phenotype linkage, such that the protein is expressed and fused to surface proteins while the encoding gene is still encapsulated within the phage or cell. Therefore, the library can be screened and selected against the target (e.g. an antigen) according to binding affinity, in a process coined "biopanning". Thereafter, the selected peptides are further optimised by repeating the process with a refined library. In phage display, selection works by physical capture and elution [128]; for cell-based display technologies (like yeast display), fluorescence-activated cell sorting (FACS) is utilised for selection [18]. Due to the repeated iteration of experimental screens and the high number of outputs, such display technologies are now being coupled to ML systems for greater speed, affinity and further in silico selection [154, 204]. As of yet, it does not appear that advanced ML architectures have been applied in this domain, but promising routes forward have recently been developed. For example, Hawkins-Hooker et al. [71] trained multiple variational autoencoders on the amino acid sequences of 70,000 luciferase-like oxidoreductases to generate new functional variants of the luxA bacterial luciferase.
Testing these experimentally led to variants with increased enzyme solubility without disrupting function. Using this philosophy, one has a grounded approach to refine libraries for a directed evolution screen. Some proteins have reliable scaffolds that one can build upon, for example antibodies, whereby one could modify the variable region but leave the constant region intact. For example, Deac et al. [44] used dilated (à trous) convolutions and self-attention on antibody sequences to predict the paratope (the residues on the antibody that interact with the antigen), as well as a cross-modal attention mechanism between the antibody and antigen sequences. Crucially, the attention mechanisms also provide a degree of interpretability to the model.

In a protein-agnostic context, Fout et al. [57] used Siamese architectures [20] based on GNNs to predict which amino acid residues are involved in the interface of a protein-protein complex. Each protein is represented as a graph where nodes correspond to amino acid residues and edges connect each residue to its closest residues. The authors propose multiple aggregation functions of varying complexity, following the general principles of DCNNs [4]. The output embeddings of each residue of both proteins are concatenated all-to-all, and the objective is to predict whether two residues are in contact in the protein complex based on their concatenated embeddings. The authors report a significant improvement over the method without the GNN layers, i.e. directly using the amino acid residue sequence and structural properties (e.g. solvent accessibility, distance from the surface). Gainza et al. [58] recently introduced Molecular Surface Interaction Fingerprinting (MaSIF) for tasks such as protein-ligand prediction, protein pocket classification, and protein interface site prediction. The approach is based on GML applied to mesh representations of the solvent-excluded protein surface, abstracting away the underlying sequence and internal chemistry of the biomolecule. In practice, MaSIF first discretises a protein surface with a mesh, where each point (vertex) is considered as a node in the graph representation. The protein surface is then decomposed into small overlapping patches based on a geodesic radius, i.e. clusters of the graph. For each patch, geometric and chemical features are handcrafted for all nodes within the patch. The patches serve as bases for learnable Gaussian kernels [117] that locally average node-wise patch features and produce an embedding for each patch. The resulting embeddings are fed to task-dependent decoders that, for instance, give patch-wise scores indicating whether a patch overlaps with an actual protein binding site.

One of the great ambitions of bioengineering is to design proteins from scratch. In this case, one may have an approximate structure in mind, e.g. to inhibit the function of another endogenous biomolecule. This motivates the inverse protein folding problem: identifying a sequence that can produce a pre-determined protein structure. For instance, Ingraham et al. [80] leveraged an autoregressive self-attention model using graph-based representations of structures to predict corresponding sequences. Strokach et al. [180] leveraged a deep graph neural network to tackle protein design as a constraint satisfaction problem. Predicted structures resulting from novel sequences were initially assessed in silico using molecular dynamics and energy-based scoring.
With a novel amino acid sequence that could generate the desired shape of an arbitrary protein, one would then want to identify potential wanted and unwanted effects via functional annotations. These include enzyme commission (EC) numbers, a hierarchy of labels characterising the reactions that an enzyme catalyses, previously studied by the ML community [41, 159]. Zamora et al. [209] developed a pipeline based on graph representations of a protein's amino acid residues for structure classification. The model consists of the sequential application of graph convolutional blocks. A block takes as input two matrices corresponding to residue features and coordinates, respectively. The block first uses a layer that learns Gaussian filters applied to the protein's spatial distance kernel, hence deriving multiple graph adjacency matrices. These are then used as input graphs in a GNN layer which operates on the residue features. The block then performs a 1D average pooling operation on both the feature matrix and the input coordinate matrix, yielding the outputs of the block. After the last block, the final feature matrix is fed to a global attention pooling layer which computes the final embedding of the protein, used as input to an MLP for classification. The model performs on par with existing 3D-CNN-based models. However, the authors observe that it is more interpretable, enabling the identification of substructures that are characteristic of structural classes.

Recently, Gligorijevic et al. [64] proposed DeepFRI, a model that predicts a protein's functional annotations based on structure and sequence information. The authors define the graph representation of a protein based on the contact map between its residues. They first use a pre-trained language module to derive residue-level features from the sequence; each protein's graph, carrying these features on its nodes, is then fed to multiple GCN layers [92]. The outputs of each GCN layer are concatenated to give the final embedding of the protein, which is fed to an MLP layer giving the final functional predictions (see the sketch below). The authors report substantial improvements over SOTA methods. Furthermore, they highlight the interpretability of their approach, demonstrating the ability to associate specific annotations with particular structures.
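The following is a simplified outline of this type of structure-and-sequence pipeline: residue features from a pre-trained language model (assumed precomputed here) are propagated over the contact-map graph, the per-layer outputs are concatenated, pooled over residues, and mapped to functional annotation logits. It is a sketch in the spirit of DeepFRI rather than a reproduction of its exact architecture; all class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)

    def forward(self, x, adj):
        # adj: [N, N] row-normalised contact-map adjacency with self-loops
        return torch.relu(self.linear(adj @ x))

class FunctionPredictor(nn.Module):
    """Stacked GCNs over a residue contact map; per-layer outputs are
    concatenated, pooled over residues, and mapped to annotation logits."""
    def __init__(self, dim_lm, dim_hidden, n_layers, n_annotations):
        super().__init__()
        dims = [dim_lm] + [dim_hidden] * n_layers
        self.gcns = nn.ModuleList(
            [GCNLayer(a, b) for a, b in zip(dims[:-1], dims[1:])]
        )
        self.classifier = nn.Sequential(
            nn.Linear(n_layers * dim_hidden, dim_hidden),
            nn.ReLU(),
            nn.Linear(dim_hidden, n_annotations),
        )

    def forward(self, lm_features, adj):
        x, per_layer = lm_features, []
        for gcn in self.gcns:
            x = gcn(x, adj)
            per_layer.append(x)
        h = torch.cat(per_layer, dim=-1)     # [N, n_layers * dim_hidden]
        protein_embedding = h.mean(dim=0)    # simple mean pooling over residues
        return self.classifier(protein_embedding)  # e.g. GO term / EC logits
```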
The term drug repurposing describes the use of an existing drug, whether approved or still in development, for an indication other than the one originally intended. Considering that only 12% of drugs that reach clinical trials receive FDA approval, repurposed drugs offer an attractive alternative to new therapies, as they are likely to have shorter development times and higher success rates with the early stages of drug discovery already completed. It has been estimated that repurposed treatments account for approximately 30% of newly FDA-approved drugs and their associated revenues [144], and that up to 75% of entities could be repurposed [129]. Note that we incorporate product line extensions (PLEs) within drug repurposing, whereby one wishes to identify secondary indications, different formulations for an entity, or partner drugs for combination therapies. As such, there is major interest in using in silico methods to screen and infer new treatment hypotheses [75].

Drug repurposing relies on finding new indications for existing molecules, either by identifying actionable pleiotropic activity (off-target repurposing), similarities between diseases (on-target repurposing), or synergistic combinations of therapies (combination repurposing). Well-known examples include: ketoconazole (Nizoral), used to treat fungal infections via the enzyme CYP51A1 and now used to treat Cushing syndrome via off-target interactions with CYP17A1 (steroid synthesis/degradation), NR3C4 (androgen receptor), and NR3C1 (glucocorticoid receptor); sildenafil (Viagra), originally intended for pulmonary arterial hypertension and on-target repurposed to treat erectile dysfunction; and Genvoya, a combination of emtricitabine, tenofovir alafenamide (both reverse transcriptase inhibitors), elvitegravir (an integrase inhibitor), and cobicistat (a CYP3A inhibitor to improve PK) used to treat human immunodeficiency virus (HIV).

Estimates suggest that each small molecule may interact with tens, if not hundreds, of proteins [213]. Due to small-molecule pleiotropy, particularly among first-in-class drugs [116], off-targets of existing drugs can be a segue to finding new indications. A variety of traditional techniques are used to identify missing drug-target interactions; for instance, when the structure of the protein is known, these stem from biophysical simulations, e.g. molecular docking or molecular dynamics. Depending on how one thresholds sequence similarity between the ∼21,000 human proteins, structural coverage, whereby a 3D structure of the protein exists, ranges from 30% (≥ 98% seq. sim.) to 70% (≥ 30% seq. sim.) [174]. However, ∼34% of proteins are classified as intrinsically disordered proteins (IDPs) with no fixed 3D structure [46, 190]. Moreover, drugs seldom have a fixed shape due to the presence of rotatable bonds. There is now a growing body of GML literature inferring missing drug-target interactions both with and without relying on the availability of a 3D protein structure.

Requiring protein structures, Torng et al. [188] focused on the task of associating drugs with the protein pockets they can bind to. Drugs are represented based on their atomic structures (see the molecular-graph sketch below), and protein pockets are characterised by a set of key amino acid residues connected based on Euclidean distance. Drug embeddings are obtained with the GNN operator from Duvenaud et al. [49]. To derive embeddings of protein pockets, the authors first use two successive graph autoencoders with the purposes of 1) deriving a compact feature vector for each residue, and 2) deriving a graph-level representation of the protein pocket itself. These autoencoders are pre-trained, with the encoder of the first serving as input to the second. Both encoders are then used as input layers of the final model. The association prediction between a drug and a protein pocket is obtained by feeding the concatenation of the drug and pocket representations to an MLP layer. The authors report improved performance against the previous SOTA model, a 3D-CNN operating on a grid-structured representation of the protein pocket [149].
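For the drug side of such models, the atomic graph can be built directly from a SMILES string. The sketch below uses RDKit (assumed to be installed) with a deliberately minimal atom and bond featurisation; production models use considerably richer descriptors.

```python
from rdkit import Chem

def smiles_to_graph(smiles):
    """Convert a SMILES string into a simple (node features, edge list) graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Minimal per-atom features: atomic number, degree, aromaticity flag.
    node_features = [
        (atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic()))
        for atom in mol.GetAtoms()
    ]
    # Undirected bonds stored as two directed edges, labelled by bond order.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        order = bond.GetBondTypeAsDouble()
        edges += [(i, j, order), (j, i, order)]
    return node_features, edges

# Example: caffeine
features, edges = smiles_to_graph("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(len(features), "atoms,", len(edges) // 2, "bonds")
```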
A range of GML methods for drug-target interaction prediction do not require protein structure. For instance, Gao et al. [61] use two encoders to derive embeddings for proteins and drugs, respectively. For the first encoder, recurrent neural networks are used to derive an embedding matrix of the protein based on its sequence and functional annotations. For the second encoder, each drug is represented by its underlying graph of atoms, and the authors use GNNs, specifically three layers of GIN [200], to extract an embedding matrix of the graph. Finally, a global attention pooling mechanism is used to extract vector embeddings for both drugs and proteins from their matrix embeddings. The two resulting vectors are fed into a Siamese neural network [20] to predict their association score. The proposed approach is especially successful compared to baselines on cold-start problems, where the protein and/or drug is not present in the training set.

Alternatively, Nascimento et al. [122] introduce KronRLS-MKL, a method that casts drug-target interaction prediction as a link prediction task on a bipartite graph capturing drug-protein binding. The authors define multiple kernels capturing either drug similarities or protein similarities based on multiple sources of data, and pose the optimisation as a multiple kernel learning problem. Specifically, the authors use the Kronecker operator to obtain a kernel between drug-protein pairs. The kernel is then used to predict a drug-protein association based on its similarity to existing drug-target links. Crichton et al. [39] cast the task in the same setting. However, the authors use existing embedding methods, including node2vec [66], DeepWalk [142], and LINE [186], to embed nodes in a low-dimensional space such that the embeddings capture the local graph topology. The authors feed these embeddings to a machine learning model trained to predict interactions, the underlying assumption being that a drug will be embedded close to its protein targets. Similarly, Olayan et al. [131] propose DDR to predict drug-target interactions. The authors first build a graph where each node represents either a drug or a protein. In addition to drug-protein edges, an edge between two drugs (or two proteins) represents their similarity according to a predefined heuristic drawing on multiple data sources. DDR embeds each drug-protein pair based on the number of paths of predefined types that connect them within the graph. The resulting embeddings are fed to a random forest algorithm for drug-target prediction.

On-target repurposing takes a holistic perspective and uses the known targets of a drug to infer new putative indications based on diverse data. For instance, one can identify functional relationships between a drug's targets and genes associated with a disease. One may also look for similarities between diseases, especially those occurring in different tissues. Hypothetically, one could prospectively find repurposing candidates in the manner of Fang et al. [52] by finding missing protein-protein interactions between a genetically validated target and a drug's primary target. Knowledge graph completion approaches have been particularly effective in addressing these tasks. For instance, Yang et al. [205] introduced BNNR. The authors build a block matrix from a drug similarity matrix, a disease similarity matrix, and a disease-drug indication matrix. The method is based on matrix completion via the Singular Value Thresholding (SVT) algorithm applied to the block matrix (a generic SVT sketch is given below). BNNR incorporates regularisation terms to balance approximation error and matrix rank, which helps handle noisy drug-drug and disease-disease similarities, and adds a constraint that clips the association scores to the interval [0, 1]. The authors report performance improvements when compared to competing approaches.
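As a concrete reference point for the matrix completion step, the snippet below implements a generic singular value thresholding loop in NumPy, together with the clipping of scores to [0, 1] mentioned above. It is not the exact BNNR algorithm, which additionally balances approximation error and rank through dedicated regularisation terms; tau, step, and n_iters are illustrative values.

```python
import numpy as np

def svt_complete(m_obs, mask, tau=5.0, step=1.2, n_iters=200):
    """Generic singular value thresholding (SVT) for matrix completion.

    m_obs : observed block matrix (unobserved entries can be zero)
    mask  : boolean matrix, True where an entry is observed
    """
    y = np.zeros_like(m_obs, dtype=float)
    x = y
    for _ in range(n_iters):
        # Shrink: soft-threshold the singular values of the dual variable.
        u, s, vt = np.linalg.svd(y, full_matrices=False)
        x = (u * np.maximum(s - tau, 0.0)) @ vt
        # Gradient step on the observed entries only.
        y = y + step * mask * (m_obs - x)
    # As in BNNR, association scores can additionally be clipped to [0, 1].
    return np.clip(x, 0.0, 1.0)
```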
Alternatively, Wang et al. [197] recently proposed an approach to predict new drug indications based on two bipartite graphs, capturing drug-target interactions and disease-gene associations, and a PPI graph. Their algorithm is composed of an encoder module relying on graph attention networks (GAT; [191]) and an MLP decoder module. The encoder derives drug and disease embeddings through the distillation of information along the edges of the graphs. The input features for drugs and diseases are based on similarity measures: a drug's features correspond to the Tanimoto similarities between its SMILES-derived representation and those of the other drugs, while a disease's features are defined by its similarity to other diseases computed from MeSH-associated terms.

Combination therapies have been particularly effective in diseases with complex aetiology or an evolutionary component where resistance to treatment is common, such as infectious diseases. If synergistic drugs are found, one can reduce doses whilst improving efficacy [72, 89]. Strategically, combination therapies provide an additional way to extend the indications and efficacy of available entities. They can be used for a range of purposes, for example: convenience and compliance as a fixed-dose formulation (e.g. valsartan and hydrochlorothiazide for hypertension [48]), to achieve synergies (e.g. co-trimoxazole: trimethoprim and sulfamethoxazole for bacterial infections), to broaden spectrum (e.g. for the treatment of infections by an unknown pathogen), or to combat disease resistance (e.g. multi-drug regimens for drug-sensitive and drug-resistant tuberculosis). The number of potential pairwise combinations alone makes brute-force empirical laboratory testing a lengthy and daunting prospect: with around 4,000 approved drugs, testing every pair at a single dose would already require ∼8 million experiments. Furthermore, there are limitless ways to change the dosage, the timing of treatments, and the delivery method. Arguably, some of the first work using GML to model combination therapy was DECAGON by Zitnik et al. [216], used to model polypharmacy side effects via a multi-modal graph capturing drug-side effect-drug triplets in addition to PPIs. In contrast, Deac et al. [43] forwent the incorporation of a knowledge graph, instead modelling drug structures directly and using a co-attention mechanism to achieve a similar level of accuracy. Typically, architectures predicting drug-drug antagonism can be minimally adapted for the prediction of synergy. However, more nuanced architectures are emerging that combine partial knowledge of drug-target interactions with target-disease machine learning modules [86].

In the last year, to address the unprecedented COVID-19 global health crisis, multiple research groups have explored graph-based approaches to identify drugs that could be repurposed to treat SARS-CoV-2 [81, 119, 210, 214]. For instance, Morselli et al. [119] proposed an ensemble approach combining three different graph-based association strategies. The first two are similar in principle: each drug and each disease is represented by the set of proteins that it targets, and the association between a drug and a disease is quantified based on the distance between the two sets on a PPI graph (a minimal shortest-path version is sketched below). The two approaches differ on whether the distance is quantified with shortest paths or random walks.
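A minimal shortest-path version of this set-based proximity can be written with NetworkX as follows; it is an illustration of the idea rather than the exact measure used by Morselli et al., which also considers random-walk variants and statistical normalisation. The graph and gene names in the toy example are hypothetical.

```python
import networkx as nx

def drug_disease_proximity(ppi, drug_targets, disease_genes):
    """Average shortest-path distance from each drug target to the closest
    disease-associated gene on a PPI graph (smaller = closer association)."""
    distances = []
    for target in set(drug_targets) & set(ppi):
        # Shortest-path lengths from this target to every reachable node.
        lengths = nx.single_source_shortest_path_length(ppi, target)
        reachable = [lengths[g] for g in disease_genes if g in lengths]
        if reachable:
            distances.append(min(reachable))
    return sum(distances) / len(distances) if distances else float("inf")

# Toy example on a hypothetical PPI path graph P1-P2-P3-P4-P5.
ppi = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")])
print(drug_disease_proximity(ppi, {"P1", "P2"}, {"P4", "P5"}))  # (3 + 2) / 2 = 2.5
```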
The third strategy relies on GNNs for multimodal graphs (knowledge graphs). The graph contains PPIs, drug-target interactions, disease-protein associations, and drug indications. The formulation of the GNN layer is taken from the DECAGON model [216], an architecture similar to the R-GCN model [164]. Alternatively, Zeng et al. [210] use RotatE to identify repurposing hypotheses to treat SARS-CoV-2 from a large knowledge graph constructed from multiple data sources and capturing diverse relationships between entities such as drugs, diseases, genes, and anatomies. Additionally, Ioannidis et al. [81] proposed a modification of the R-GCN architecture to handle few-shot learning settings in which some relations connect only a handful of nodes. In the rare disease arena, Sosa et al. [175] introduced a knowledge graph embedding approach to identify repurposing hypotheses for rare diseases. The problem is cast as a link prediction problem in a knowledge graph. The authors use the Global Network of Biological Relationships (GNBR) [140], a knowledge graph built through literature mining that contains diverse relationships between diseases, drugs, and genes. Due to the uncertainty associated with associations obtained from literature mining, the authors use a knowledge graph embedding approach designed to account for uncertainty [30]. Finally, the highest-ranking associations between drugs and rare diseases are investigated, highlighting literature and biological support. Drug repurposing is now establishing itself as a first use case of GML methods likely to lead to new therapies within the coming years. Outside of the pharmaceutical industry, GML methods recommending nutraceuticals [193] may also offer fast routes to market through generally recognized as safe (GRAS) regulatory designations.

We have discussed how GML has produced state-of-the-art results both on graph-level problems, for the description of drugs and other biomolecules, and on node-level problems, for the navigation of knowledge graphs and the representation of disease biology. With the design, synthesis, and testing of de novo small molecules [211], the in vitro and in vivo testing of drug repurposing hypotheses [179], and target identification frameworks being conceptualised [145], we are potentially entering a golden age of validation for GML within drug discovery and development.

A few key hurdles limit lossless representation of biology. At the molecular level, bonds can be either rotatable single bonds or fixed bonds; accurately representing the degrees of freedom of a molecule is a topic of active research [56]. At the cellular level, the expression of mRNA and proteins exhibits stochastic dynamics [90, 150]. A pathway is not expressed in a binary fashion: some proteins may only have the potential to be expressed, e.g. via unspliced pre-mRNA, while proteases are actively recycling unwanted proteins. Historically, most -omic platforms have recorded an average "bulk" signal; however, with the recent single-cell revolution, GML offers a principled approach to the characterisation of signalling cascades.

GML itself is still in its infancy, and its underlying theoretical guarantees and limitations are under active research. For instance, deeper GNNs suffer from oversmoothing of features and oversquashing of information. Oversmoothing is the phenomenon of features washing out through repeated rounds of message passing and aggregation [133], illustrated numerically below. The inverse problem, having too few layers to exchange information globally, is referred to as under-reaching [8].
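The toy example below illustrates oversmoothing numerically: repeated rounds of neighbourhood averaging on a small ring graph drive all node features towards a common value, so the variation across nodes collapses as depth grows. It is a didactic sketch, independent of any particular GNN architecture.

```python
import numpy as np

# Toy illustration of oversmoothing: repeated neighbourhood averaging on a
# small ring graph washes out the differences between node features.
n, d = 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))                      # random node features

adj = np.zeros((n, n))
for i in range(n):                               # ring graph + self-loops
    adj[i, i] = adj[i, (i + 1) % n] = adj[i, (i - 1) % n] = 1.0
adj /= adj.sum(axis=1, keepdims=True)            # row-normalised propagation

for depth in [0, 2, 8, 32]:
    h = np.linalg.matrix_power(adj, depth) @ x
    spread = h.std(axis=0).mean()                # variation across nodes
    print(f"{depth:>2} rounds of aggregation -> feature spread {spread:.4f}")
```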
These issues have limited the expressivity of traditional GNNs [31, 45]. To alleviate this, a promising direction is to incorporate global information in the model, for instance by using contrastive approaches [181, 192], by augmenting the framework with relative positioning to anchor nodes [100, 207], or by implementing long-range information flow [56]. Due to the problem of missing data within biomedical knowledge graphs, we envisage opportunities for active learning to be deployed to flag critical missing data points for experimental exploration and thereby reduce model uncertainty [184]. Given the significant expense associated with drug discovery and development, integrating in silico modelling and experimental research is of great strategic importance. While active learning has previously led to biased datasets in other settings, modern techniques are addressing these drawbacks [2, 67]. Finally, because GML allows for the representation of unstructured multimodal datasets, one can expect to see tremendous advances within data integration. Most notably, highly multiplexed single-cell omic technologies are now being extended to spatial settings [5, 22]. In addition, CRISPR screening data with associated RNA sequencing readouts are emerging as promising tools to identify key genes controlling a cellular phenotype [42].

References

Recent advances in ligand-based drug design: relevance and utility of the conformationally sampled pharmacophore approach
Active Learning for Imbalanced Datasets
Targeting the TGF signalling pathway in disease
Diffusion-convolutional neural networks
Mass Cytometry Imaging for the Study of Human Diseases: Applications and Data Analysis Strategies
TuckER: Tensor Factorization for Knowledge Graph Completion
Network medicine: a network-based approach to human disease
The logical expressiveness of graph neural networks
Laplacian eigenmaps and spectral techniques for embedding and clustering
Limitations of algebraic approaches to graph isomorphism testing
Spectral Clustering with Graph Neural Networks for Graph Pooling
A Personal History of Using Crystals and Crystallography to Understand Biology and Advanced Drug Discovery
High-throughput crystallography for lead discovery in drug design
Deep Graph Mapper: Seeing Graphs through the Neural Lens
Choosing the right tool for the job: RNAi, TALEN, or CRISPR
Translating embeddings for modeling multi-relational data
Graph cuts in vision and graphics: Theories and applications. In Handbook of mathematical models in computer vision
Beyond natural antibodies: the power of in vitro display technologies
"New drug" designations for new therapeutic entities: new active substance, new chemical entity, new biological entity, new molecular entity
Signature verification using a "siamese" time delay neural network
Geometric deep learning: going beyond Euclidean data
Spatial transcriptomics coming of age
Systems biology in drug discovery
Graph Regularized Nonnegative Matrix Factorization for Data Representation
Towards sparse hierarchical graph classifiers
Open Targets Platform: new developments and updates two years on
Machine Learning on Graphs: A Model and Comprehensive Taxonomy
Hyperbolic graph convolutional neural networks
Heterogeneous network embedding via deep architectures
Embedding uncertain knowledge graphs
Can graph neural networks count substructures?
Machine learning of accurate energy-conserving molecular force fields
Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer
Graph neural networks for IceCube signal classification
Design, synthesis and selection of DNA-encoded small-molecule libraries
Advancing the Use of Mobile Technologies in Clinical Trials: Recommendations from the
The Drug Repurposing Hub: a next-generation drug library and information resource
Discovering Symbolic Models from Deep Learning with Inductive Biases
Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches
Selective enzyme purification by affinity chromatography
ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature
Identification of required host factors for SARS-CoV-2 infection in human cells
Drug-drug adverse effect prediction with graph co-attention
Attentive cross-modal paratope prediction
Understanding the representation power of graph neural networks in learning graph topology
Intrinsically disordered proteins and structured proteins with intrinsically disordered regions have different functional roles in the cell
Innovation in the pharmaceutical industry: new estimates of R&D costs
Fixed-dose combination pharmacologic therapy to improve hypertension control worldwide: Clinical perspective and policy implications
Convolutional networks on graphs for learning molecular fingerprints
The discovery of first-in-class drugs: origins and evolution
Fragment-based drug discovery
A genetics-led approach defines the drug target landscape of 30 immune-related traits
Improvement in ADMET Prediction with Multitask Deep Featurization
PotentialNet for molecular property prediction
Fast Graph Representation Learning with PyTorch Geometric
Neural Message Passing on High Order Paths
Protein interface prediction using graph convolutional networks
Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning
Library-based display technologies: where do we stand
Graph U-Nets
Interpretable Drug Target Prediction Using Deep Neural Representation
Neural message passing for quantum chemistry
Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME
A new model for learning in graph domains
node2vec: Scalable feature learning for networks
Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision
Yeast two-hybrid methods and their applications in drug discovery
Inductive representation learning on large graphs
Understanding biological functions through molecular networks
Generating functional protein variants with variational autoencoders
Methods for high-throughput drug combination screening and synergy scoring
Single-cell analysis tools for drug discovery and development
Exploiting the PI3K/AKT pathway for cancer drug discovery
In silico methods for drug repurposing and pharmacology
Accelerated attributed network embedding
Label informed attributed network embedding
Principles of early drug discovery
From big data to precision medicine
Generative models for graph-based protein design
Few-shot link prediction via graph neural networks for Covid-19 drug-repurposing
Graphein: a Python Library for Geometric Deep Learning and Network Analysis on Protein Structures
The Wnt-dependent signaling pathways as target in oncology drug discovery
Junction Tree Variational Autoencoder for Molecular Graph Generation
Hierarchical Generation of Molecular Graphs using Structural Motifs
Modeling Drug Combinations based on Molecular Structures and Biological Targets
Molecular hypergraph grammar with its application to molecular optimization
Nassir Navab, and Michael Bronstein. 2020. Differentiable Graph Module (DGM) Graph Convolutional Networks
Multicomponent therapeutics for networked systems
Cell-signalling dynamics in time and space
Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval
Semi-supervised classification with graph convolutional networks
Directional message passing for molecular graphs
Diffusion kernels on graphs and other discrete structures
Characterization of Protein-Protein and Protein-Peptide Interactions: Implication for Biologics Design
Traffic prediction with advanced Graph Neural Networks
Dual direction CRISPR transcriptional regulation screening uncovers gene networks driving drug resistance
Self-attention graph pooling
Modeling protein-protein, protein-peptide, and protein-oligosaccharide complexes: CAPRI 7th edition
Distance Encoding: Design Provably More Powerful GNNs for Structural Representation Learning
A three-hybrid system for detecting small ligand-protein receptor interactions
Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings
Target identification using drug affinity responsive target stability (DARTS)
Clique pooling for graph classification
Integrate multi-omics data with biological interaction networks using Multi-view Factorization AutoEncoder (MAE)
Understanding the impacts of missense mutations on structures and functions of human cancer-related genes: A preliminary computational analysis of the COSMIC Cancer Gene Census
Invariant and Equivariant Graph Networks
Clinical trial cycle times continue to increase despite industry efforts
"Omics"-informed drug and biomarker discovery: opportunities, challenges and future perspectives
MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices
Automatic generation of complementary descriptors with molecular graph networks
Efficient estimation of word representations in vector space
Network motifs: simple building blocks of complex networks
Structural mass spectrometry in biologics discovery: advances and future trends
Opportunities and challenges in phenotypic drug discovery: an industry perspective
Geometric deep learning on graphs and manifolds using mixture model CNNs
Fake news detection on social media using geometric deep learning
Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19. arXiv (2020)
Machine learning: a probabilistic perspective
Structural biology in fragment-based drug design
A multiple kernel learning algorithm for drug-target interaction prediction
The support of human genetic evidence for approved drug indications
Multiview learning for understanding functional multiomics
A three-way model for collective learning on multi-relational data
Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools
Description of research data in laboratory notebooks: Challenges and opportunities
Drugs derived from phage display: from candidate identification to clinical practice
New tricks for old drugs
CHOPIN: a web resource for the structural and functional proteome of Mycobacterium tuberculosis
DDR: efficient computational method to predict drug-target interactions using graph mining and machine learning approaches
Mass spectrometry-based proteomics turns quantitative
Graph Neural Networks Exponentially Lose Expressive Power for Node Classification
Instability and dynamic pattern in cellular networks
An omics perspective on drug target discovery platforms
PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest
ReDO_DB: the repurposing drugs in oncology database
Biologics and biosimilars
How to improve R&D productivity: the pharmaceutical industry's grand challenge
A global network of biomedical relationships derived from text
Combined shRNA over CRISPR/Cas9 as a methodology to detect off-target effects and a potential compensatory mechanism
DeepWalk: Online learning of social representations
Stabilizing Variable Selection and Regression
A medicinal chemistry perspective of drug repositioning: Recent advances and challenges in drug discovery
Relation-weighted link prediction for disease gene identification
Walter Birchmeier, and Nikolaus Rajewsky. 2020. Tracing tumorigenesis in a solid tumor model at single-cell resolution
Modeling interactome: scale-free or geometric?
A large-scale evaluation of computational protein function prediction
Protein-ligand scoring with convolutional neural networks
Nature, nurture, or chance: stochastic gene expression and its consequences
Quantum chemistry structures and properties of 134 kilo molecules
Evaluating protein transfer learning with TAPE
Machine learning applications in drug development
Machine learning-driven protein engineering: a case study in computational drug discovery
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
Knowledge Graph Embedding for Link Prediction: A Comparative Analysis
Temporal graph networks for deep learning on dynamic graphs
SIGN: Scalable Inception Graph Neural Networks
Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers
Interpreting molecular similarity between patients as a determinant of disease comorbidity relationships
Panomics for precision medicine
Utilizing social media data for pharmacovigilance: a review
Target identification and mechanism of action in chemical biology and drug discovery
Modeling relational data with graph convolutional networks
SchNet: A deep learning architecture for molecules and materials
Weisfeiler-Lehman graph kernels
Efficient graphlet kernels for large graph comparison
Formulation and manufacturability of biologics
Graph neural networks in particle physics
X-rays in the cryo-electron microscopy era: Structural biology's dynamic future
Mabellini: a genome-wide database for understanding the structural proteome and evaluating prospective antimicrobial targets of the emerging pathogen Mycobacterium abscessus
Social media and pharmacovigilance: a review of the opportunities and challenges
Evaluation of RNAi and CRISPR technologies by large-scale gene expression profiling in the Connectivity Map
Structural coverage of the proteome for pharmaceutical applications
A Literature-Based Knowledge Graph Embedding Method for Identifying Drug Repurposing Opportunities in Rare Diseases
Supervised neural networks for the classification of structures
Mass cytometry: single cells, many features
Ten years on - Measuring return from pharmaceutical innovation 2019
A deep learning approach to antibiotic discovery
Fast and flexible protein design using deep graph neural networks
InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization
RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space
Data-driven transformation in drug discovery
A review of active learning approaches to experimental design for uncovering biological networks
How were new medicines discovered?
LINE: Large-scale information network embedding
Identification and validation of protein targets of bioactive small molecules
Graph convolutional neural networks for predicting drug-target interactions
Complex embeddings for simple link prediction
Intrinsically disordered proteins and their "mysterious" (meta)physics
Graph Attention Networks
Deep Graph Infomax
HyperFoods: Machine intelligent mapping of cancer-beating molecules in foods
Anastasios Doulamis, and Eftychios Protopapadakis
MORONET: Multi-omics Integration via Graph Convolutional Networks for Biomedical Data Classification
Dynamic graph CNN for learning on point clouds
Toward heterogeneous information fusion: bipartite graph convolutional networks for in silico drug repurposing
A reduction of a graph to a canonical form and an algebra arising during this reduction
Estimated research and development investment needed to bring a new medicine to market
How Powerful are Graph Neural Networks?
Representation Learning on Graphs with Jumping Knowledge Networks
Embedding entities and relations for learning and inference in knowledge bases
AliGraph: A comprehensive graph neural network platform
Machine-learning-guided directed evolution for protein engineering
Drug repositioning based on bounded nuclear norm regularization
Hierarchical graph representation learning with differentiable pooling
Position-aware Graph Neural Networks
Recent trends in deep learning based natural language processing
Structural learning of proteins using graph convolutional neural networks
Repurpose Open Data to Discover Therapeutics for COVID-19 using Deep Learning
Deep learning enables rapid identification of potent DDR1 kinase inhibitors
Phenotypic screens as a renewed approach for drug discovery. Drug discovery today 18
Comprehensive prediction of drug-protein interactions and side effects for the human proteome
Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2
Graph Neural Network on Electronic Health Records for Predicting Alzheimer's Disease
Modeling polypharmacy side effects with graph convolutional networks

Acknowledgements

We gratefully acknowledge support from William L. Hamilton, Benjamin Swerner, Lyuba V. Bozhilova, and Andrew Anighoro.