key: cord-0502617-kprykn1o authors: Zhang, Shuo; Liu, Yang; Xie, Lei title: Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures date: 2020-11-15 journal: nan DOI: nan sha: a1f49dd940bd65a4db02a7908117ac3217a5e8c4 doc_id: 502617 cord_uid: kprykn1o

The prediction of physicochemical properties from molecular structures is a crucial task for artificial intelligence-aided molecular design. A growing number of Graph Neural Networks (GNNs) have been proposed to address this challenge. These models improve their expressive power by incorporating auxiliary information in molecules, but inevitably increase their computational complexity. In this work, we aim to design a GNN that is both powerful and efficient for molecular structures. To achieve this goal, we propose a molecular mechanics-driven approach: we first represent each molecule as a two-layer multiplex graph, where one layer contains only local connections that mainly capture covalent interactions and the other layer contains global connections that can simulate non-covalent interactions. For each layer, a corresponding message passing module is proposed to balance the trade-off between expressive power and computational complexity. Based on these two modules, we build the Multiplex Molecular Graph Neural Network (MXMNet). When validated on the QM9 dataset of small molecules and the PDBBind dataset of large protein-ligand complexes, MXMNet achieves superior results to the existing state-of-the-art models under restricted resources.

Human society benefits greatly from the discovery and design of new molecules with desired properties, from COVID-19 vaccines to solar cells. Artificial intelligence (AI) plays an increasingly important role in accelerating the molecular discovery process. One of the crucial tasks in AI-assisted molecular design is to predict the physicochemical properties of molecules from their structures. In recent years, many machine learning techniques have been proposed for the representation learning of molecules to reduce the computational cost involved in density functional theory (DFT) calculations and molecular dynamics (MD) simulations [1]. Among those methods, Graph Neural Networks (GNNs) have shown superior performance by treating the molecule as a graph and performing a message passing scheme on it [2]. To better model the interactions in molecules and increase the expressive power of such methods, previous GNNs have adopted auxiliary information such as chemical properties, pairwise distances between atoms, and angular information [3, 4, 5, 6, 7, 8, 9]. However, adopting such information inevitably increases the computational complexity. For example, when passing messages on a molecular graph that has N nodes with an average of k nearest neighbors per node, O(Nk^2) or O(N^3) messages are required in the worst case for the previous state-of-the-art GNNs [8, 9] to capture the angular information.

[Figure 1: Memory consumption during training on QM9 [11]. When compared with SchNet [6], PhysNet [7], and DimeNet [8], MXMNet gets the state-of-the-art performance and is memory-efficient.]

With restricted memory resources, those GNNs can exhibit limited expressive power or even fail when applied to macromolecules like proteins or RNAs. To address this limitation, we propose a novel GNN that is both powerful and efficient. Inspired by molecular mechanics methods [10], we use the angular information to model only the local connections, avoiding expensive computations over all connections.
We further divide the molecular interactions into two categories: local and global. A two-layer multiplex graph G = {G_l, G_g} is then constructed for each molecule. In G, the local layer G_l contains only the local connections that mainly capture covalent interactions, and the global layer G_g contains the global connections that cover non-covalent interactions. With the multiplex molecular graphs, we design the Multiplex Molecular (MXM) module, which contains a novel angle-aware message passing operating on G_l and an efficient message passing operating on G_g. Note that the MXM module reduces the computational complexity by not capturing angular information for non-local interactions. Finally, we construct the Multiplex Molecular Graph Neural Network (MXMNet) for the representation learning of molecules.

To empirically evaluate the power and efficiency of MXMNet, we conduct experiments on QM9 [11], a dataset of small molecules, and PDBBind [12], a dataset of protein-ligand complexes. On both datasets, our model outperforms the baseline models. Regarding efficiency, our model requires significantly less memory than the previous state-of-the-art model [8], as shown in Figure 1, and achieves a 2.6× training speedup. The main contributions of our work are as follows:

• We propose a molecular mechanics-driven approach to represent a molecule as a two-layer multiplex graph, where one layer contains local connections and the other contains global connections.

• We propose the Multiplex Molecular (MXM) module, which performs message passing on the whole multiplex graph. The MXM module captures global pairwise distances and local angles to be both powerful and efficient.

• We propose the Multiplex Molecular Graph Neural Network (MXMNet) based on the MXM module. Experiments on benchmark datasets validate that MXMNet achieves state-of-the-art performance and is efficient.

GNNs for Molecules. To learn the representations of graph-structured data using neural networks, Graph Neural Networks (GNNs) have been proposed [3, 13, 14] and have attracted growing interest. Due to the superior performance achieved by GNNs in various tasks, researchers began to apply GNNs to predict various properties of molecules. Initial works treat the chemical bonds in molecules as edges and the atoms as nodes to create graphs for molecules [3, 4, 5]. These GNNs also integrate many hand-picked chemical features to improve performance. However, they do not take into account the 3-dimensional structure of molecules, which is critical for many physicochemical properties. Thus later works [15, 6, 16, 7] take the atomic positions into consideration and use interatomic distances to create the edges as well as the edge features between atoms. Usually, a cutoff distance is used to find the neighbors in molecules instead of creating a complete graph, which reduces the computational complexity and overfitting. However, the cutoff setting can sometimes lead GNNs to fail to distinguish certain molecules [8]. To solve this issue, angular information is further used in GNNs to achieve higher expressive power [8, 9]. However, those angle-aware GNNs have significantly higher time and space complexity than the previous works and are not scalable to macromolecules or large-batch learning.

Multiplex Graph. The multiplex graph (a.k.a. multi-view graph) consists of multiple types of edges among a set of nodes.
Informally, it can be considered as a collection of graphs, where each edge type over the shared set of nodes forms a graph, or layer. To obtain the representation of each node, both intra-layer relationships and cross-layer relationships have to be addressed properly. In practice, various methods have been proposed to learn embeddings of multiplex graphs [17, 18, 19, 20, 21], and the multiplex graph has been applied in many fields [22, 23, 24]. For the representation learning of molecules, previous work [25] implicitly represents molecular graphs as multiplex graphs and passes messages according to the edge types. In this work, we explicitly represent molecules as multiplex graphs based on the geometric information in molecules. Moreover, we propose different message passing schemes for the different layers of the multiplex graph.

In this section, we introduce the preliminaries of our work, beginning with the main notations used in this paper. Let G = (V, E) be a graph with N = |V| nodes and M = |E| edges. The nearest neighbors of node i are defined as N(i) = {j | d(i, j) = 1}, where d(i, j) is the shortest distance between nodes i and j. The average number of nearest neighbors per node is k = 2M/N. In later formulations, we use h_i for the embedding of node i; e_ji for the edge embedding between nodes j and i, which embeds their pairwise distance; m_ji for the message sent from node j to node i in the message passing scheme [5]; MLP for a multi-layer perceptron; ‖ for the concatenation operation; ⊙ for element-wise multiplication; and W for a weight matrix. Next we provide the definition of a multiplex graph:

Definition 1. Multiplex Graph. A multiplex graph can be defined as an (L+1)-tuple G = (V, E_1, ..., E_L), where V is the set of nodes and, for each l ∈ {1, 2, ..., L}, E_l is the set of edges of type l between pairs of nodes in V. By defining the graph G_l = (V, E_l), also called a plex or a layer, the multiplex graph can be seen as the set of graphs G = {G_1, G_2, ..., G_L}.

Now we introduce the message passing scheme [5], which is a general graph convolution used in spatial-based GNNs [2]:

Definition 2. Message Passing. Given a graph G, the node feature of each node i is x_i, and the edge feature for each node pair j and i is e_ji. The message passing scheme iteratively updates the node embeddings h using the following functions:

$$m_{ji}^{t} = f_m\big(h_i^{t}, h_j^{t}, e_{ji}\big), \qquad h_i^{t+1} = f_u\Big(h_i^{t}, \sum_{j \in N(i)} m_{ji}^{t}\Big),$$

where the superscript t denotes the t-th iteration, h_i^0 = x_i, and f_m and f_u are learnable functions.

In recent works [8, 9], the message passing scheme has been modified to capture the angular information in a 3D molecular graph G = (V, E) with N nodes and their Cartesian coordinates r = {r_1, ..., r_N}, where r_i ∈ R^3 is the position of node i. To analyze the computational complexity of these models, we start from the number of angles in G to be captured:

Theorem 1. Given a 3D molecular graph G, each pair of adjacent edges that share a common node defines an angle in G. There are O(Nk^2) angles in G, where N is the number of nodes and k is the average number of nearest neighbors per node.

The proof is straightforward: each node in G has an average of k edges connected to it, and those k edges define k(k-1)/2 angles. Thus in total there are O(Nk^2) angles in G. To capture those angles in the message passing scheme, recent approaches [8, 9] use at least one message per angle, so their computational complexity is at least O(Nk^2) per graph per operation.
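Theorem 1 can be checked directly: the number of angles equals the sum over all nodes of C(deg(i), 2). Below is a minimal sketch in plain Python; the helper name count_angles is ours, not from the paper.

```python
from collections import defaultdict
from math import comb

def count_angles(edges):
    """Count angles defined by pairs of adjacent edges sharing a node.

    `edges` is a list of undirected (i, j) pairs. Every node of degree d
    contributes C(d, 2) angles, so the total is O(N * k^2) for N nodes
    with average degree k, as stated in Theorem 1.
    """
    degree = defaultdict(int)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return sum(comb(d, 2) for d in degree.values())

# A water-like toy graph: O bonded to two H atoms -> exactly one angle (H-O-H).
print(count_angles([(0, 1), (0, 2)]))  # 1
```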
Finally, we state the problem investigated in this work: given a 3D molecular structure, design a GNN that predicts its properties while being both expressive and computationally efficient.

In this section, we introduce our molecular mechanics-driven approach, including the multiplex molecular graphs, the Multiplex Molecular (MXM) module, and the Multiplex Molecular Graph Neural Network (MXMNet).

In molecular mechanics methods [10], the molecular energy E is modeled as E = E_local + E_nonlocal (see Figure 2(a)), where E_local = E_bond + E_angle + E_dihedral models the local, covalent interactions: E_bond depends on bond lengths, E_angle on bond angles, and E_dihedral on dihedral angles. E_nonlocal models the non-local, non-covalent interactions between atom pairs. Focusing on the geometric information contained in the molecular mechanics method, we find that the local interactions capture the angles α_local and the pairwise distances d_local, while the non-local interactions capture only the pairwise distances d_nonlocal (see Figure 2(b)). This inspires us to use the angular information to model only the local interactions instead of all interactions, reducing the computational complexity of our model.

To achieve this, we first divide the geometric information (GI) in molecular mechanics methods into two groups: the local GI, which contains α_local and d_local, and the global GI, which contains d_local and d_nonlocal (see Figure 2(b)). Given a 3D molecule, we then construct the corresponding interaction graphs that contain the different GI (see Figure 2(c)). For the local GI, we create edges either from chemical bonds or by finding the neighbors of each node within a small cutoff distance, depending on the task being investigated. For the global GI, we create edges between each node and its neighbors within a relatively large cutoff distance. We treat these interaction graphs as layers to build a multiplex molecular graph G = {G_l, G_g}, which consists of a local layer G_l and a global layer G_g (see Figure 2(d)). The resulting G is used as the input of our model.

With the multiplex molecular graph G, we propose the Multiplex Molecular (MXM) module, which uses different rules to update the node embeddings based on the different edges in G (see Figure 2(e)). For G_g, we propose the global layer message passing; for G_l, we propose the local layer message passing. To transfer information between the layers, we use a cross layer mapping. These operations are introduced in detail below.

Global Layer Message Passing Module. In this module, message passing is performed on the global layer, which contains both local and non-local connections. We propose a message passing module that captures the pairwise distances, based on the message passing defined in Definition 2. Note that the message passing in Definition 2 aggregates only the one-hop neighbors of the central node per iteration. Inspired by previous works demonstrating the power of addressing high-order neighbors in GNNs [26, 27, 28], we here propose a message passing that captures up to the two-hop neighbors per iteration. A straightforward way to achieve this would be to directly aggregate all two-hop neighbors; however, this would require O(Nk^2) messages on the graph per iteration. Instead, we perform the one-hop message passing twice in each iteration to address the two-hop neighbors. The resulting operation needs only O(2Nk) messages.
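For intuition, the two-hop trick can be sketched with a bare-bones sum aggregation. This omits the learned functions f_m and f_u and the edge embeddings of the actual module; the helper name one_hop_aggregate is illustrative only.

```python
import torch

def one_hop_aggregate(h, edge_index):
    """One round of sum aggregation: each target node receives one message
    per incident edge, i.e. O(N * k) messages in total."""
    src, dst = edge_index  # messages flow src -> dst
    out = torch.zeros_like(h)
    out.index_add_(0, dst, h[src])
    return out

# Path graph 0-1-2 (both edge directions). After two rounds, node 0's
# embedding depends on node 2, its two-hop neighbor, at O(2Nk) message cost.
h = torch.eye(3)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
h = one_hop_aggregate(h, edge_index)   # first pass: one-hop information
h = one_hop_aggregate(h, edge_index)   # second pass: two-hop information
print(h[0])  # non-zero in coordinate 2 -> influence from node 2
```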
As illustrated in Figure 3(b), our global layer message passing module consists of two identical message passing operations that capture the pairwise distance information e. Each message passing operation follows Definition 2 with i, j ∈ G_g, where superscripts denote the state of h within the operation. In our global layer message passing, an update function f_u is applied between the two message passing operations. We define f_u using multiple residual modules (see Figure 3(d)), each consisting of a two-layer MLP and a skip connection (see Figure 3(e)).

Local Layer Message Passing Module. This module performs message passing on the local layer and incorporates both the pairwise distances and the angles associated with local interactions. In practice, we propose a message passing that captures up to the two-hop neighbors per iteration. In this way, the edges define two kinds of angles: the two-hop angles between one-hop edges and two-hop edges (∠ij_1k_1 and ∠ij_1k_2 in Figure 4), and the one-hop angles between one-hop edges only (∠j_1ij_2 and ∠j_1ij_3 in Figure 4). Our message passing captures all of these angles, while the previous work [8] captures only the two-hop angles. In detail, we propose a 3-step message passing scheme for the local layer: Step 1 contains Message Passing 1, which captures the two-hop angles and related pairwise distances to update the edge-level embeddings {m_ji} (see Figure 4(a)). Step 2 contains Message Passing 2, which captures the one-hop angles and related pairwise distances to further update {m_ji} (see Figure 4(b)). Step 3 finally aggregates {m_ji} to update the node-level embedding h_i. The update equations for the t-th iteration follow these three steps, where i, j, k ∈ G_l and a_kj,ji is the feature for the angle α_kj,ji = ∠h_k h_j h_i. We define f_u in the same form as in the global layer message passing. These steps need O(2Nk^2 + Nk) messages in total. Figure 3(c) illustrates the architecture of the local layer message passing. Note that we also include an Output module (see Figure 3(c) and (f)), which is used to produce the output when building the whole GNN model later.

Cross Layer Mapping. After defining the message passing modules for the local and global layers, we further use cross layer mapping functions to address the connections between the same nodes across the different layers of a multiplex molecular graph (see Figure 2(d)). Each mapping function takes either the node embeddings {h_g} in the global layer or the node embeddings {h_l} in the local layer as input, and maps them to replace the node embeddings in the other layer (see Figure 3(a)), where g ∈ G_g and l ∈ G_l. The two mapping functions f_cross^(g→l) and f_cross^(l→g) are learnable; in practice, we use multi-layer perceptrons for both. Each mapping needs O(N) messages.

With the MXM module, we build the Multiplex Molecular Graph Neural Network (MXMNet) for the prediction of molecular properties, as shown in Figure 3(g). In the Embedding module, the atomic numbers Z are mapped to randomly initialized, trainable embeddings that serve as the input node embeddings. In the RBF & SBF module, the Cartesian coordinates r of the atoms are used to compute the pairwise distances and angles. We use the basis functions proposed in [8] to construct the representations e_RBF and a_SBF.
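For concreteness, the sketch below implements an order-0 spherical Bessel radial basis in the style of DimeNet [8] for the distance features; the smooth envelope polynomial used in [8] is omitted, and the cutoff and basis-size values are placeholders rather than the exact settings of our model.

```python
import math
import torch

def bessel_rbf(d, cutoff=5.0, num_rbf=16):
    """Order-0 spherical Bessel radial basis in the style of DimeNet [8]:
    e_RBF,n(d) = sqrt(2/c) * sin(n * pi * d / c) / d,  for n = 1..num_rbf.

    `d` holds pairwise distances with 0 < d <= cutoff. The DimeNet paper
    additionally multiplies by a smooth polynomial envelope, omitted here.
    """
    n = torch.arange(1, num_rbf + 1, dtype=d.dtype)            # (num_rbf,)
    d = d.unsqueeze(-1)                                        # (..., 1)
    return math.sqrt(2.0 / cutoff) * torch.sin(n * math.pi * d / cutoff) / d

distances = torch.tensor([0.96, 1.52, 3.10])   # toy bond lengths in Angstrom
print(bessel_rbf(distances).shape)             # torch.Size([3, 16])
```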
We then stack MXM modules to perform message passing. In each MXM module, we use an Output module to get the node-level output. The final prediction y is computed by summing the outputs over all nodes and all modules.

Expressive Power. We analyze the expressive power of MXMNet by focusing on the effect of the captured geometric information on representing molecular structures. Since MXMNet takes into consideration the pairwise distance information in global connections and the angular information in local connections, it is more powerful than GNNs that capture only the pairwise distance information [15, 6, 16, 7]. Compared with GNNs that capture both pairwise distances and angles in global connections [8, 9], MXMNet theoretically has lower expressive power due to the uncaptured angular information in non-local connections. However, expressive power does not directly determine the generalization ability of GNNs [29, 30], and our experiments empirically show that MXMNet exhibits good generalization ability with state-of-the-art performance.

Computational Complexity. To analyze the computational complexity, we focus on the time and space complexity of message passing in MXMNet. We denote the cutoff distances used to create edges in G_g and G_l as d_g and d_l, and the average number of nearest neighbors per node as k_g in G_g and k_l in G_l. For 3D molecules, k_g ∝ d_g^3 and k_l ∝ d_l^3; since d_g > d_l, we have k_g ≫ k_l. As discussed in the previous sections, the message passing operations in our MXM module require O(2Nk_g + 2Nk_l^2 + Nk_l + 2N) messages in total. Therefore, MXMNet is much more efficient than the GNNs that capture angular information in global connections [8, 9], which require O(Nk_g^2) messages.

In our experiments, we evaluate the generalization power as well as the efficiency of MXMNet on the QM9 dataset for predicting molecular properties and on the PDBBind dataset for predicting protein-ligand binding affinities. Several state-of-the-art baseline models are included for comparison.

QM9. The QM9 dataset is a widely used benchmark for the prediction of physical properties of molecules in equilibrium [11]. It consists of around 130k small organic molecules with up to 9 heavy atoms (C, O, N, and F). The properties are computed using density functional theory (DFT) calculations. Following [8], we randomly use 110000 molecules for training, 10000 for validation, and the rest for testing, and we evaluate the mean absolute error (MAE) of the target properties. To create the multiplex molecular graphs, we use the chemical bonds as the edges in the local layer and a cutoff distance to create the edges in the global layer.

PDBBind. PDBBind is a database of experimentally measured binding affinities for protein-ligand complexes [12]. It contains detailed 3D structures and associated inhibition constants K_i for the complexes. In our experiment, we use the PDBBind 2015 refined subset, which contains roughly 4K structures. In each complex, we exclude the protein residues that are more than 6Å from the ligand. We also remove all hydrogen atoms and use the remaining heavy atoms in the structure. The resulting complexes contain around 200 atoms on average. We split the dataset into training, validation, and testing sets by 8:1:1 and perform 10-fold cross-validation.
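For both datasets, the model input is the two-layer multiplex graph described earlier. Below is a minimal construction sketch using radius_graph from torch_cluster (the neighbor-search utility shipped with PyTorch Geometric); the function name build_multiplex_graph and the default cutoff values are illustrative, chosen to mirror the PDBBind settings given below.

```python
import torch
from torch_cluster import radius_graph  # shipped with PyTorch Geometric

def build_multiplex_graph(pos, bond_index=None, d_local=2.0, d_global=6.0):
    """Build the two-layer multiplex graph G = {G_l, G_g} for one molecule.

    pos:        (N, 3) Cartesian coordinates.
    bond_index: (2, E) covalent-bond edges; if given (as for QM9), the local
                layer uses bonds, otherwise a small cutoff (as for PDBBind).
    Returns the local and global edge_index tensors.
    """
    edge_index_g = radius_graph(pos, r=d_global)        # global layer
    if bond_index is not None:
        edge_index_l = bond_index                       # local layer from bonds
    else:
        edge_index_l = radius_graph(pos, r=d_local)     # local layer from cutoff
    return edge_index_l, edge_index_g

# Toy example: 4 atoms on a line, 1.5 Angstrom apart.
pos = torch.tensor([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [4.5, 0, 0]])
edge_l, edge_g = build_multiplex_graph(pos)
print(edge_l.size(1), edge_g.size(1))  # far fewer local edges than global ones
```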
The mean absolute error (MAE) of the binding free energy and the Pearson correlation coefficient (R) of logK_i are reported. To create the multiplex molecular graphs, we use a cutoff distance of 2Å in the local layer and 6Å in the global layer when defining the edges.

In our experiments, we use the following state-of-the-art models as baselines: SchNet [6], PhysNet [7], MEGNet-full [16], Cormorant [31], MGCN [32], and DimeNet [8]. On QM9, we use the results reported in the original works for the baselines. On PDBBind, we conduct the experiments based on the corresponding implementations. All of the experiments are run on an NVIDIA Tesla V100 GPU (32 GB). More details of the parameter settings and training setup are included in the appendix.

On the QM9 dataset, we test the performance of MXMNet under different configurations by changing the batch size BS and the cutoff distance d_g used in the global layer. As reported in Table 1, the MXMNet variants get better results than the baselines on 9 targets. We also compute the mean standardized MAE (std. MAE) as used in [8] to evaluate the overall performance of the models. MXMNet (BS=128, d_g=10Å) has the lowest std. MAE among all models and decreases the mean std. MAE by 13% compared to the previous best model, DimeNet. These results clearly demonstrate the excellent generalization power of MXMNet.

Ablation Study. Comparing the results of MXMNet (BS=32, d_g=5Å) and MXMNet (BS=128, d_g=5Å), we find that the effect of batch size on performance is small. With a relatively large batch size (128), the overall performance is slightly better than with a small batch size (32). Moreover, the large batch enables faster training. To investigate the effect of d_g on performance, we compare the results of the MXMNet variants that use different d_g in Table 1. With d_g=5Å, MXMNet gets better results than with d_g=10Å on the targets ZPVE, U_0, U, H, G, and c_v. This suggests that those properties benefit more from modeling a limited range of interactions than from simply increasing the interaction range. For the targets μ, ε_HOMO, ε_LUMO, Δε, and R², the performance of MXMNet is improved by using the larger d_g=10Å, which helps to capture longer-range interactions. Therefore, in practice, it is recommended to use different d_g for predicting different properties.

To further test whether the two proposed message passing modules (local layer and global layer) both contribute to the success of MXMNet, we conduct experiments using only one of the two modules, or even parts of a module. Table 2 shows that all of the ablations decrease the performance of MXMNet, validating that both message passing modules contribute to its power. When only the global layer message passing module is used, the ablation with only one message passing performs worse than the ablation with two message passings, which shows the effectiveness of capturing the two-hop neighbors. When only the local layer message passing module is used, the mean std. MAE increases significantly compared to the original MXMNet, suggesting that the local connections alone are not adequate for the task. The results also validate the necessity of capturing both one-hop angles and two-hop angles: the ablations with either kind alone perform worse than the ablation with both.
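Both kinds of angles are plain geometric quantities computed from the atomic coordinates before being expanded in the SBF basis. As a hedged illustration, the sketch below computes the angle α_kj,ji at the shared node j from Cartesian positions; the helper name edge_angle is ours, not from the paper.

```python
import torch

def edge_angle(pos, k, j, i):
    """Angle alpha_{kj,ji} at the shared node j between edges (k, j) and
    (j, i), computed from Cartesian positions via the dot product."""
    u = pos[k] - pos[j]
    v = pos[i] - pos[j]
    cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1))
    return torch.arccos(cos.clamp(-1.0, 1.0))

# Water-like geometry: the H-O-H angle is roughly 104.5 degrees.
pos = torch.tensor([[0.0000, 0.0000, 0.0],    # O (node 0)
                    [0.9572, 0.0000, 0.0],    # H (node 1)
                    [-0.2399, 0.9266, 0.0]])  # H (node 2)
print(torch.rad2deg(edge_angle(pos, k=1, j=0, i=2)))  # ~104.5
```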
Efficiency Evaluation. To evaluate the space and time efficiency of MXMNet, we first compare the memory consumption during training on QM9 for SchNet, PhysNet, DimeNet, and MXMNet. For the baselines, the model configurations are the same as those in their original papers. As illustrated in Figure 1, all three MXMNet variants use much less memory than DimeNet. SchNet and PhysNet consume less memory than MXMNet but perform worse, with higher mean std. MAEs. For time efficiency, we focus on the total training time. Note that the total training time is affected by all operations in the models, and different models need different computational time to pass one message; thus a smaller number of messages passed in a GNN does not guarantee a shorter training time. Instead, we find that the batch size significantly affects the training time: MXMNet benefits from large-batch training with BS=128 to achieve a 2.6× training speedup over DimeNet, which can only use BS=32 on our GPU. On the PDBBind dataset, whose molecules are much larger than those in QM9, MXMNet can still be trained on our GPU, whereas DimeNet with the same model configuration raises an out-of-memory error. As shown in Table 3, compared with SchNet and PhysNet, which do not have the memory issue, MXMNet outperforms them significantly with a higher Pearson R and a lower MAE. These results validate that our model is both powerful and memory-efficient enough to be used for macromolecules.

In this paper, we propose a powerful and efficient GNN, MXMNet, for predicting the properties of molecules. Our model significantly improves both the expressive power and the memory efficiency of GNNs for molecules. The novelty of MXMNet lies in its representation of molecules as a multiplex graph rooted in molecular mechanics. Experiments on the QM9 and PDBBind datasets demonstrate the power and efficiency of MXMNet compared with the state-of-the-art baselines. In future work, it would be interesting to address the dihedral angles in 3D molecules. It is also promising to use MXMNet as a general tool to learn the representations of molecules in more tasks. Moreover, since molecules can have multiple conformations, it remains unclear how these conformations affect our model and other related GNNs.

QM9. For the QM9 dataset, we use the source provided by [1]. Following the previous works [2, 3, 4, 5, 6], we process the QM9 dataset by removing about 3k molecules that fail a geometric consistency check or are difficult to converge [7]. For the properties U_0, U, H, and G, only the atomization energies are used, obtained by subtracting the atomic reference energies as in [8]. For the property Δε, we follow the same convention as the DFT calculation and predict it as ε_LUMO − ε_HOMO.

PDBBind. For the PDBBind dataset, we use the version included in the MoleculeNet benchmark for molecular machine learning [9]. We use logK_i, which is proportional to the binding free energy, as the target property. For the baselines used in the experiment on PDBBind, we find that the codes provided by the original papers are based on different frameworks: SchNet [3] is based on PyTorch [10], while PhysNet [4] and DimeNet [8] are based on TensorFlow [11].
To make fair comparisons and exclude differences brought by the frameworks, we adopt the implementations of SchNet and DimeNet provided by the widely used PyTorch Geometric library [12] for graph representation learning. Since DimeNet is built on PhysNet, we create an implementation of PhysNet by comparing the original implementations of the two models and changing the corresponding modules of the PyTorch Geometric DimeNet implementation. The code of our MXMNet is also built on that implementation.

The multi-layer perceptrons (MLPs) used in MXMNet all have 2 layers, taking advantage of the approximation capability of MLPs [13]. For all activation functions, we use the self-gated Swish activation function [14]. For the basis functions e_RBF and a_SBF, we use N_SHBF = 7, N_SRBF = 6, and N_RBF = 16. To initialize the learnable parameters, we use the default settings in PyTorch without assigning specific initializations, except for the input node embeddings h_g^(0) in the global layer: h_g^(0) are initialized with random values uniformly distributed between −√3 and √3.

In our experiment on QM9, we use single-target training following [8]: a separate model is trained for each target instead of a single shared model for all targets. The models are optimized by minimizing the mean absolute error (MAE) loss using the Adam optimizer [15]. We use a linear learning rate warm-up over 1 epoch and an exponential decay with ratio 0.1 every 600 epochs. The model parameters used for validation and testing are kept as an exponential moving average with a decay rate of 0.999. To prevent overfitting, we use early stopping on the validation loss. We repeat the runs 3 times for each MXMNet variant following [16].

In our experiment on PDBBind, for each model being investigated, we create three weight-sharing replica networks, one each for predicting the free energy G of the complex, the protein pocket, and the ligand, following [17]. The final target is computed as ΔG_complex = G_complex − G_pocket − G_ligand. The full model is trained by minimizing the mean squared error (MSE) loss between ΔG_complex and the true values using the Adam optimizer [15]. The learning rate is dropped by a factor of 0.2 every 50 epochs. Moreover, we perform 10-fold cross-validation and repeat the experiments 5 times for each model. The validation losses are used for early stopping. In Table 4, we list the most important hyperparameters used in our experiments.
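For concreteness, here is a sketch of the QM9 optimization setup described above (warm-up, exponential decay, and weight averaging). The per-epoch scheduling granularity and the helper names are assumptions, since the text does not specify them.

```python
import torch

model = torch.nn.Linear(16, 1)  # stand-in for MXMNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_epochs, decay_ratio, decay_every = 1, 0.1, 600

def lr_lambda(epoch):
    """Linear warm-up over 1 epoch, then 0.1x exponential decay per 600 epochs."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return decay_ratio ** (epoch / decay_every)

# Multiplies the base learning rate by lr_lambda(epoch); call scheduler.step()
# once per epoch during training.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Exponential moving average of the weights, used for validation and testing;
# call update_ema() after each optimizer step.
ema_decay = 0.999
ema = {k: v.detach().clone() for k, v in model.state_dict().items()}

def update_ema():
    for k, v in model.state_dict().items():
        ema[k].mul_(ema_decay).add_(v.detach(), alpha=1.0 - ema_decay)
```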
References
[1] The rise of deep learning in drug discovery.
[2] A comprehensive survey on graph neural networks.
[3] Convolutional networks on graphs for learning molecular fingerprints.
[4] Molecular graph convolutions: moving beyond fingerprints.
[5] Neural message passing for quantum chemistry.
[6] SchNetPack: A deep learning toolbox for atomistic systems.
[7] PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges.
[8] Directional message passing for molecular graphs.
[9] Heterogeneous molecular graph neural networks for predicting molecule properties.
[10] Molecular modeling and simulation: an interdisciplinary guide.
[11] Quantum chemistry structures and properties of 134 kilo molecules.
[12] The PDBBind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures.
[13] Learning convolutional neural networks for graphs.
[14] Semi-supervised classification with graph convolutional networks.
[15] Quantum-chemical insights from deep tensor neural networks.
[16] Graph networks as a universal machine learning framework for molecules and crystals.
[17] Principled multilayer network embedding.
[18] Scalable multiplex network embedding.
[19] Modeling relational data with graph convolutional networks.
[20] Representation learning for attributed multiplex heterogeneous network.
[21] Multi-dimensional graph convolutional networks.
[22] Layer communities in multiplex networks.
[23] Heterogeneous multi-layered network model for omics data integration and analysis.
[24] Abstract diagrammatic reasoning with multiplex graph networks.
[25] A graph to graphs framework for retrosynthesis prediction.
[26] GeniePath: Graph neural networks with adaptive receptive paths.
[27] SPAGAN: Shortest path graph attention network.
[28] MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing.
[29] Understanding deep learning requires rethinking generalization.
[30] How powerful are graph neural networks?
[31] Cormorant: Covariant molecular neural networks.
[32] Molecular property prediction: A multilevel quantum interactions modeling perspective.

Appendix references
[1] Quantum chemistry structures and properties of 134 kilo molecules.
[2] Neural message passing for quantum chemistry.
[3] SchNetPack: A deep learning toolbox for atomistic systems.
[4] PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges.
[5] Graph networks as a universal machine learning framework for molecules and crystals.
[6] Molecular property prediction: A multilevel quantum interactions modeling perspective.
[7] Prediction errors of molecular machine learning models lower than hybrid DFT error.
[8] Directional message passing for molecular graphs.
[9] MoleculeNet: a benchmark for molecular machine learning.
[10] PyTorch: An imperative style, high-performance deep learning library.
[11] TensorFlow: A system for large-scale machine learning.
[12] Fast graph representation learning with PyTorch Geometric.
[13] Multilayer feedforward networks are universal approximators.
[14] Searching for activation functions.
[15] Adam: A method for stochastic optimization.
[16] Cormorant: Covariant molecular neural networks.
[17] Atomic convolutional networks for predicting protein-ligand binding affinity.

This project has been funded by the National Institute of General Medical Sciences (R01GM122845) and the National Institute on Aging (R01AD057555) of the National Institutes of Health.