title: Few-Shot Graph Learning for Molecular Property Prediction
authors: Guo, Zhichun; Zhang, Chuxu; Yu, Wenhao; Herr, John; Wiest, Olaf; Jiang, Meng; Chawla, Nitesh V.
date: 2021-02-16
DOI: 10.1145/3442381.3450112
The recent success of graph neural networks has significantly boosted molecular property prediction, advancing activities such as drug discovery. Existing deep neural network methods usually require a large training dataset for each property, which impairs their performance in cases with a limited amount of experimental data (especially for new molecular properties), a situation common in practice. To this end, we propose Meta-MGNN, a novel model for few-shot molecular property prediction. Meta-MGNN applies a molecular graph neural network to learn molecular representations and builds a meta-learning framework for model optimization. To exploit unlabeled molecular information and address the task heterogeneity of different molecular properties, Meta-MGNN further incorporates molecular structure- and attribute-based self-supervised modules and self-attentive task weights into this framework, strengthening the whole learning model. Extensive experiments on two public multi-property datasets demonstrate that Meta-MGNN outperforms a variety of state-of-the-art methods. Drug discovery significantly benefits all human beings, especially for public health during the difficult period caused by COVID-19 [26]. Developing and discovering new drugs is a time-, resource-, and money-consuming process. A key step is to test a large number of molecules for therapeutic activity through extensive biological studies [23]. Unfortunately, the discovered molecules often fail to become approved drug candidates for various reasons such as low activity or toxicity [30]. Researchers therefore need to select a large number of similar molecules as potential candidates and test them through a complex experimental process to find those with the same efficacious property. After that, only a few molecules, or even none, will remain as possible drug candidates to be tested further for risk and pharmaceutical activity. It is thus crucial to improve the effectiveness of filtering the most likely drug candidates before conducting wet-lab experiments, wasting less time and fewer resources on molecules that are unlikely to proceed to the lead stage. This concept is generally described as "fail early-fail cheap". Virtual screening is a widely used approach to screen out molecules likely to fail early, which avoids investigating a large set of molecules experimentally [22, 23]. Recent advances in deep learning have played an important role in virtual screening. These deep learning techniques have inspired novel approaches to a better understanding of molecules and their properties through molecular representation learning [7, 10, 35, 43, 45]. Deep neural networks learn more about specific molecular properties when they are fed more instances during training. Thus, deep learning models require a large amount of training data to achieve the desired capability and satisfactory performance [3]. However, it is common that only a few known molecules share the same set of properties [1, 30]. We analyzed the datasets in MoleculeNet [33], a well-known benchmark for predicting molecular properties.
We find that more than half of the properties are shared by fewer than 100 molecules across several datasets. This is an instance of the well-known few-shot data problem, which seriously impairs the performance of current approaches. Therefore, it is essential to develop a deep neural model that predicts molecular properties effectively in few-shot scenarios. Several challenges need to be overcome to achieve this goal. A molecule can be considered a heterogeneous structure where each atom connects to different neighboring atoms via different types of bonds. Previous work [29] represents molecules as SMILES strings and leverages sequence models [21, 29] to learn molecular embeddings. This approach is not able to capture the information in each bond well [32], because bonds in molecules not only represent connections between different atoms but also contain attribute information that characterizes the bond type, such as single, double, or triple. Thus, the first challenge is to design a deep neural network that can discover effective molecular representations from few-shot data. Because only a limited amount of labeled molecular property data is available, the second challenge is to exploit the useful unlabeled information in molecule data and further develop an efficient learning procedure that transfers knowledge from other property prediction tasks, so that the model can quickly adapt to novel (new) molecular properties with limited data. Moreover, different molecular properties can correspond to quite different molecular structures, so their data should be treated differently in the knowledge transfer process. The third challenge is to distinguish the different importance of molecular properties when performing this learning procedure. To address the above challenges, we propose a novel model called Meta-MGNN for few-shot molecular property prediction. First, we leverage a graph neural network with a pre-training process to fuse heterogeneous molecular graph information into molecular embeddings. Then, we develop a meta-learning framework to transfer knowledge from different property prediction tasks and obtain a well-initialized model that can quickly adapt to a new molecular property with limited data. To exploit and capture unlabeled information in molecule data, we design a self-supervised module that consists of a bond reconstruction loss and an atom type prediction loss, accompanying the main property prediction loss. Moreover, considering that different property prediction tasks contribute differently to the few-shot learner, we further introduce self-attentive task weights to measure their importance. Both the self-supervised module and the self-attentive task weights are incorporated into the meta-learning procedure to strengthen the model.
Contributions. To summarize, the main contributions of this work are as follows:
• We formulate molecular property prediction as a few-shot learning problem, which exploits the rich information across various properties to address the lack of laboratory data for each individual property.
• To deal with the few-shot challenge, we propose a novel model called Meta-MGNN by exploring graph neural networks, self-supervised learning, and task-weight-aware meta-learning.
• We conduct extensive experiments on two public datasets, and the evaluation results demonstrate the superior performance of Meta-MGNN over state-of-the-art methods.
The effectiveness of each model component is also verified.
In this section, we review existing work on graph neural networks, few-shot learning, and molecular property prediction.
Graph Neural Networks (GNNs). GNNs have gained increasing popularity due to their capability of modeling graph-structured data [9, 27, 39]. Typically, a GNN model uses a neighborhood aggregation function to iteratively update the representation of a node by aggregating the representations of its neighboring nodes and edges. GNNs have shown strong performance in various applications, such as recommendation systems [4, 24], behavior modeling [38], and anomaly detection [44]. Molecular property prediction is also a popular application of GNNs, since a molecule can be represented as a topological graph by treating atoms as nodes and bonds as edges [7, 8, 18, 20]. We elaborate on these methods in the next paragraph.
Molecular Property Prediction. Methods can be categorized into two main groups based on the input molecular representation: (1) molecular graphs, and (2) simplified molecular-input line-entry system (SMILES) strings [31]. In the first group, each molecule is represented as a graph whose atom nodes are interconnected by bond edges. One typical way is to employ graph neural networks to learn molecular representations [7, 8, 10, 18, 20]. For example, Lu et al. [18] proposed a novel hierarchical GNN that includes an embedding layer, a Radial Basis Function layer, and an interaction layer to learn molecular representations at different levels. Hu et al. [10] proposed several novel pre-training strategies that pre-train GNNs at the level of individual nodes and the entire graph to learn local and global molecular representations simultaneously. In the second group, SMILES is a sequence notation for describing the structure of molecules. Researchers treat molecules as sequences and adopt language models to learn their representations [29, 43, 45]. For example, Zhang et al. [43] proposed a semi-supervised Seq2Seq fingerprint model with three ends: one input, one supervised output, and one unsupervised output. Zheng et al. [45] presented a new model that studies structure-property relationships through self-attentive analysis of the linear notation syntax. Guo et al. [8] proposed a novel graph and sequence fusion learning model that captures information from both the molecular graph structure and SMILES. Here, we treat each molecule as a graph, as it preserves the molecular inner structure better, and employ graph neural networks to learn the representations.
Few-shot Learning. Few-shot learning has achieved success in various application domains such as computer vision [5, 11] and graph learning [6, 36, 37, 40, 41]. There are two notable types of few-shot learning approaches: (1) metric-based learning and (2) gradient-based learning. The former learns a generalizable metric to compare and match few examples [2, 25, 28]. Vinyals et al. [28] proposed a novel matching metric, named Matching Nets, to match unlabeled examples to the classes of few-shot labeled examples. Sung et al. [25] proposed the relation network, which learns a deep distance metric to compute relation scores between different images and further classify them. The latter employs a meta-learner to learn well-initialized parameters of the base model across different tasks [5, 15, 42]. For instance, Finn et al.
[5] proposed MAML, which designs such a meta-learner to effectively initialize a base learner that can quickly adapt to new tasks. In this work, our few-shot learning strategy is gradient-based.
In this section, we first define the few-shot molecular property prediction problem, then present the details of using a graph neural network (GNN) to learn molecular representations. Let G = (V, E) denote a molecular graph, where V is the set of nodes and E ⊆ V × V is the set of edges. In particular, a node in a molecular graph represents a chemical atom and an edge represents a chemical bond between two atoms. Given a set of molecular graphs G = {G_1, · · · , G_N} and their labels Y = {y_1, · · · , y_N}, the goal of molecular property prediction is to learn a molecular representation vector for predicting the label (i.e., molecular property) of each G ∈ G, i.e., to learn a mapping function f : G → Y. Unlike previous studies where there are enough examples for each new property prediction task, this work considers a more practical scenario in which only few-shot samples are given. Specifically, we aim to develop a classifier that can quickly adapt to predict new molecular properties that are unseen during the training process, given only a few samples of these new properties.
[Figure 1. Overview of Meta-MGNN. (a) Meta-learning framework: for each training task, the support set is fed into the GNN (parameterized by θ), and the support loss L_T is calculated and used to update the GNN parameters to θ′. Next, the examples in the corresponding query set are fed into the GNN parameterized by θ′ to calculate the query loss L′_T for this task. The same process repeats for the other training tasks. Finally, the summation of L′_T over all sampled tasks is used to further update the GNN parameters for testing. (b) Self-supervised module: it includes bond reconstruction and atom type prediction. The orange part shows that two atoms are sampled and the GNN predicts whether there is a bond between them. The green part shows that several atoms are masked randomly and the GNN predicts their types. (c) Task-aware attention: each task is represented by the average of all molecular embeddings from its query set. With each task embedding, a self-attentive layer computes a per-task weight, which is incorporated into the meta-training process when updating the model parameters θ.]
Formally, the problem is defined as follows.
Problem 1. Few-Shot Molecular Property Prediction. Given molecular properties Y = {y_1, · · · , y_M} and their corresponding few-shot molecular graph sets {G_1 ∈ y_1, · · · , G_M ∈ y_M} (training data), the task is to design a machine learning model to predict molecular graphs of new properties that only have few-shot examples (test data).
By viewing molecular structure as graph data (i.e., a molecular graph), recent deep learning methods for graphs, such as graph neural networks (GNNs) [32, 39], can be utilized to learn molecular representations that are fed to downstream machine learning models for molecular property prediction [17]. In this section, we present the details of employing GNNs to obtain molecular representations. A GNN model is able to utilize both graph structure and node/edge feature information to learn a representation vector h_v for each node v ∈ V. Specifically, a GNN model uses a neighborhood aggregation function to iteratively update the representation of a node by aggregating the representations of its neighboring nodes and edges. After K iterations, a node representation h_v^(K) captures the information within the node's K-hop neighborhood.
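To make the neighborhood-aggregation scheme concrete, the following minimal PyTorch sketch shows one such layer under our own simplifying assumptions (sum aggregation over neighbors, a single linear transform, LeakyReLU activation); it is an illustration, not the paper's exact architecture, which instantiates the aggregation with GIN [34] as described next.

```python
import torch
import torch.nn as nn

class NeighborhoodAggregationLayer(nn.Module):
    """One GNN layer: update each node from its own state plus the
    summed messages (neighbor state + bond feature) of its neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)
        self.act = nn.LeakyReLU()

    def forward(self, h, edge_index, e):
        # h: [num_nodes, dim] node states; e: [num_edges, dim] bond features;
        # edge_index: [2, num_edges] holding (source, target) node indices.
        src, dst = edge_index
        messages = h[src] + e                                   # neighbor state combined with bond feature
        agg = torch.zeros_like(h).index_add_(0, dst, messages)  # sum messages arriving at each node
        return self.act(self.linear(torch.cat([h, agg], dim=-1)))
```

Stacking K such layers produces the h_v^(K) representations referred to above.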
In a molecular graph, each node represents an atom and each edge represents a chemical bond between two atoms. In the input layer of the GNN, we first initialize the representations of both nodes and edges using their attributes in the molecular graph. The node attributes include the atom number (AN) and chirality tag (CT), and the edge attributes include the bond type (BT) and bond direction (BD). Formally, we initialize the node and edge representations as
h_v^(0) = x_v^AN ⊕ x_v^CT,  e_uv^(0) = x_uv^BT ⊕ x_uv^BD,  (1)
where x_v and x_uv denote node/edge attributes and ⊕ is the concatenation operator. Then, the node representation h_v^(k) at the k-th layer of the GNN is formulated as:
h_v^(k) = σ( Agg( { (h_u^(k−1), e_uv) : u ∈ N(v) }, h_v^(k−1) ) ),  (2)
where N(v) is the neighbor set of v, σ(·) is a non-linear activation function (e.g., LeakyReLU), and Agg(·) is an aggregation function. A number of architectures for Agg(·) have been proposed in recent years, such as the graph convolutional network (GCN) [12] and the graph attention network (GAT) [27]. Here, we use the graph isomorphism network (GIN) [34], which has demonstrated state-of-the-art performance on a variety of benchmark tasks. After K layers, we obtain the representation of each node in the molecular graph. To obtain the graph-level representation h_G for a molecular graph, we average the node embeddings at the final layer:
h_G = Mean( { h_v^(K) : v ∈ V } ).  (3)
The graph-level molecular representation h_G can be further fed into a classifier (e.g., a multi-layer perceptron) for molecular property prediction, as we present in the next section.
Pre-trained Molecular Graph Neural Network. Pre-trained models have been widely used in natural language processing, computer vision, and graph analysis in recent years [3, 10]. In general, pre-training allows a model to learn universal representations, provides a better parameter initialization, and avoids overfitting on downstream tasks with small training data. Models with pre-training have been shown to achieve superior performance over models without it. Therefore, we leverage the recent pre-trained graph neural network technique (PreGNN) [10] to obtain the parameter initialization of the molecular graph neural network.
In this section, we present the details of the proposed Meta-MGNN for few-shot molecular property prediction. Meta-MGNN is built on the MGNN and employs a meta-learning framework for model initialization and adaptation. A molecular structure and feature based self-supervised module and self-attentive task weights are further incorporated into this framework for model enhancement. We build the meta-learning framework based on MAML [5]. Consider the model with learnable parameters θ that maps a molecular graph to specific properties such as toxicity, i.e., f_θ : G → Y. In meta-learning, the model is expected to adapt to a number of different tasks, i.e., predicting different kinds of molecular properties. In particular, in K-shot meta-learning, for each task T sampled from a distribution p(T), the model is trained using only K data samples and further tested on the remaining data samples of T. In this setting, we refer to the corresponding training and test sets of each task as the support set and the query set, denoted as S_T = {G, Y} and Q_T = {G′, Y′}, where G, Y are the support-set molecular graphs and property labels, and G′, Y′ are the query-set molecular graphs and property labels. During meta-training, the model is first updated to a task-specific model using the support set of each task, then further optimized to a task-agnostic model using the prediction loss over the query sets of all tasks in the training data.
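The following PyTorch sketch illustrates this episodic meta-training loop (the inner and outer updates are formalized as Eq. (4) and Eq. (10) below). The functional-style `task_loss(model, params, task, split)` helper, which runs the GNN with an explicit parameter list (e.g., via `torch.func.functional_call`), is a hypothetical interface of our own; the paper's actual implementation may differ.

```python
import torch

def meta_train_step(model, meta_opt, tasks, task_loss, alpha=0.01):
    """One MAML-style outer update over a batch of sampled tasks:
    adapt on each support set, then update the shared parameters
    from the summed query losses."""
    meta_opt.zero_grad()
    theta = list(model.parameters())
    query_loss_sum = 0.0
    for task in tasks:
        # Inner step: adapt theta -> theta' on the support set.
        support_loss = task_loss(model, theta, task, split="support")
        grads = torch.autograd.grad(support_loss, theta, create_graph=True)
        theta_prime = [p - alpha * g for p, g in zip(theta, grads)]
        # Evaluate the adapted parameters on the query set.
        query_loss_sum = query_loss_sum + task_loss(model, theta_prime, task, split="query")
    # Outer step: backpropagate the summed query losses to the original theta.
    query_loss_sum.backward()
    meta_opt.step()
```

In practice the inner loop can take several gradient steps, and Meta-MGNN additionally weights each task's query loss, as described below.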
After sufficient training, the learned model can be further utilized to predict new tasks (new molecular properties) with only K data samples as the support set, which is called meta-testing. To avoid data overlap, the data of tasks used for meta-testing are held out during meta-training. The whole framework is illustrated in Figure 1(a). In meta-training, the goal is to obtain well-initialized model parameters that are generally applicable to different tasks, and to explicitly encourage the initialized parameters to perform well after a small number of gradient descent updates on a new task with few-shot data. When adapting to a task T, we begin by feeding the support set to the model and calculating the loss L_T to update the parameters θ to θ′ through gradient descent:
θ′ = θ − α ∇_θ L_T(θ),  (4)
where α is the step size. Note that Eq. (4) only shows a one-step gradient update, while we can take multi-step gradient updates in practice.
Loss Function. Typically, the above loss L_T is calculated from the supervised signals of downstream tasks [5, 46], i.e., molecular property labels in this study. However, simply using supervised signals may not be effective, since only a few samples are given for each task. In addition, the complexity of molecules inherently brings useful unlabeled information in both structure and attributes. Therefore, to enhance the above meta-training process, we propose to exploit and leverage the unlabeled information in molecular graphs. In particular, we design a self-supervised module that consists of a bond reconstruction loss and an atom type prediction loss, accompanying the property prediction loss.
Molecular Property Prediction Loss. To predict a molecular property, we introduce a multi-layer perceptron (MLP) on top of the graph-level molecular representation h_G (Eq. (3)), i.e., ŷ = MLP(h_G). The prediction loss is defined as the cross-entropy loss between the predicted labels and the ground-truth labels:
L_p = − (1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],  (5)
where N is the number of data samples.
Bond Reconstruction Loss. To perform bond reconstruction in molecular graphs, we first sample a set of positive edges (existing bonds) in the molecular graph, then sample a set of negative edges (non-existing bonds) by choosing node pairs that do not have an edge in the original molecular graph. We denote E_s as the union of the sampled positive and negative edges. In practice, we set |E_s| = 10, including 5 positive samples and 5 negative samples. The bond reconstruction score is computed by the inner product of the embeddings of the sampled pair of nodes, i.e., ŷ_uv = h_u⊤ · h_v. The bond reconstruction loss is defined as the binary cross-entropy loss between the predicted bonds and the ground-truth bonds:
L_b = − (1/|E_s|) Σ_{(u,v) ∈ E_s} [ y_uv log ŷ_uv + (1 − y_uv) log(1 − ŷ_uv) ].  (6)
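As a concrete illustration, here is a minimal sketch of the loss in Eq. (6) under our stated assumptions (5 positive and 5 negative pairs; for brevity the negative sampler draws random pairs without verifying that they are truly non-bonded, which a faithful implementation should do):

```python
import torch
import torch.nn.functional as F

def bond_reconstruction_loss(h, pos_edges, num_nodes, k=5):
    """Self-supervised bond reconstruction: score sampled node pairs by the
    inner product of their embeddings (y_uv = h_u^T h_v) and apply binary
    cross entropy against bond-existence labels.
    h: [num_nodes, dim] node embeddings; pos_edges: [2, E] existing bonds."""
    idx = torch.randperm(pos_edges.size(1))[:k]       # k positive (existing) bonds
    pos = pos_edges[:, idx]
    neg = torch.randint(0, num_nodes, (2, k))         # k random pairs as negatives (simplified)
    pairs = torch.cat([pos, neg], dim=1)              # [2, 2k]
    labels = torch.cat([torch.ones(k), torch.zeros(k)])
    scores = (h[pairs[0]] * h[pairs[1]]).sum(dim=-1)  # inner-product scores, [2k]
    return F.binary_cross_entropy_with_logits(scores, labels)
```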
Atom Type Prediction Loss. In a molecule, different atoms are connected in certain ways (e.g., carbon-carbon bonds, carbon-oxygen bonds), leading to different molecular structures. The atom type determines how a node in the molecular graph connects with its neighboring nodes. Thus, we utilize the contextual sub-graph of a node (atom) to predict its type. Specifically, we first sample a set of nodes in a molecular graph, denoted as V_s ⊆ V. For each node v in V_s, the contextual sub-graph is defined over its neighborhood, where N(v) represents the set of neighboring nodes of node v. In practice, we sample 15% of the nodes in the graph into V_s and use 1-hop neighborhoods. We then apply a multi-layer perceptron (MLP) on top of the mean pooling of all nodes in the contextual sub-graph, excluding the central node, and the atom type prediction loss is formulated as the cross-entropy loss between the predicted node type and the ground-truth node type:
t̂_v = MLP( Mean( { h_u : u ∈ N(v) } ) ),  (7)
L_a = CrossEntropy( t̂_v , t_v ).  (8)
The self-supervised module (of both bond reconstruction and atom type prediction) is illustrated in Figure 1(b).
Joint Loss. The loss for task T in the meta-training process is formulated as the summation of the above three losses,
L_T = L_p + λ1 L_b + λ2 L_a,  (9)
where λ1 and λ2 are trade-off parameters that control the importance of the different losses. In practice, we set λ1 = λ2 = 0.1.
Task-aware Attention. With the new model parameters θ′ obtained from the support-set data of task T, i.e., θ′ = θ − α∇_θ L_T(θ), the model is further updated as follows:
θ ← θ − β ∇_θ Σ_{T ∼ p(T)} L′_T(θ′),  (10)
where β is the meta-learning rate and L′_T is the joint loss over the query set of T. In other words, the model parameters are further updated over the losses of all sampled tasks through gradient descent. Traditional meta-learning methods (e.g., MAML [5]) treat each task with the same weight when optimizing the meta-learner (i.e., μ_T is the same for all tasks), which cannot reflect how important the different property prediction tasks are. Therefore, considering that different property prediction tasks contribute differently to the meta-learner optimization, we further introduce a self-attentive weight to measure task importance. In particular, we use the self-attentive mechanism [16] to calculate the importance of each task:
μ_T = softmax_{T ∈ T_s}( w_2 tanh( W_1 H_T⊤ ) ),  (11)
where T_s is the set of all tasks and H_T denotes the task embedding, computed by averaging all molecular embeddings of T. The task weights μ_T then scale the per-task query losses in Eq. (10) during meta-training. Figure 1(c) illustrates the details of the self-attentive task weight computation. The meta-training process is described in Algorithm 1. During meta-testing, we first utilize the few-shot support set of new tasks to update the parameters of Meta-MGNN via one or a small number of gradient descent steps using Eq. (4), then evaluate the performance on the query set.
In this section, we conduct extensive experiments on two public datasets (Tox21 and Sider) to compare the performance of different models and present related analyses. We evaluate the different methods on the Tox21 and Sider datasets, which are collected from MoleculeNet [33], a large-scale benchmark for molecular machine learning. Tox21 has 7,831 instances with 12 different tasks; Sider has 1,427 instances with 27 different tasks. In each task, molecules are divided into positive and negative instances (i.e., binary labels). A positive instance means that a molecule has a specific property, and a negative instance means that it does not. We manually hold out 3 tasks from Tox21 and 6 tasks from Sider for meta-testing. The details of the two datasets are as follows:
• Tox21: toxicity on 12 biological targets, including nuclear receptors and stress response pathways.
• Sider: adverse drug reactions of marketed drugs, grouped into 27 system organ classes [13].
Dataset Processing. The raw molecules are given as SMILES strings. We convert the SMILES strings to molecular graphs using RDKit.Chem [14]. We then extract a set of node and bond features that best preserve the molecular structure for use in the experiments. The features are listed in Table 1.
[Table 1. Node features: atom number (AN), chirality tag (CT). Bond features: bond type (BT), bond direction (BD: -, EndUpRight, EndDownRight).]
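For illustration, a minimal sketch of this SMILES-to-graph conversion with RDKit follows; the tensor layout and the raw integer encoding of the Table 1 attributes are our own assumptions, not the paper's exact featurization.

```python
from rdkit import Chem
import torch

def smiles_to_graph(smiles):
    """Convert a SMILES string into tensors holding the Table 1 attributes:
    per-atom (atom number, chirality tag) and per-bond (bond type, bond direction)."""
    mol = Chem.MolFromSmiles(smiles)
    atom_feats = [[atom.GetAtomicNum(), int(atom.GetChiralTag())]
                  for atom in mol.GetAtoms()]
    edges, bond_feats = [], []
    for bond in mol.GetBonds():
        u, v = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        f = [int(bond.GetBondType()), int(bond.GetBondDir())]
        edges += [(u, v), (v, u)]      # both directions: molecular graphs are undirected
        bond_feats += [f, f]
    x = torch.tensor(atom_feats, dtype=torch.long)          # [num_atoms, 2]
    edge_index = torch.tensor(edges, dtype=torch.long).t()  # [2, 2 * num_bonds]
    edge_attr = torch.tensor(bond_feats, dtype=torch.long)  # [2 * num_bonds, 2]
    return x, edge_index, edge_attr
```

These integer attributes would then be mapped to the shared initial embeddings of Eq. (1), e.g., via nn.Embedding lookups, so that identical feature values share the same initial embedding.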
We compare our model with multiple baseline models:
• GraphSAGE [9]. It generates node embeddings by sampling and aggregating the embeddings of their neighbors, which effectively captures graph information.
• GCN [12]. It is a widely used graph-based model containing an effective convolutional neural network component. GCN outperforms various models by learning both the local graph structure and node features.
• MAML [5]. It builds a task-agnostic algorithm for few-shot learning, where training a model's parameters with a small number of gradient updates leads to fast learning on new tasks.
• Seq3seq [43]. It is a Seq2Seq model for molecular property prediction whose loss function contains both a self-recovery loss and an inference task loss.
• EGNN [11]. It is an edge-labeling graph neural network for few-shot learning, which has been shown to generalize well in low-data settings.
• PreGNN [10]. This model develops self-supervised learning to pre-train a GNN for molecular property prediction, capturing both useful local and global information.
We evaluate the performance of each model using ROC-AUC. We consider each molecular property as an independent task for few-shot learning. We use 3 and 6 tasks as the test tasks of Tox21 and Sider, respectively. Each task is a binary classification task. Table 2 and Figure 3 report the result for each test task. For both datasets, we consider 2-way classification with 1 and 5 shots. We take the graph isomorphism network (GIN) [34] as the base graph neural network. In our experiments, we utilize the supervised-contextpred pre-trained GIN of PreGNN [10]. The number of GIN layers is set to 5. We set all embedding dimensions to 300. The same feature shares the same initial embedding. We set the number of update steps to 5 for training tasks and 10 for testing tasks. We set the trade-off weight of the self-supervised module to 0.1. We implement the model in PyTorch and run it on a GPU.
Overall Performance. The overall performance of all methods is reported in Table 2. According to the table, Meta-MGNN outperforms all baseline models on both the Tox21 and Sider datasets. Specifically, for 1-shot learning, the average improvements are +1.04% on Tox21 and +1.80% on Sider; the values are +0.84% and +1.87% for 5-shot learning. In addition, we observe that PreGNN [10] and EGNN [11] perform the best among all baseline methods on average. However, the baseline methods do not perform stably across different tasks: they may perform well on one task but poorly on another. In comparison, the performance of Meta-MGNN is stable; it achieves the best performance on all tasks in both datasets.
Analyzing the Meta-MGNN Structure. MAML demonstrates superior performance to the other two GNN models (GraphSAGE and GCN). This makes sense, since MAML trains the model through meta-learning, which makes it adapt better to new tasks with few data samples. Besides taking advantage of meta-learning, Meta-MGNN also utilizes the pre-trained graph neural network model (PreGNN) [10] to initialize the model parameters. PreGNN uses a large amount of molecular data to pre-train the graph neural network, which leads to better parameter initialization. Therefore, by taking advantage of both meta-learning and pre-training, Meta-MGNN demonstrates superior performance to the other baseline methods.
From Table 2 , we can observe that our proposed Meta-MGNN outperforms the best baseline method by +1.80% for 1-shot learning and +1.87% for 5-shots learning on Sider dataset. However, the average improvements on Tox21 are +1.04% for 1-shot learning and +0.84% for 5-shots learning, [10] , and MAML [5] . The blue dots denote negative labels in SR-MMP (a molecular property). The orange dots represent positive labels in SR-MMP. Our model can better discriminate embeddings of these two kinds of labels than the other methods. which are smaller than those in Sider. As we know, the major advantage of few-shot learning model is to make model predict new tasks better by using a small number of data samples. Obviously, training with more tasks allows the model to learn more knowledge. The advantages of the few-shot learning model can be better reflected on the dataset which contains more tasks. Since Sider has more tasks than Tox21, Meta-MGNN can deliver greater improvements on Sider than on Tox21. Additionally, we also find that the overall performance on Tox21 is better than that on Sider for all models. This is due to the larger size of Tox21, which improves the generalization capabilities of these deep learning models, as reflected by the evaluation scores. Settings. Besides comparing with baseline methods, we also implement model variants (ablation studies) to show the effectiveness of different model components. The details of different model variants are illustrated as follows (also shown in the Table 3 ): • M1. Pre-trained graph neural network. We take GIN [34] as our base graph neural network and pre-train it by both supervised and unsupervised (Context Prediction) pre-training strategies. • M2. Graph neural network model (without pre-training) trained with the meta learning process. • M3. Our base model, which is based on GIN [34] and learned with meta-learning algorithms. This model is also pre-trained by supervised and unsupervised (Context Prediction) pre-training. These models are based on M3 and augmented with the self-supervised module. M4, M5, and M6 are augmented with the bond reconstruction, the atom-type prediction, and both of them, respectively. This is to analyze the effectiveness of the self-supervised module. • M7. It is based on M3 and enhanced with task-aware attention to incorporate the importance of different tasks. This is to analyze the effectiveness of task weight in meta-learning. • M8. It is based on M3 and augmented with both both self-supervised module and self-attentive task weight. Performance Comparison and Analysis. The performances of all model variants are shown in Figure 2 . The three sub-figures in the first row are model performance on Tox21. The sub-figures in the second row and third row are model performance on Sider. There are several findings from these figures. First, M2 has the worst results in all cases, illustrating the significant impact of the pretraining step for graph neural networks. Second, the performance of M3 is better than M1 and M2, which indicates the effectiveness of combining both the pre-training and few-shot learning strategies. Third, adding different self-supervised components (bond reconstruction and atom type prediction) can further improve model performance, as reflected by the better performances of M4, M5, and M6 over M3. Among these three variants, M6 has the best performance as it adds both self-supervised tasks. 
Additionally, M7 outperforms M3, demonstrating the benefit of incorporating task-attention weights into the meta-learning process. Finally, M8 (the proposed model) has the best performance in most cases, which demonstrates the strong capability of a graph neural network trained with the meta-learning process and augmented with both the self-supervised module and task-aware attention. From these findings, we conclude that the different model components indeed benefit the model design and improve performance. To further show the effectiveness of our model, we visualize the molecular embeddings generated by our proposed Meta-MGNN, PreGNN [10], and MAML [5] using t-SNE [19], as shown in Figure 3. Specifically, it shows the embedding results on the test data of SR-MMP (a molecular property). The blue dots and orange dots represent molecules without and with the SR-MMP property, respectively. It can be observed that our model discriminates the two kinds of molecules better than the other two models. In Figure 3(a), the bottom left corner of the figure is mostly occupied by orange dots and the blue ones are mostly in the upper right corner, whereas most orange dots are mixed with blue dots in Figures 3(b) and 3(c).
In this work, we proposed a few-shot learning approach for the molecular property prediction problem, which is important and has not been well studied. We proposed a novel model called Meta-MGNN. Meta-MGNN utilizes a graph neural network (with pre-training) to learn molecular embeddings and further employs a meta-learning process to learn well-initialized model parameters that can quickly adapt to new molecular properties with few-shot data samples. A self-supervised module and self-attentive task weights were further proposed and incorporated into the meta-learning framework, which benefits the whole model. We evaluated our model on two public multi-task datasets, and the experimental results showed that our model outperforms state-of-the-art methods. The effectiveness of each model component was also verified. The initial success of this study suggests several follow-up directions. Future work might consider a better task embedding formulation when computing task weights in meta-learning. It is also possible to fuse graph and sequence models to learn molecular embeddings for the meta-learning process.
References
[1] Low data drug discovery with one-shot learning
[2] Meta-learning with differentiable closed-form solvers
[3] BERT: Pre-training of deep bidirectional transformers for language understanding
[4] Graph neural networks for social recommendation
[5] Model-agnostic meta-learning for fast adaptation of deep networks
[6] Few-shot learning with graph neural networks
[7] Neural message passing for quantum chemistry
[8] GraSeq: graph and sequence fusion learning for molecular property prediction
[9] Inductive representation learning on large graphs
[10] Strategies for pre-training graph neural networks
[11] Edge-labeling graph neural network for few-shot learning
[12] Semi-supervised classification with graph convolutional networks
[13] The SIDER database of drugs and side effects
[14] RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling
[15] Gradient-based meta-learning with learned layerwise metric and subspace
[16] A structured self-attentive sentence embedding
[17] Chemi-Net: a molecular graph convolutional network for accurate drug property prediction
[18] Molecular property prediction: A multilevel quantum interactions modeling perspective
[19] Visualizing data using t-SNE
[20] Molecular geometry prediction using a deep generative graph neural network
[21] Distributed representations of words and phrases and their compositionality
[22] Similarity maps: a visualization strategy for molecular fingerprints and machine-learning methods
[23] Computational methods in drug discovery
[24] Session-based social recommendation via dynamic graph attention networks
[25] Learning to compare: relation network for few-shot learning
[26] Applications of machine learning in drug discovery and development
[27] Graph attention networks
[28] Matching networks for one shot learning
[29] SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction
[30] An analysis of the attrition of drug candidates from four major pharmaceutical companies
[31] SMILES. 2. Algorithm for generation of unique SMILES notation
[32] A comprehensive survey on graph neural networks
[33] MoleculeNet: a benchmark for molecular machine learning
[34] How powerful are graph neural networks?
[35] Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery
[36] Learning from multiple cities: A meta-learning approach for spatial-temporal prediction
[37] Graph few-shot learning via knowledge transfer
[38] Identifying referential intention with heterogeneous contexts
[39] Heterogeneous graph neural network
[40] Few-shot knowledge graph completion
[41] Few-shot multi-hop relation reasoning over knowledge bases
[42] MetaGAN: An adversarial approach to few-shot learning
[43] Seq3seq fingerprint: towards end-to-end semi-supervised deep drug discovery
[44] Early anomaly detection by learning and forecasting behavior
[45] Identifying structure-property relationships through SMILES syntax analysis with self-attention mechanism
[46] Meta-GNN: On few-shot node classification in graph meta-learning
This work was supported in part by National Science Foundation grant CCI-1925607. We thank the anonymous referees for their valuable comments and helpful suggestions.