title: DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations
authors: Ji, Yuanfeng; Zhang, Lu; Wu, Jiaxiang; Wu, Bingzhe; Huang, Long-Kai; Xu, Tingyang; Rong, Yu; Li, Lanqing; Ren, Jie; Xue, Ding; Lai, Houtim; Xu, Shaoyong; Feng, Jing; Liu, Wei; Luo, Ping; Zhou, Shuigeng; Huang, Junzhou; Zhao, Peilin; Bian, Yatao
date: 2022-01-24

AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with noise, which is inevitable in real-world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug-target binding affinity prediction, which involves both a macromolecule (protein target) and a small molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers an automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since molecular data are often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems. Extensive empirical studies have shown a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that allow for OOD generalization under noise for AIDD.

The traditional drug discovery process is extremely time-consuming and expensive. Typically, the development of a new drug takes nearly a decade and costs about $3 billion (Pushpakom et al., 2019), whereas about 90% of experimental drugs fail during lab, animal or human testing. Meanwhile, the number of drugs approved every year per dollar spent on development has plateaued or decreased for most of the past decade (Nosengo, 2016). To accelerate the development of new drugs, drugmakers and investors have turned their attention to artificial intelligence (Muratov et al., 2020) techniques for drug discovery, which aim at rapidly identifying new compounds and modeling complex mechanisms in the body to automate previously manual processes (Schneider, 2018).
The applications of AI-aided drug discovery are being continuously extended in the pharmaceutical field, ranging from ADMET prediction (Wu et al., 2018; Rong et al., 2020), target identification (Zeng et al., 2020; Mamoshina et al., 2018), protein structure prediction and protein design (Jumper et al., 2021; Baek et al., 2021; Gao et al., 2020), retrosynthetic analysis (Coley et al., 2017; Segler et al., 2018; Yan et al., 2020a), search for antibiotics (Stokes et al., 2020), generative chemistry (Sanchez-Lengeling et al., 2017; Simonovsky & Komodakis, 2018), and drug repurposing for emerging diseases (Gysi et al., 2021) to virtual screening (Hu et al., 2016; Karimi et al., 2019; Lim et al., 2019). Among them, virtual screening is one of the most important yet challenging applications. The aim of virtual screening is to pinpoint a small set of compounds with high binding affinity for a given target protein in the presence of a large number of candidate compounds. A crucial task in solving the virtual screening problem is to develop computational approaches to predict the binding affinity of a given drug-target pair, which is the main task studied in this paper.

In the field of AI-aided drug discovery, the problem of distribution shift, where the training distribution differs from the test distribution, is ubiquitous. For instance, when performing virtual screening for hit finding, the prediction model is typically trained on known target proteins. However, a "black swan" event like COVID-19 can occur, resulting in a new target with an unseen data distribution, and the performance on the new target will significantly degrade. To handle the performance degradation caused by distribution shift, it is essential to develop robust and generalizable algorithms for this challenging setting in AIDD. Despite its importance in real-world problems, curated OOD datasets and benchmarks that address generalization in AI-aided drug discovery are currently lacking.

Another essential issue in the field of AI-aided drug discovery is label noise. AI models are typically trained on public datasets, such as ChEMBL, whereas the bioassay data in such datasets are often noisy (Kramer et al., 2012; Cortés-Ciriano & Bender, 2016). For example, the activity data provided in ChEMBL are extracted manually from full-text articles in seven medicinal chemistry journals (Mendez et al., 2019). Various factors can cause noise in the data provided in ChEMBL, including but not limited to different confidence levels for activities measured through experiments, unit-transcription errors, repeated citations of single measurements and different "cut-off" noise. Figure 1 shows examples with different noise levels.

Figure 1: DrugOOD provides large-scale, realistic, and diverse datasets for drug AI OOD research. Specifically, DrugOOD focuses on the problem of domain generalization, in which we train and test the model on disjoint domains, e.g., molecules in a new assay environment. Top Left: based on the ChEMBL database, we present an automated dataset curator for customizing OOD datasets flexibly. Top Right: DrugOOD releases realized exemplar datasets spanning different domain shifts. In each dataset, each data sample (x, y, d) is associated with a domain annotation d. We use the background colours blue and green to denote the seen data and unseen test data. Bottom: examples with different noise levels from the DrugOOD dataset. DrugOOD identifies and annotates three noise levels (left to right: core, refined, general) according to several criteria; as the level increases, the data volume increases and more noise sources are involved.
Meanwhile, real-world data with noise annotations are lacking for learning tasks under noisy labels (Angluin & Laird, 1988; Han et al., 2020). To help accelerate research by focusing community attention and simplifying systematic comparisons between data collection and implementation methods, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on the most challenging OOD setting, the domain generalization problem in AI-aided drug discovery, though DrugOOD can be easily adapted to other OOD settings, such as subpopulation shift and domain adaptation (Zhuang et al., 2020). Our dataset is also the first AIDD dataset curator with realistic noise annotations, which can serve as an important testbed for the setting of learning under noise.

Notably, we present an automated dataset curator based on the large-scale bioassay deposition website ChEMBL (Mendez et al., 2019), in contrast to just providing a set of curated datasets. Figure 2 gives an overview of the automated dataset curator. Using this dataset curator, researchers and practitioners can generate new OOD datasets based on their specific needs by simply re-configuring the curation process, i.e., modifying the YAML files in the Python package. Specifically, we realize this dataset curator by generating 96 OOD datasets spanning various domains, noise annotations and measurement types. This mechanism comes with two advantages: i) it ensures that our released datasets and benchmarks are fully reproducible; ii) it allows great flexibility for future usage, since it is often difficult, even for domain experts, to agree on one specific configuration. As an example, when using EC50 as a measure of affinity, agreeing on a threshold for partitioning compounds into active/inactive pairs may be challenging. As OOD learning subsumes or is closely related to other learning settings with distribution shift, such as domain adaptation (Zhuang et al., 2020), transfer learning (Pan & Yang, 2009), and zero-shot learning (Romera-Paredes & Torr, 2015; Wang et al., 2019b), DrugOOD can also serve as a benchmark dataset or be used to generate datasets to study affinity prediction problems in AIDD under these learning settings.

The following components summarize our major contributions: 1. Automated dataset curator: we provide a fully customizable pipeline for curating OOD datasets for AI-aided drug discovery from the large-scale bioassay deposition website ChEMBL. 2. Rich domain annotations: we present various approaches to generate specific domains that are aligned with the domain knowledge of biochemistry. 3. Realistic noise annotations: we annotate real-world noise according to the measurement confidence score, "cut-off" noise, etc., offering a valuable testbed for learning under real-world noise. 4. Rigorous OOD benchmarking: we benchmark six SOTA OOD algorithms with various backbones on the 96 realized dataset instances and gain insight into OOD learning under noise for AIDD.

Paper Organization. Section 2 presents background and related work on AI-aided drug discovery, existing OOD algorithms, datasets, benchmarks and affinity prediction-related materials.
In Section 3, we provide details on the automated dataset curator with real-world domain and noise annotations. We present specifics on benchmarking SOTA OOD algorithms in Section 4. Section 5 gives implementation details and package usage guidelines. We present experimental results and corresponding discussions in Section 6. Lastly, Section 7 discusses and concludes the paper.

In this section, we review the current progress on binding affinity prediction problems, one of the most active research areas in AIDD. The performance of affinity prediction is often limited by OOD issues and noisy labels, which motivates us to propose the DrugOOD database to explicitly tackle such problems. Lastly, we summarize general methods for OOD and noisy labels, together with representation learning for affinity prediction in virtual screening, which are later used for benchmark tests on the DrugOOD datasets.

The ability of AI techniques has been dramatically boosted in various domains, mainly due to the widespread application of deep neural networks. We have witnessed a growing number of studies attempting to solve traditional problems in drug discovery with more advanced AI models. There have been several surveys (Sliwoski et al., 2013; Jing et al., 2018; Yang et al., 2019b; Paul et al., 2021; Deng et al., 2021; Bender & Cortés-Ciriano, 2021) summarizing recent advances and problems in this area, covering key aspects including major applications, representative techniques, and critical assessment benchmarks. As pointed out in (Yang et al., 2019b; Paul et al., 2021), most AI-driven applications can be roughly categorized into two domains, i.e., molecule generation and molecule screening. Molecule generation aims at adopting generative models to produce a large pool of candidate drug molecules with certain constraints satisfied (Simonovsky & Komodakis, 2018; Sanchez-Lengeling et al., 2017; Satorras et al., 2021). On the other hand, molecule screening attempts to identify the most promising molecule(s) based on a wide range of predicted properties (Yang et al., 2019a; Feinberg et al., 2020; Jiménez et al., 2018). Other typical applications of AI techniques in drug discovery include target identification (Zeng et al., 2020; Mamoshina et al., 2018), target structure prediction (Jumper et al., 2021; Baek et al., 2021), drug re-purposing (Aliper et al., 2016; Issa et al., 2021; Pham et al., 2021), and molecule retrosynthesis (Coley et al., 2017; Zheng et al., 2019; Chen et al., 2020).

For conducting virtual screening on candidate molecules, both target-independent (e.g., ADMET) and target-dependent (e.g., binding affinity) properties are critical. The former measure how likely the molecule itself is to qualify as a candidate drug; for instance, it should not induce severe liver toxicity in humans (Zhang et al., 2016; Asilar et al., 2020). The latter consider the tendency of the molecule's potential interaction with the target (and other unrelated proteins), which often heavily depends on the joint formulation of candidate molecule and target (Hu et al., 2016; Karimi et al., 2019; Lim et al., 2019). In this paper, we mainly concentrate on the binding affinity between molecule and protein target, which falls into the domain of predicting target-dependent properties.
In this circumstance, the out-of-distribution issue may result in severe performance degradation (e.g., when the target distribution dramatically differs between model training and inference), which constitutes the major motivation of this paper.

ChEMBL (Davies et al., 2015; Mendez et al., 2019) is a large-scale open-access database consisting of small molecules and their biological activity data. Such information is mainly extracted from medicinal chemistry journal articles, supplemented with data collected from approved drugs and clinical development candidates. It now contains over 2.1 million distinct compounds and 18.6 million records of their activities, which involve over 14,500 targets. BindingDB (Gilson et al., 2016) collects experimental interaction data between proteins and small molecules, primarily from scientific articles and US patents. BindingDB also gathers selected data entries from other related databases, including PubChem (Wang et al., 2009), ChEMBL (Mendez et al., 2019), PDSP Ki (Roth et al., 2000), and CSAR (Carlson & Dunbar Jr, 2011). Advanced search tools, hypothesis generation schemes (from targets to compounds and vice versa), and virtual compound screening methods are also integrated in the database. PDBbind (Liu et al., 2014) was created to collect biomolecular complexes from the PDB database (Burley et al., 2020), with experimental binding affinity data curated from the original reference papers. The latest release of PDBbind (version 2020) consists of over 23,000 biomolecular complexes, the majority of which are protein-ligand complexes (19,443) and protein-protein complexes (2,852); the remainder are mainly protein-nucleic acid and nucleic acid-ligand complexes.

It is often critical to access the target structure before estimating the affinity of candidate molecules, since the affinity is jointly determined by the interaction between molecule and target. Depending on the availability of known target structures, affinity prediction methods can be roughly divided into two categories: ligand-based and structure-based. Ligand-based affinity prediction methods are developed based on the hypothesis that structurally analogous compounds tend to have similar biological activities (Johnson & Maggiora, 1990). The ultimate goal is to identify promising compounds from a large candidate library, based on their similarities to known active compounds for a target of interest. Several approaches have been proposed to filter compounds based on chemical similarity measurements, e.g., Tanimoto coefficients (Kim & Skolnick, 2008) and the similarity ensemble approach (SEA) (Keiser et al., 2007). Such methods heavily rely on hand-crafted or learnt compound representations, describing various properties including molecular weight, geometry, volume, surface areas, ring content, etc. On the other hand, quantitative structure-activity relationship (QSAR) based approaches attempt to explicitly formulate the relationship between structural properties of chemical compounds and their biological activities (Kwon et al., 2019). Various machine learning techniques have been incorporated into QSAR-based affinity prediction, including linear regression (Luco & Ferretti, 1997), random forest (Svetnik et al., 2003), support vector machines (Zakharov et al., 2016), and neural networks (Burden & Winkler, 1999; Pradeep et al., 2016).
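To make the fingerprint-plus-QSAR recipe concrete, the sketch below pairs RDKit Morgan fingerprints with a scikit-learn random forest. It is a minimal illustration of the general approach surveyed above, not the pipeline used in this paper; the SMILES strings and labels are placeholders.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprint(smiles):
    """2048-bit Morgan (circular) fingerprint of radius 2, an ECFP4-like encoding."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# Tanimoto similarity between two compounds, as used in similarity filtering
fp_a = fingerprint("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
fp_b = fingerprint("Oc1ccccc1C(=O)O")         # salicylic acid
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))

# a minimal QSAR classifier: fingerprints -> random forest -> active/inactive
train_smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1O", "CCN(CC)CC(=O)Nc1ccccc1"]  # placeholders
train_labels = [1, 0, 1]                                                        # placeholders
X = np.array([list(fingerprint(s)) for s in train_smiles])
model = RandomForestClassifier(n_estimators=100).fit(X, train_labels)
x_new = np.array(list(fingerprint("CC(=O)Nc1ccc(O)cc1"))).reshape(1, -1)
print(model.predict_proba(x_new))  # predicted probability of being active
```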
In particular, multi-task neural networks (Dahl et al., 2014) alleviate the over-fitting issue by optimizing over multiple bioassays simultaneously, and were adopted to achieve the best performance in the Merck Molecular Activity Challenge. Despite the satisfying performance of ligand-based affinity prediction approaches in certain scenarios, they do not take target structures into consideration. However, the interaction between target and molecule is essential for accurately predicting the binding affinity, which has led more and more research to focus on structure-based affinity prediction.

In contrast to ligand-based approaches, structure-based methods (Lim et al., 2021) usually take structures of protein targets and/or protein-ligand complexes as inputs for affinity prediction. Some work (Wallach et al., 2015; Li et al., 2021b) predicts the binding affinity from experimentally determined protein-ligand co-crystal structures, but such data are highly expensive and time-consuming to obtain in practice. Others turn to computation-based docking routines (Trott & Olson, 2010; Koes et al., 2013; McNutt et al., 2021; Bao et al., 2021) to estimate protein-ligand complex structures through sampling and ranking, and then formulate the structure-affinity relationship via various models. Ballester & Mitchell (2010) propose the RF-Score approach, which uses random forest to implicitly capture binding effects based on a series of carefully designed hand-crafted features. 3D convolutional neural networks are adopted in (Stepniewska-Dziubinska et al., 2018; Jiménez et al., 2018), where protein-ligand complex structures are discretized into 3D voxels and then fed into the model for affinity prediction. However, such discretization fails to capture the rotational and translational invariance of 3D structures, and thus relies on heavy data augmentation to overcome this limitation. Graph neural networks are adopted in Jiang et al. (2021) to simultaneously formulate the intra-molecular and inter-molecular interactions, where nodes correspond to ligand/protein atoms, and edges are defined by both covalent and non-covalent linkages.

For most machine learning based approaches, it is usually desirable that the data distributions of the training and evaluation subsets are as close as possible. However, this often does not hold for affinity prediction tasks; e.g., the scaffolds of small molecules and/or the families of protein targets encountered during inference may be unseen throughout the model training process. Simply dividing the database into training and evaluation subsets on a per target-molecule basis may lead to over-optimistic performance estimates, which is unrealistic for real-world applications. Nonetheless, most databases of experimental binding affinities do not provide an official data split or a split-generation pipeline for model training and evaluation. To make things even more complicated, binding affinity annotations can be highly noisy, due to different experimental settings, affinity measurements, and confidence scores. Researchers need to manually pre-process raw data entries and re-organize them into a standard format, which is not only laborious and burdensome, but also unfavorable for a fair comparison against existing baselines. Therefore, we propose DrugOOD as a highly customizable curator for OOD datasets with noisy labels explicitly considered, so as to promote more efficient development of affinity prediction approaches in AIDD.
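As a concrete illustration of scaffold-aware splitting, the snippet below groups molecules by their Bemis-Murcko scaffold with RDKit; holding out whole scaffold groups, rather than random molecules, produces the harder and more realistic evaluation discussed above. This is a minimal sketch and differs in detail from DrugOOD's own splitter (Section 3).

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def group_by_scaffold(smiles_list):
    """Group molecules by their Bemis-Murcko scaffold SMILES."""
    groups = defaultdict(list)
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(smi)
    return groups

# molecules sharing a scaffold stay on the same side of the train/test split
groups = group_by_scaffold(["CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1C(=O)O", "CCO"])
print({scaffold: len(mols) for scaffold, mols in groups.items()})
```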
The out-of-distribution issue has attracted ever-growing research interest in recent years, due to its importance in improving the generalization ability of real-world applications. Several databases have been constructed with great emphasis placed on out-of-distribution generalization performance, mainly consisting of computer vision and natural language processing tasks. In Koh et al. (2021), the WILDS benchmark is proposed to reflect various levels of distribution shift that may occur in real-world scenarios. It considers two common types of distribution shift: domain generalization and sub-population shift. A total of 10 datasets are included, covering shifts across cameras for wildlife monitoring, hospitals for tumor identification, users for product rating estimation, and scaffolds for biochemical property prediction, etc. Sagawa et al. (2021) further extend this database to include unlabeled data for unsupervised domain adaptation. DomainBed (Gulrajani & Lopez-Paz, 2020) consists of 7 multi-domain image classification datasets, including Colored MNIST (Arjovsky et al., 2019), Rotated MNIST (Ghifary et al., 2015), PACS, VLCS (Fang et al., 2013), Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), and DomainNet (Peng et al., 2019). Furthermore, the authors point out the importance of the model selection strategy in the domain generalization task, and conduct thorough benchmark tests over 9 baseline algorithms and 3 model selection criteria. As it turns out, empirical risk minimization (ERM) (Vapnik, 1999) with a careful implementation achieves state-of-the-art performance across all datasets, even when compared against various domain generalization algorithms. Ye et al. (2021) analyze the performance comparison between ERM and domain generalization algorithms on DomainBed, and point out that distribution shift is composed of diversity shift and correlation shift, while existing domain generalization algorithms are only optimized towards one of them. They further propose additional datasets, of which WILDS-Camelyon17 is dominated by diversity shift, and NICO and CelebA are dominated by correlation shift.

As described above, general OOD databases are mostly built on image and text data, with one exception being the OGB-MolPCBA dataset from WILDS, which aims at predicting biochemical properties from molecular graphs. The distribution shift there is mainly caused by disjoint molecular scaffolds between the training and test subsets. This is indeed critical for accurate prediction of target-independent properties, but is still insufficient for affinity prediction, where the target information should be explicitly exploited. TDC (Huang et al., 2021), another concurrent AIDD benchmark, offers SBAP datasets collated from BindingDB with a temporal split by patent year between 2013-2021, which is still limited in scope for drug OOD problems. In contrast, DrugOOD covers comprehensive sources of distribution shift in affinity prediction, and provides a dataset curator for highly customizable generation of OOD datasets. Additionally, noisy annotations are taken into consideration, so that algorithms can be evaluated in a more realistic setting, which further bridges the gap between research and pharmaceutical applications.
In addition to the automated dataset curator in DrugOOD, we also provide rigorous benchmark tests over state-of-the-art OOD algorithms, with graph neural networks and BERT-like models used for representation learning from structural and sequential data. Next, we briefly review these OOD and representation learning algorithms; more detailed descriptions can be found in Section 4.

The out-of-distribution and noisy label issues have been extensively studied in the machine learning community, due to their importance in improving generalization ability and robustness. Here, we summarize recent progress in these two areas respectively, with a few overlaps, since some approaches are proposed to jointly tackle both issues in one shot. To improve model generalization over out-of-distribution test samples, some work focuses on aligning feature representations across different domains. The minimization of feature discrepancy can be conducted over various distance metrics, including second-order statistics (Sun & Saenko, 2016), maximum mean discrepancy (Tzeng et al., 2014) and Wasserstein distance, or measured by adversarial networks (Ganin et al., 2016). Others apply data augmentation to generate new samples or domains to promote the consistency of feature representations, such as Mixup across existing domains (Yan et al., 2020b), or in an adversarial manner (Qiao et al., 2020). With the label distribution further taken into consideration, recent work aims at enhancing the correlation between domain-invariant representations and labels. For instance, invariant risk minimization (Arjovsky et al., 2019) seeks a data representation such that the optimal classifier trained on top of this representation matches for all domains. Additional regularization terms have been proposed to align gradients across domains (Koyama & Yamaguchi, 2021), reduce the variance of risks across all domains (Krueger et al., 2021b), or smooth inter-domain interpolation paths (Chuang & Mroueh, 2021).

There is a rich body of literature trying to combat the label-noise issue, starting from the seminal work (Angluin & Laird, 1988) in traditional statistical learning to recent work in deep learning (Han et al., 2018b; Song et al., 2020a). Generally speaking, previous methods attempt to handle noisy labels from three main aspects (Han et al., 2020): training data correction (van Rooyen et al., 2015; van Rooyen & Williamson, 2017), objective function design (Azadi et al., 2016; Wang et al., 2017), and optimization algorithm design (Jiang et al., 2018; Han et al., 2018b). From the training data perspective, prior work (van Rooyen & Williamson, 2017) first estimates the noise transition matrix, which characterizes the relationship between clean and noisy labels, and then employs the estimated matrix to correct noisy training labels. Typical work in this line includes using an adaptation layer to model the noise transition matrix (Sukhbaatar et al., 2015), label smoothing (Lukasik et al., 2020), and human-in-the-loop estimation (Han et al., 2018a). Other work turns to the design of objective functions, which aims at introducing specific regularization into the original empirical loss function to mitigate the effect of label noise. In Azadi et al. (2016), the authors propose a group sparse regularizer on the response of the image classifier to force the weights of irrelevant or noisy groups towards zero.
Zhang et al. (2017) introduce Mixup as an implicit regularizer, which constructs virtual training samples by linear interpolation between pairs of samples and their labels. Such an approach not only regularizes the model to favor simple linear behavior, but also alleviates the memorization of corrupted labels. Similar work following this idea includes label ensembling (Laine & Aila, 2017) and importance re-weighting (Liu & Tao, 2015). From the view of optimization algorithm design, some work (Han et al., 2018b; Li et al., 2020) introduces novel optimization algorithms or scheduling strategies to address the label noise issue. One common approach in this line is to train a single neural network via small-loss tricks (Ren et al., 2018; Jiang et al., 2018). Besides, other work proposes to co-train two neural networks via small-loss tricks, including co-teaching (Han et al., 2018b) and co-teaching+.

As a fundamental building block of ligand- and structure-based affinity prediction, machine learning models for encoding molecules and proteins have attracted a lot of attention in the research community. Based on the input data and model type, existing studies can be divided into the following three categories.

Hand-crafted feature based backbones. Many early studies enlist domain experts to design features for molecules and proteins that contain rich biological and chemical knowledge, such as Morgan fingerprints (Morgan, 1965), circular fingerprints (Glen et al., 2006), and extended-connectivity fingerprints (Rogers & Hahn, 2010). These hand-crafted features are then fed to machine learning models to produce meaningful embeddings for downstream tasks. Typical models include logistic regression (Kleinbaum et al., 2002), random forest (Breiman, 2001), influence relevance voting (Swamidass et al., 2009), and neural networks (Ramsundar et al., 2015).

Sequence-based backbones. Since both molecules and proteins have their own sequential formats, SMILES (Weininger, 1988) and amino-acid sequences (Sanger, 1952), it is reasonable to utilize models that can natively deal with sequential data, including 1D-CNNs (Hirohara et al., 2018), RNNs (Goh et al., 2017), and BERT (Wang et al., 2019a; Brandes et al., 2021). Specifically, some studies (Xu et al., 2017; Min et al., 2021) introduce self-supervised techniques from natural language processing to generate high-quality embeddings. However, the 1D linearization of molecular structure highly depends on the traversal order of the molecular graph, which means that two atoms that are close in the sequence may be far apart, and thus uncorrelated, in the actual 2D/3D structure (e.g., the two oxygen atoms in "CC(CCCCCCO)O"); this hinders language models, which rely heavily on the relative positions of tokens, from learning effective representations.

Graph-based backbones. To eliminate the information loss in sequence-based models, some recent studies explore the more complex graph-based representation of molecules and proteins and utilize graph neural networks (GNNs) to produce embeddings that encode the information of chemical structures, such as GIN (Xu et al., 2018), GCN (Kipf & Welling, 2016), and GAT (Velickovic et al., 2018). Other than directly applying a vanilla GNN model as the backbone, several studies try to incorporate domain knowledge into the model design, such as ATi-FPGNN, NF (Duvenaud et al., 2015), Weave (Kearnes et al., 2016), MGCN, MV-GNN (Ma et al., 2020), CMPNN, MPNN (Gilmer et al., 2017), and DMPNN (Yang et al., 2019a).
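The SMILES-locality issue mentioned above is easy to verify with RDKit; in the illustrative check below, the two oxygen atoms of "CC(CCCCCCO)O" are adjacent as SMILES tokens yet eight bonds apart in the molecular graph.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(CCCCCCO)O")
o_indices = [atom.GetIdx() for atom in mol.GetAtoms() if atom.GetSymbol() == "O"]
dist = Chem.GetDistanceMatrix(mol)  # topological (shortest bond-path) distances
# adjacent in the SMILES string, yet 8 bonds apart in the 2D graph
print(dist[o_indices[0], o_indices[1]])  # -> 8.0
```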
Other than standard GNN models, several studies exploit Transformer-like GNN models to enhance the expressive power for training with a large number of molecules and proteins, such as GTransformer and Graphormer (Ying et al., 2021).

Figure 2: Overview of the automated dataset curator. We mainly implement three major steps based on the ChEMBL data source: noise filtering, uncertainty processing, and domain splitting. We have built in 96 configuration files to generate the realized datasets with the configuration of two tasks, three noise levels (core, refined, general), four measurement types (IC50, EC50, Ki, Potency), and five domains (assay, scaffold, size, protein, protein family).

We construct all the datasets based on ChEMBL (Mendez et al., 2019), a large-scale, open-access drug discovery database that aims to capture medicinal chemistry data and knowledge across the pharmaceutical research and development process. We use the latest release in the SQLite format: ChEMBL 29. Moreover, we consider the settings of OOD and different noise levels, which are inevitable problems when machine learning models are applied to the drug development process. For example, when predicting SBAP bioactivity in practice, the target protein used at model inference time could be very different from those in the training set and may not even belong to the same protein family. This real-world domain gap challenges the accuracy of the model. On the other hand, data used in the wild often carry various kinds of noise; e.g., activities measured through experiments often have different confidence levels and different "cut-off" noise. Therefore, it is necessary to construct datasets with varying levels of noise in order to better align with real scenarios.

In the process of drug development, many predictive tasks are involved. Here, we consider two crucial tasks from computer-aided drug discovery (Sliwoski et al., 2013): ligand based affinity prediction (LBAP) and structure based affinity prediction (SBAP). In LBAP, we follow common practice and do not involve any protein target information; this setting is usually used for activity prediction against one specific protein target. In SBAP, we consider both the target and drug information to predict the binding activity of ligands, aiming to develop models that can generalize across different protein targets.

ChEMBL contains the experimental results reported in previously published papers, organized by "assays". However, the activity values for different assays, and even for different molecules in the same assay, may have different accuracy and confidence levels. For example, assays that record the results of high-throughput screens (HTS) usually have less accurate activity values. Data with different accuracies constitute different noise annotations. Therefore, we set up filters with different stringency to screen for datasets with varying noise levels. Our filters consist of assay filters and sample filters, which screen for the required assays and samples, respectively. Specifically, we have built in 5 assay filters and 3 sample filters in advance. The details are as follows.

• Measurement Type: Selecting the assays with the required measurement types, e.g., IC50, EC50. Because measurement types differ greatly in meaning and are difficult to merge, we generate a separate dataset for each measurement type.
• Number of Molecules: Assay noise is strongly related to the number of molecules in an assay. For example, large assays are often derived from high-throughput screens, which are very noisy. Hence, we set different limits on the number of molecules for different noise levels.

• Units of Values: The units of activity values recorded in ChEMBL are chaotic, e.g., nM, %, None. The conversion between some units is easy, such as between nM and µM, but most units cannot be converted into one another.

• Confidence Score: Due to the complex settings of different experimental assays, it is sometimes uncertain whether the target that interacts with the compound is the designated target. The confidence score of an assay indicates how accurately the assigned target(s) represent(s) the actual assay target.

• Target Type: There are dozens of target types in ChEMBL, from 'single protein' to 'organism'. Different target types have different confidence levels. For SBAP tasks, the target type 'single protein' is more reliable than the others.

• Value Relation: In many cases, the precise activity value of a molecule cannot be obtained, and only a rough range is given. ChEMBL records qualitative values with different relations, e.g., '=', '>', '<', '>>'. Obviously, '=' means the value is accurate, and the other relations mean that it is not.

• Missing Value: Filtering out samples with any missing value.

• Legal SMILES: Filtering out samples with illegal molecules.

The configurations of each filter for the three noise levels are shown in Table 1 (in the table, the measurement type is specified by the user configuration, '-' denotes no restriction, and a check mark indicates a condition that needs to be met; "SP": single protein, "PC": protein complex, "PF": protein family). One can see that the three noise levels are annotated jointly by "Confidence Score", "Value Relation", "Number of Molecules" and "Target Type", which are shown in blue in the table. In Figure 1, we display some examples from the SBAP dataset, demonstrating the different noise levels.

As mentioned above, ChEMBL records many activity values with uncertainty: they are reported as above or below the highest or lowest concentration tested. Here, we follow the practice in pQSAR 2.0 (Martin et al., 2017) and offset them by 10-fold. Meanwhile, since the same molecule may be reported by different sources, the same molecule may also appear in multiple assays in ChEMBL. We call this phenomenon "multiple measurements". Following common practice (Hu et al., 2020), we average all the multiple measurements: for the LBAP task, we average all the activity values for the same molecule before the domain split, while for the SBAP task, we average the activity values of identical molecule-target pairs.
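A condensed sketch of this uncertainty processing is given below. The column names are hypothetical, and the offset direction is our reading of the pQSAR 2.0 convention ('>' values shifted up, '<' values shifted down by 10-fold); the actual DrugOOD implementation may differ.

```python
import pandas as pd

def preprocess_activities(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: offset qualitative values by 10-fold, then average
    multiple measurements of the same molecule (LBAP-style)."""
    df = df.copy()
    # '>x' means the true value exceeds x: offset upwards by 10-fold (assumed)
    df.loc[df["relation"] == ">", "value"] *= 10.0
    # '<x' means the true value is below x: offset downwards by 10-fold (assumed)
    df.loc[df["relation"] == "<", "value"] /= 10.0
    # a molecule reported in several assays gets one averaged value
    return df.groupby("smiles", as_index=False)["value"].mean()

# usage with hypothetical columns: smiles, relation ('=', '>', '<'), value
df = pd.DataFrame({"smiles": ["CCO", "CCO", "CCN"],
                   "relation": ["=", ">", "="],
                   "value": [100.0, 1000.0, 50.0]})
print(preprocess_activities(df))
```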
While ChEMBL records activity values as floating-point numbers, benchmarking OOD tasks as regression tasks is known to be extremely hard because of various sources of noise, such as uncertain measurements. Additionally, in the process of drug development, practitioners habitually consider whether a compound is active or inactive. A binary classification task is therefore more robust, and is sufficient for making decisions in drug development. However, in practice the threshold for binary classification depends on the specific circumstances of the drug development project. Here, we choose an adaptive threshold method that can adapt to a wide range of situations. In particular, the median value over all compounds in the generated dataset defines the threshold, but the range of allowed thresholds is fixed to 4 ≤ pValue ≤ 6, where pValue = −log10(Activity Value). If the median is outside this range, a fixed threshold of pValue = 5 is applied, which follows common practice in drug discovery (Mayr et al., 2018; Stanley et al., 2021). In this way, we keep the dataset as balanced as possible while keeping the generated tasks meaningful.
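The binarization rule amounts to only a few lines. The sketch below assumes activity values have already been converted to molar units, so that pValue = −log10(value) falls in the usual range; it is an illustration rather than the package's exact code.

```python
import numpy as np

def binarize(values_molar):
    """Adaptive-threshold binarization: median pValue, clamped to [4, 6]."""
    p_values = -np.log10(np.asarray(values_molar, dtype=float))
    threshold = np.median(p_values)
    if not 4.0 <= threshold <= 6.0:
        threshold = 5.0  # fall back to the fixed pValue = 5
    # compounds with pValue above the threshold are labeled active (1)
    return (p_values > threshold).astype(int), threshold

labels, thr = binarize([1e-7, 5e-6, 2e-5, 3e-8])  # molar activity values
print(labels, thr)  # median pValue 6.15 is out of range -> threshold 5.0
```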
As mentioned before, distribution shift is a common phenomenon in the drug development process. In order to make our benchmark better aligned with the needs of drug discovery and development, we consider the OOD setting in the benchmark. In drug research and development, when predicting the bioactivities of small molecules, we may encounter molecular scaffolds, sizes and so on that are very different from those in the model's training set. These differences may also be reflected in the target for the SBAP task. Hence, for the LBAP task, we consider the following three domains: assay, scaffold and molecule size. For SBAP tasks, in addition to the three domains mentioned above, we also consider two additional target-specific domains: protein and protein family. The user can also easily customize the domain through the configuration file and generate the corresponding dataset. The details of the five domains are as follows.

• Assay: Samples in the same assay are put into the same domain. Due to the great differences between assay environments, the activity values measured by different assays exhibit a large shift. At the same time, different assays involve very different types of experiments and test targets.

• Scaffold: Samples with the same molecular scaffold belong to the same domain. The molecular properties of different scaffolds are often quite different.

• Size: A domain consists of samples with the same number of atoms. As a result, we can test the model's performance on molecules that are quite different from those seen in training.

• Protein: In the SBAP task, samples with the same target protein are in the same domain, in order to test the model's performance on a never-seen-before protein target.

• Protein Family: In SBAP tasks, samples with targets from the same protein family are in the same domain. Compared with protein domains, there are far fewer protein family domains, albeit with greater differences between them.

Based on the generated domains, we need to split them into the training, OOD validation and OOD test sets. Our goal is to make the domain shifts between the training set and the OOD validation/test sets as significant as possible. This raises the questions of how to measure the differences between domains and how to sort them into the training and validation/test sets. Here, we design a general pipeline: first generate a domain descriptor for each domain, then sort the domains by their descriptors, and finally divide the sorted domains sequentially into the training set, OOD validation set and OOD test set (Figure 3(a)). Meanwhile, the number of domains in each split is controlled by the total number of samples in that split, and the proportion of samples is kept at approximately 6:2:2 for the training, validation and test sets. In the DrugOOD framework, we have built in the following two domain descriptor generation methods.

• Domain Capacity: The domain descriptor is the number of samples in the domain. In practice, we found that the number of samples in a domain can represent the characteristics of the domain well. For example, an assay with 5,000 molecules usually differs in assay type from an assay with 10 molecules; for SBAP, a protein family that contains 1,000 different proteins is very likely to be different from a protein family that contains only 10 proteins. This descriptor is applied to the assay, protein and protein family domains.

• Molecular Size: For the size domain, the size itself is already a good domain descriptor and can be used directly to sort the domains, which we found to effectively increase the generalization gap in practice. This descriptor is applied to the size and scaffold domains.

Generating in-distribution data subsets. After splitting the training, OOD validation and OOD test sets, we split the ID validation and ID test sets out of the resulting training set (Figure 3(b)). Following settings similar to WILDS, the ID validation and test sets are merged from randomly selected samples of each domain in the training set. The ratio of the number of samples in the OOD and ID validation/test sets can be easily modified through the configuration files. After this process, we obtain the final training, ID validation, ID test, OOD validation and OOD test sets (Figure 3(c)).
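Putting the pieces together, the sorted-domain split described above can be sketched in plain Python. The callables domain_key and descriptor stand in for the domain annotation and descriptor generation steps and are illustrative, not the DrugOOD API.

```python
from collections import defaultdict

def sorted_domain_split(samples, domain_key, descriptor, ratios=(0.6, 0.2, 0.2)):
    """Group samples into domains, sort domains by descriptor (largest first),
    then cut the sorted sequence into train / OOD-val / OOD-test at ~6:2:2."""
    domains = defaultdict(list)
    for s in samples:
        domains[domain_key(s)].append(s)
    ordered = sorted(domains.values(), key=descriptor, reverse=True)
    total = sum(len(d) for d in ordered)
    train, val, test, seen = [], [], [], 0
    for d in ordered:
        if seen < ratios[0] * total:
            train.extend(d)          # largest domains go to training
        elif seen < (ratios[0] + ratios[1]) * total:
            val.extend(d)            # mid-sized domains to OOD validation
        else:
            test.extend(d)           # smallest domains to OOD test
        seen += len(d)
    return train, val, test

# e.g., assay domains sorted by domain capacity (number of samples):
# train, val, test = sorted_domain_split(data, lambda s: s["assay_id"], len)
```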
The overview of the dataset curation pipeline is shown in Figure 2. It mainly includes four major steps: filtering data at different noise levels, processing uncertainty and multiple measurements, binarizing the classification task with an adaptive threshold, and domain splitting. We have built in 96 configuration files to generate different datasets with the configuration of 2 tasks, 3 noise levels, 4 measurement types, and 5 domains. With the DrugOOD dataset curator, users can easily obtain the required datasets by customizing the configuration files.

Here, we show some statistics of the datasets generated by our built-in configuration files. Table 2 and Table 3 show the statistics of domains and samples for the LBAP task and the SBAP task under the IC50 measurement type, respectively. We can see that as the number of samples increases, the noise level also increases. Meanwhile, at the same noise level, there are huge differences in the number of domains generated by different domain split methods, which challenges the applicability of OOD algorithms across different numbers of domains. In order to compare data volumes across measurement types, we count the samples of different measurement types at different noise levels, as shown in Figure 4. As we can see, the number of samples varies greatly across measurement types in ChEMBL. Meanwhile, different measurement types may also carry different noise levels. Our curator can generate datasets of specific measurement types according to the needs of specific drug development scenarios. More statistics of the DrugOOD datasets under different settings are summarized in Table 10 and Table 11.

Our benchmark implements and evaluates algorithms from various perspectives, including architecture design and domain generalization methods, to cover a wide range of approaches to addressing the distribution shift problem. We believe this is the first paper to comprehensively evaluate a large set of approaches under various settings for the DrugOOD problem.

It is known that the expressive power of a model largely depends on its network architecture. How to design network architectures with a better ability to fit the target function and better robustness to noise is an active question for the out-of-distribution problem. Based on the DGL-LifeSci package (Li et al., 2021a), we benchmark and evaluate several graph-based backbones, e.g., NF (Duvenaud et al., 2015) and GTransformer. For sequence-based inputs, we adopt BERT (Devlin et al., 2018) and ProteinBERT (Brandes et al., 2021) as feature extractors. The backbones are extended to regression and classification tasks by a readout function and an MLP layer. We use a standard model structure for each type of data: GIN (Xu et al., 2018) for molecular graphs and BERT (Devlin et al., 2018) for protein amino-acid sequences. In addition, the other models mentioned above were also used on some of the datasets to measure the effect of model structure on generalization ability.

Table 4: In-distribution (ID) vs. out-of-distribution (OOD) performance on datasets with the IC50 measurement type, trained with empirical risk minimization. The ID test sets are drawn from the same distribution as the training data, and the OOD test sets are distinct from the training data, as described in Section 3. We adopt the area under the ROC curve (AUROC) to estimate model performance; higher is better. In all tables in this paper, we report in parentheses the standard deviation over 3 replications, which measures the variability among replications. All datasets show performance drops due to distribution shift, with substantially better ID performance than OOD performance. More experimental results under different settings are shown in Table 12 in the Appendix. (Columns: In-dist, Out-of-Dist, Gap.)

Following the model selection strategy in WILDS, we use a distinct OOD validation set for model early stopping and hyper-parameter tuning. The OOD validation set is drawn from a distribution similar to that of the training set, which is distant from the OOD test set. For example, in the assay-based datasets, the training, validation and test sets each consist of molecules from distinct sets of assay environments. We detail the experimental protocol in Section 6. Table 4 shows that for each dataset with the IC50 measurement type, the OOD performance is always significantly lower than the performance in the corresponding ID setting.

In machine learning, models are commonly optimized by empirical risk minimization (ERM), which trains the model to minimize the average training loss. To improve model robustness under distribution shift, current methods tend to learn invariant representations that can generalize across domains. There are two main directions: domain alignment and invariant predictors. A common approach to domain alignment is to minimize the divergence of feature distributions from different domains under various distance metrics, such as maximum mean discrepancy (Tzeng et al., 2014; Long et al., 2015), adversarial loss (Ganin et al., 2016; Li et al., 2018), and Wasserstein distance (Zhou et al., 2020a). In addition, other conventional methods along this line of research adopt data augmentation. For example, Mixup (Zhang et al., 2017) proposes to construct additional virtual training data by convex combinations of both samples and labels from the original dataset.
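For reference, a minimal Mixup sketch in PyTorch is shown below. It operates on dense feature tensors; for molecular graphs, mixing is typically applied to hidden representations instead, and this snippet is an illustration rather than the benchmark's exact implementation.

```python
import torch

def mixup_batch(x, y, alpha=0.2):
    """Construct virtual samples as convex combinations of random pairs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]  # y as one-hot / soft labels
    return x_mixed, y_mixed
```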
Follow-up works apply a similar idea to generate more domains and enhance the consistency of features during training (Zhou et al., 2020b; Xu et al., 2020; Yan et al., 2020b; Shu et al., 2021; Wang et al., 2020; Yao et al., 2022), or synthesize unseen domains in an adversarial way to imitate challenging test domains (Qiao et al., 2020; Volpi et al., 2018). For learning invariant predictors, the core idea is to enhance the correlation between the invariant representation and the labels. Representatively, invariant risk minimization (IRM) (Arjovsky et al., 2019) penalizes feature representations for which different domains have different optimal linear classifiers, intending to find a predictor that performs well in all domains. Following IRM, subsequent approaches propose powerful regularizers that penalize the variance of risks across domains (Krueger et al., 2021a), adjust gradients across domains (Koyama & Yamaguchi, 2020), or smooth interpolation paths across domains (Chuang & Mroueh, 2021). An alternative to IRM is to combat spurious domain correlation, a core challenge in the sub-population shift problem, by directly optimizing the worst-group performance with distributionally robust optimization (Sagawa et al., 2019; Zhang et al., 2020; Zhou et al., 2021a), generating additional samples around the minority groups (Goel et al., 2020), re-weighting groups of various sizes (Sagawa et al., 2020), or adding regularization (Chang et al., 2020). We implement and evaluate the following representative OOD methods:

• ERM: ERM optimizes the model by minimizing the average empirical loss on the observed training data.

• IRM: IRM penalizes feature distributions for domains that have different optimal predictors (a sketch of the IRM penalty follows this list).

• DeepCoral: DeepCoral penalizes differences in the means and covariances of the feature distributions (i.e., the distribution of the last-layer activations in a neural network) across domains. Conceptually, DeepCoral is similar to other methods that encourage feature representations to have the same distribution across domains.

• DANN: Like IRM and DeepCoral, DANN encourages feature representations to be consistent across domains.

• GroupDro: GroupDro uses distributionally robust optimization methods to minimize worst-case losses, aiming to combat spurious correlations explicitly.
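As referenced in the list above, here is a minimal sketch of the IRMv1 penalty (Arjovsky et al., 2019): the squared gradient norm of the per-domain risk with respect to a fixed dummy classifier scale. This is a simplified illustration, not our benchmark code.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """Squared gradient of the domain risk w.r.t. a dummy scale of 1.0."""
    scale = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(risk, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

# per-domain training objective: ERM risk plus the weighted IRM penalty,
# e.g. loss = risk_d + penalty_weight * irm_penalty(logits_d, y_d)
```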
Table 5 summarizes the experimental results on the datasets with the IC50 measurement type, showing that the latest OOD algorithms exhibit no clear improvement over the simple ERM baseline. There may be several reasons for this: 1) molecular graph data differ from visual and textual inputs by nature, making it challenging to directly apply conventional strategies; 2) these algorithms are usually designed for datasets that contain enough data per domain, so they are difficult to apply directly to datasets that have a large number of domains but few samples per domain. Based on these results, there is a need for improved approaches to realistic DrugOOD problems. Additionally, current OOD research focuses almost exclusively on single-instance prediction tasks while overlooking multi-instance prediction tasks, and how to better handle the distribution shift over multiple instance domains (e.g., molecule and protein inputs in the SBAP task) remains an open problem. Lastly, while not explored in this paper, large-scale realistic datasets always come with non-negligible inherent noise, both aleatoric and epistemic (Lazic & Williams, 2021), and how to combine noise-aware learning with OOD generalization to boost the model's robustness and generalization at the same time is an important research direction.

DrugOOD develops a comprehensive benchmark for developing and evaluating OOD generalization algorithms. Different from other codebases, DrugOOD builds on the OpenMMLab project and has the following features. Customizable datasets: DrugOOD supports various data formats and provides the related processing and conversion tools. We provide 96 realized sub-datasets in advance; beyond these, users can specify additional conditions to easily customize new datasets from the original source data. Modular design: building on the design of OpenMMLab projects, we decompose the framework into different components, and one can easily construct new OOD algorithms by combining these modules. Support for multiple frameworks out of the box: the DrugOOD codebase directly supports various popular and contemporary OOD generalization algorithms, e.g., DeepCoral, IRM, DANN, and Mixup. With the above abstraction, our benchmark framework is illustrated in Figure 5. By simply creating new components and assembling existing ones, researchers can develop their approaches efficiently.

Figure 5: Overview of the DrugOOD benchmark. DrugOOD conducts a comprehensive benchmark for developing and evaluating OOD generalization algorithms for AIDD. After loading any of the datasets generated by the data curator, users can flexibly combine different types of modules, including algorithms, backbones, etc., to develop OOD generalization algorithms in a flexible and disciplined manner.

The DrugOOD package provides a simple, standardized dataset curator based on the large-scale bioassay deposition website ChEMBL (Mendez et al., 2019); by providing a modified curation file, researchers can easily re-configure the curation process. Specifically, we have provided 96 built-in configuration files for generating OOD datasets spanning various domains, noise annotations and measurement types. Listing 1 provides a simple example, which covers all of the steps of generating an OOD dataset from ChEMBL.

```python
# configure curation pipeline
curator = dict(
    ...
)
```

Code Listing 1: Dataset curation example

As shown in Listing 2, DrugOOD provides a flexible and uniform interface for building data pipelines, allowing users to easily and quickly adjust the experimental data flow. Meanwhile, standardized and automated evaluation on specified dataset partitions can be implemented with a few lines of code.

```python
dataset_type = 'MOL'
subset = 'lbap_ic50_core_assay'
data_prefix = 'data/drugood'
# data preprocessing pipeline
pipeline = [
    ...
]
```

DrugOOD also supports popular and contemporary OOD generalization algorithms out of the box. Users can easily configure different modules to construct and develop new OOD generalization algorithms effectively. We provide an example of building an algorithm in a few lines of code, as shown in Listing 3.

Code Listing 3: Algorithm configuration example

We present typical experimental results and the corresponding analysis in this section. More results and details are deferred to the Appendix. Precisely predicting the affinity scores of small molecules would greatly boost the process of drug discovery by reducing the need for costly laboratory experiments. However, the experimental data available for training such models is limited compared to the extremely diverse and combinatorially large universe of candidate molecules that we would want to make predictions on.
In this paper, we study the domain variation in experimental assays, molecular scaffolds, and molecule sizes between training and test molecules.

Problem definition. For the LBAP task, we study a domain generalization problem where the model needs to generalize to molecules from different domain splits. Aligned with the knowledge of biochemistry, we define the following three domains: assay, scaffold and molecular size. As an illustration, we treat the LBAP problem as a binary classification problem, where the input x is the graph of a small molecule, the label y is the ground truth (active or inactive) of the binary affinity classification, and d represents the domain identifier under one specific domain split.

Data info. As mentioned before, we preprocess the ChEMBL dataset and generate in total 36 exemplar datasets with varying noise levels, measurement types, and domain definitions. Each small molecule in each dataset is represented as a graph, where the nodes are atoms and the edges are chemical bonds. Following the pre-processing strategy of prior work, we preprocess the molecules via the RDKit package (Landrum, 2013). Input node features are 39-dimensional vectors including atomic symbol, hybridization, hydrogens and so on. Input edge features are 10-dimensional vectors including bond type, conjugation, ring membership and bond stereochemistry. Following the detailed splitting strategy in Figure 3 of Section 3.4, we split the dataset via three types of domain annotations.

Assay: Samples in the same assay are divided into the same domain. Due to the big differences in experimental environments and protein targets across assays, the bioactivity values measured by those assays exhibit a large shift. In this setting, we split the dataset along assays. This split provides a realistic estimate of model performance in prospective experimental settings, by separating different molecules into different experimental environments. We assign the assays that contain a large number of samples to the training set, and the assays with a small number of samples to the test set. After such an assignment, the domain shifts between the training set and the OOD validation/test sets become sufficiently large. The proportion of samples is kept at around 6:2:2 across the training, validation and test sets. We take the DrugOOD-lbap-core-ic50-assay dataset as an illustration here.

• Train: Contains in total 34,179 molecules from the largest 311 assays, with an average of 110 molecules per assay.
• Validation (ID): Contains in total 11,314 molecules from the same 311 assays as in the training set.
• Test (ID): Contains in total 11,683 molecules from the same 311 assay environments as in the training set.
• Validation (OOD): Contains in total 19,028 molecules from the next largest 314 assays, with an average of 60.6 molecules per assay.
• Test (OOD): Contains in total 19,302 molecules from the smallest 314 assays, with an average of 27.6 molecules per assay.

Figure 6 illustrates the analysis of the assay domain in the realized DrugOOD dataset. Figure 6(a) shows the statistics of the assays in terms of the number of molecules in each assay, which implies that the scale of the experiments is highly skewed, with the test set containing the assays with the fewest molecules. However, the difference in assay environments does not significantly change the statistics of the learning target in each split.
Figure 6 illustrates the analysis of the assay domain in the realized DrugOOD dataset. Figure 6(a) shows the statistics of the assays in terms of the number of molecules per assay: the scale of the experiments is highly skewed, and the test set contains the assays with the fewest molecules. However, the difference in assay environments does not significantly change the statistics of the learning target: as Figure 6 also shows, the label statistics remain very similar across the training/validation/test splits, indicating that the main distribution variation comes from differences in the detection environment.

Scaffold: The scaffold split has been widely used in previous benchmarks (Hu et al., 2020), which split datasets based on scaffold structure. Similarly, we assign the largest scaffolds to the training set and the smallest scaffolds to the test set, to ensure maximal diversity in scaffold structure between them. Taking the DrugOOD-lbap-core-ic50-scaffold dataset as an example, the split details are:

• Train: contains 21,519 molecules in total, from the largest 6,881 scaffolds, with an average of 3.12 molecules per scaffold.

In the same way, we plot the statistics of the scaffolds in terms of the size of each scaffold. As shown in Figure 7, we again observe that the distribution is highly skewed and that the test partition contains the smallest scaffolds.

Size: We put molecules with the same number of atoms into the same domain, and separate molecules of different atomic sizes into different subsets to simulate realistic distribution shift. We assign the molecules with the largest atomic sizes to the training set and those with smaller atomic sizes to the test set, to ensure sufficient variability of atomic size between the training and test data. Taking the DrugOOD-lbap-core-ic50-size dataset as an example, the details of the split are as follows:

• Train: the largest 190 size groups, with 36,597 molecules overall and an average of 192.61 molecules per group.
• Validation (OOD): the next largest 4 size groups (in addition to the ID data), with 17,660 molecules in total and an average of 4,415 molecules per group.
• Test (OOD): the smallest 18 size groups, with 19,048 molecules in total and an average of 1,058.22 molecules per group.

Evaluation. We evaluate model performance with the area under the receiver operating characteristic curve (AUROC), which indicates the ability of a classifier to distinguish between classes (e.g., inactive vs. active). The higher the AUROC, the better the model distinguishes the positive from the negative class. We also report the accuracy metric. We train a GIN model as the baseline on each dataset from scratch, with a learning rate of 1e-4, a batch size of 256 and no L2 regularization. Each small molecule is pre-processed with the RDKit package to generate 39-dimensional node features and 9-dimensional edge features as input. To avoid performance degradation caused by inappropriate hyper-parameters, and following the strategy in WILDS, we run a grid search over learning rates {0.00003, 0.0001, 0.0005, 0.001, 0.01} and batch sizes {64, 128, 256, 512, 1024}. We report results averaged over 3 random seeds.

ERM results and performance drops. As shown in Table 4, the baseline suffers a substantial drop in AUROC when the assay, scaffold or size split is used, suggesting that these splits are indeed harder than the conventional random split and can be used to estimate the realistic ID-OOD gap in the task of ligand-based affinity prediction. More experimental results on different DrugOOD LBAP datasets are shown in Table 12 of the Appendix. Table 7 shows the performance of other conventional generalization algorithms. For a fair comparison, all algorithms adopt the same backbone network. In addition, we run separate grid searches over algorithm-specific hyperparameters: IRM's penalty weight in {1, 10, 100, 1000} and penalty annealing iterations in {100, 500, 1000}; DeepCoral's penalty weight in {0.1, 1, 10}; GroupDRO's step size in {0.001, 0.01, 0.1}; DANN's inverse factor between 0.0001 and 1; and Mixup's probability and interpolation strength between 0.0001 and 1. A sketch of this two-level search appears below.
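A minimal sketch of the two-level search, assuming a hypothetical train_and_eval(cfg) helper that trains one model under a configuration and returns its OOD-validation AUROC; neither the helper nor the config keys are part of the released codebase.

```python
from itertools import product

# Shared optimization grid (used for all algorithms) plus one example of
# an algorithm-specific grid; both mirror the values quoted above.
shared_grid = {
    'lr': [0.00003, 0.0001, 0.0005, 0.001, 0.01],
    'batch_size': [64, 128, 256, 512, 1024],
}
irm_grid = {
    'penalty_weight': [1, 10, 100, 1000],
    'penalty_anneal_iters': [100, 500, 1000],
}

def grid_search(base_cfg, grid, train_and_eval):
    """Try every combination in `grid`, selecting by the OOD-validation
    AUROC returned by the hypothetical `train_and_eval` helper."""
    best_cfg, best_auc = None, float('-inf')
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = {**base_cfg, **dict(zip(keys, values))}
        auc = train_and_eval(cfg)
        if auc > best_auc:
            best_cfg, best_auc = cfg, auc
    return best_cfg, best_auc

# e.g., search the IRM-specific grid on top of the shared one:
# best_cfg, score = grid_search({'algorithm': 'IRM'},
#                               {**shared_grid, **irm_grid}, train_and_eval)
```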
As shown in Table 7, ERM almost always performs better than DeepCoral, IRM and GroupDRO, all of which use assay, scaffold or size as the domain, indicating that these existing methods cannot solve the DrugOOD problem. Moreover, similar to the findings of the WILDS benchmark, existing methods make it harder for the model to fit the training data: for instance, under the scaffold domain split, DeepCoral and IRM achieve 81.6% and 77.66% AUROC on the in-distribution validation set, respectively, while the ERM baseline reaches 94.84%. Also, these methods are primarily designed for the case where each group contains a decent number of examples, which is not the common case in drug development scenarios. Finally, even state-of-the-art OOD algorithms do not work well in the DrugOOD setting, suggesting that more robust methods need to be developed for the OOD problem on graph data.

For the LBAP task, three types of domains are defined, and we study model performance under these different domain splits. Figure 8 shows the performance degradation for the different domain partitions; the gap values are computed on the test set and averaged over all measurement types. From the figure we can conclude the following. 1) Among the splits, the size domain brings the largest performance degradation, which is consistent with everyday experimental findings: molecules of different sizes often have very different properties. 2) As the noise level of the dataset increases, the performance degradation across domains is somewhat mitigated; the additional data increases the generalization ability of the model to some extent. However, the degradation is not further mitigated as the data continues to grow, indicating that additional data brings limited improvement and that a truly effective method still needs to be developed to address the DrugOOD problem.

We also investigate ID and OOD performance at different noise levels. Table 8 summarizes the ID and OOD performance of different algorithms under the three noise levels. From the table, one can observe that: 1) as more noise is introduced, the resulting label pollution progressively degrades the network's performance; 2) a higher noise level also brings more data, which provides more information about the underlying distribution: from the core to the refined level we see a narrowing of the ID-OOD gap, although the improvement from the refined to the general level reaches a bottleneck; 3) combining these two points, the introduction of large amounts of noisy data hurts the learning of the model to a certain degree, but the noisy data in turn provide additional information that slightly improves the model's generalization ability.

6.2.1 Setup

Compared with the LBAP task, the SBAP task additionally considers target protein information. In our benchmark, we represent proteins as amino-acid sequences, because the ChEMBL database only provides protein sequence information. However, DrugOOD can easily be extended to incorporate 3D structural information of targets by referring to structure-depositing databases such as PDB and UniProt (Consortium, 2014); we leave this as important future work. We use a general SBAP prediction network that extracts molecular and protein features separately, concatenates them, and feeds them into a fully connected layer to predict interaction probabilities. For small-molecule feature extraction, we follow the LBAP setting and use the same network and hyperparameters. For the protein sequence, we use the pre-trained BERT model (Devlin et al., 2018), 'bert-base-uncased', to extract a 768-dimensional protein feature. The molecule and protein features are then concatenated and fed into a one-layer fully connected head to predict the interaction probability.
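The sketch below illustrates this two-branch architecture in PyTorch. `mol_encoder` stands in for the GIN backbone from the LBAP task, and the 300-dimensional molecule embedding is an assumed size; only the 768-dimensional BERT protein feature is taken from the description above.

```python
import torch
import torch.nn as nn

class SBAPPredictor(nn.Module):
    """Two-branch SBAP sketch: a molecule embedding and a fixed protein
    embedding are concatenated and passed through one fully connected
    layer to produce an interaction probability. `mol_encoder` stands in
    for the GIN backbone used in the LBAP task."""

    def __init__(self, mol_encoder, mol_dim=300, prot_dim=768):
        super().__init__()
        self.mol_encoder = mol_encoder              # GNN over the molecular graph
        self.fc = nn.Linear(mol_dim + prot_dim, 1)  # one-layer FC head

    def forward(self, mol_graph, prot_feat):
        mol_feat = self.mol_encoder(mol_graph)      # (batch, mol_dim)
        joint = torch.cat([mol_feat, prot_feat], dim=-1)
        return torch.sigmoid(self.fc(joint))        # interaction probability
```

Since the protein feature comes from a pre-trained BERT, it can be pre-computed once per target, leaving only the molecule branch and the head to train.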
The results of different algorithms on the sub-datasets with IC50 as the measurement type and protein as the domain are shown in Table 9.

Table 9: Baseline results on the DrugOOD-sbap-core-ic50-protein (first row), DrugOOD-sbap-refined-ic50-protein (second row) and DrugOOD-sbap-general-ic50-protein (third row) datasets. In-distribution (ID) results correspond to the train-to-train setting. Parentheses show the standard deviation across three replicates.

From the table, we can see that: 1) OOD performance is degraded relative to ID performance. On the validation set at the core noise level, the OOD performance of ERM degrades by 16.71% AUROC relative to the ID performance, and the drop expands to 21.84% on the test set. 2) OOD algorithms designed for computer vision tasks hardly work in SBAP scenarios. Algorithms designed for OOD scenarios can barely match the performance of ERM, which means that, to advance AI-aided drug discovery, it is necessary to design OOD algorithms that take the characteristics of drug development scenarios into account. Next, we analyze the experimental results from different aspects.

Different domain split methods may produce different distribution shifts and therefore pose different challenges to the algorithms. Here, we analyze the 5 built-in domain splits for the SBAP task. Figure 9 shows the performance gap under the different domain splits; we conduct the analysis across noise levels and algorithms, with the gap values calculated on the test set and averaged over all measurement types. From Figure 9, we can see that: 1) The noise level of the dataset has a great impact on the ID-OOD performance gap. On the one hand, each noise level has a different domain split that produces the biggest performance gap across all algorithms: at the core noise level, the 'protein family' domain brings the largest gap across all algorithms, whereas at the refined noise level it is 'size'. On the other hand, as the noise level increases, the performance gap gradually narrows; combined with Table 9, we observe that this is caused by the degradation of ID performance. 2) Among all domain split methods, the gap brought by the scaffold split is relatively small. We speculate that this is because the scaffold split induces more domains than the other split methods, which makes the training set cover a wider portion of the underlying distribution.
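For concreteness, the gap statistic used in Figures 8 and 9 amounts to the ID score minus the OOD score on the test set, averaged over all measurement types. A minimal sketch of this bookkeeping, using a hypothetical results table with placeholder numbers:

```python
import pandas as pd

# Hypothetical results table: one row per (domain, measurement type) run,
# with placeholder AUROC numbers purely for illustration.
results = pd.DataFrame({
    'domain':      ['assay', 'assay', 'size', 'size'],
    'measurement': ['IC50', 'EC50', 'IC50', 'EC50'],
    'auroc_id':    [0.90, 0.88, 0.91, 0.89],
    'auroc_ood':   [0.78, 0.75, 0.70, 0.68],
})

# ID-OOD gap on the test set, averaged over all measurement types
results['gap'] = results['auroc_id'] - results['auroc_ood']
print(results.groupby('domain')['gap'].mean())
```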
Here, we study the effect of different noise levels on ID and OOD performance. Figure 10 shows the ID and OOD performance of four algorithms under different noise levels; the values are averaged over all measurement types. ID performance generally decreases as the noise level increases: the introduced noise hurts the model despite the larger amount of data, which suggests that we need to pay more attention to the noise of the data source in realistic scenarios. How to design an algorithm that is robust to both noise and distribution shift is a worthy research direction. As for OOD performance, different domain splits show different trends as the noise level increases. However, we notice that under a particular domain, the OOD performance of the different algorithms behaves similarly to that of ERM, which indicates that existing OOD algorithms neither model the noise present in real-world data nor are designed to mitigate its impact.

Figure 11: Average ID (left) and OOD (right) performance for different measurement types. We present the performance across measurement types as the noise level (a) and the OOD algorithm (b) vary.

The automated dataset curator supports various measurement types, e.g., EC50 and IC50. As different measurement types generate datasets with different distributions and noise, we analyze the performance of different measurement types by varying the noise levels and algorithms. The results are shown in Figure 11. One can see that: 1) ID and OOD performance can differ across measurement types; for example, performance on the Potency measurement is generally low both ID and OOD, possibly due to the high noise level of Potency measurements. 2) The benchmarked algorithms are robust to measurement type, achieving acceptable accuracy for almost all types of measurement.

In this work we have presented an automated dataset curator and benchmark based on the large-scale bioassay deposition website ChEMBL, in order to facilitate OOD research for AI-aided drug discovery. Several directions are worth exploring further. As observed in the current benchmark results, existing general-purpose OOD methods do not significantly outperform the ERM baseline. Most of these OOD methods are designed and validated on visual and/or textual data, and may fail to capture the information critical to the affinity prediction problem. This implies that, to further improve performance under various out-of-distribution scenarios, it is essential to develop more advanced OOD methods, particularly ones that integrate drug-related domain knowledge. Another key characteristic of the DrugOOD database is that the majority of the data falls into the highest noise level ("general"). Simply discarding such noisy labels and relying only on high-quality ones may severely limit model performance due to insufficient training data. It is worth investigating whether large-scale unsupervised pre-training can be utilized to construct better representations of molecules and target proteins, which are critical to accurate affinity prediction. Additionally, learning with noisy labels has been extensively studied in the general context, but it may be crucial to take the generation process of noisy affinity annotations into consideration, including differences in experimental precision, measurement types, activity relation annotation types, etc. The data quality could be further improved with carefully designed denoising techniques, so that more accurate affinity prediction models can be trained.
References

Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data
Learning from noisy examples
Invariant risk minimization
Image based liver toxicity prediction
Auxiliary image regularization for deep CNNs with noisy labels
Accurate prediction of protein structures and interactions using a 3-track network
A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking
DeepBSP: a machine learning method for accurate prediction of protein-ligand docking structures
Recognition in terra incognita
Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet
ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv
Random forests. Machine Learning
Robust QSAR models using Bayesian regularized neural networks
RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences
A call to arms: what you can do for computational drug discovery
Invariant rationalization
Retro*: learning retrosynthetic planning with neural guided A* search
MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv e-prints
Fair mixup: Fairness via interpolation
Computer-assisted retrosynthesis based on molecular similarity
UniProt: a hub for protein information
How consistent are publicly reported cytotoxicity data? Large-scale statistical analysis of the concordance of public independent cytotoxicity measurements
Multi-task neural networks for QSAR predictions
ChEMBL web services: streamlining access to drug discovery data and utilities
Artificial intelligence in drug discovery: applications and techniques
Pre-training of deep bidirectional transformers for language understanding
Convolutional networks on graphs for learning molecular fingerprints
Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias
Improvement in ADMET prediction with multitask deep featurization
Domain-adversarial training of neural networks. The Journal of Machine Learning Research
Deep learning in protein structural modeling and design. Patterns
Domain generalization for object recognition with multi-task autoencoders
Neural message passing for quantum chemistry
BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology
Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME
Model patching: Closing the subgroup performance gap with data augmentation
SMILES2Vec: An interpretable general-purpose deep neural network for predicting chemical properties
In search of lost domain generalization
Network medicine framework for identifying drug-repurposing opportunities for COVID-19
Masking: A new perspective of noisy supervision
Co-teaching: Robust training of deep neural networks with extremely noisy labels
A survey of label-noise representation learning: Past, present and future
Deep self-learning from noisy labels
Towards non-IID image classification: A dataset and baselines
Convolutional neural network based on SMILES representation of compounds for detecting chemical motif
Large-scale prediction of drug-target interactions from deep representations
KinaseMD: kinase mutations and drug response database
Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv e-prints
Therapeutics Data Commons: machine learning datasets and tasks for therapeutics
Machine and deep learning approaches for cancer drug repurposing
InteractionGraphNet: A novel and efficient deep graph representation learning framework for accurate protein-ligand interaction predictions
MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels
KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks
Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era
Concepts and applications of molecular similarity
Highly accurate protein structure prediction with AlphaFold
DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks
Molecular graph convolutions: moving beyond fingerprints
Relating protein pharmacology by ligand chemistry
Assessment of programs for ligand binding affinity prediction
Semi-supervised classification with graph convolutional networks
Logistic regression
Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise
WILDS: A benchmark of in-the-wild distribution shifts
Out-of-distribution generalization with maximal invariant predictor
When is invariance useful in an out-of-distribution generalization problem?
The experimental uncertainty of heterogeneous public Ki data
Out-of-distribution generalization via risk extrapolation (REx)
Comprehensive ensemble in QSAR prediction for drug discovery
Temporal ensembling for semi-supervised learning
RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling
Quantifying sources of uncertainty in drug discovery predictions with probabilistic models
Deeper, broader and artier domain generalization
Domain generalization with adversarial feature learning
DivideMix: Learning with noisy labels as semi-supervised learning
DGL-LifeSci: An open-source toolkit for deep learning on graphs in life science
Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity
Predicting drug-target interaction using a novel graph neural network with 3D structure-embedded graph representation
A review on compound-protein interaction prediction methods: Data, format, representation and model
Classification with noisy labels by importance reweighting
PDB-wide collection of binding data: current status of the PDBbind database
Deep learning face attributes in the wild
Learning transferable features with deep adaptation networks
Molecular property prediction: A multilevel quantum interactions modeling perspective
QSAR based on multiple linear regression and PLS methods for the anti-HIV activity of a large group of HEPT derivatives
Does label smoothing mitigate label noise? In ICML
Multi-view graph neural networks for molecular property prediction
Machine learning on human muscle transcriptomic data for biomarker discovery and tissue-specific drug target identification
Profile-QSAR 2.0: Kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds
Large-scale comparison of machine learning methods for drug target prediction on ChEMBL
GNINA 1.0: molecular docking with deep learning
ChEMBL: towards direct deposition of bioassay data
Pre-training of deep bidirectional protein sequence representations with structural information
The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service
QSAR without borders
Can you teach old drugs new tricks?
A survey on transfer learning
Artificial intelligence in drug discovery and development
Moment matching for multi-source domain adaptation
A deep learning framework for high-throughput mechanism-driven phenotype compound screening and its application to COVID-19 drug repurposing
An ensemble model of QSAR tools for regulatory risk assessment
Drug repurposing: progress, challenges and recommendations
Learning to learn single domain generalization
Massively multitask networks for drug discovery
Learning to reweight examples for robust deep learning
Extended-connectivity fingerprints
An embarrassingly simple approach to zero-shot learning
Self-supervised graph transformer on large-scale molecular data
The multiplicity of serotonin receptors: uselessly diverse molecules or an embarrassment of riches?
Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization
An investigation of why overparameterization exacerbates spurious correlations
Extending the WILDS benchmark for unsupervised adaptation
Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC)
The arrangement of amino acids in proteins
E(n) equivariant normalizing flows for molecule generation in 3D
Automating drug discovery
SchNet: A continuous-filter convolutional neural network for modeling quantum interactions
Planning chemical syntheses with deep neural networks and symbolic AI
Open domain generalization with domain-augmented meta-learning
GraphVAE: Towards generation of small graphs using variational autoencoders
Computational methods in drug discovery
Learning from noisy labels with deep neural networks: A survey
Communicative representation learning on attributed molecular graphs
FS-Mol: A few-shot learning dataset of molecules
Development and evaluation of a deep learning model for protein-ligand binding affinity prediction
A deep learning approach to antibiotic discovery
Training convolutional networks with noisy labels
Deep CORAL: Correlation alignment for deep domain adaptation
Random forest: A classification and regression tool for compound classification and QSAR modeling
Influence relevance voting: an accurate and interpretable virtual high throughput screening method
AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading
Deep domain confusion: Maximizing for domain invariance
A theory of learning with corrupted labels
Learning with symmetric label noise: The importance of being unhinged
The nature of statistical learning theory
Graph attention networks
Deep hashing network for unsupervised domain adaptation
Generalizing to unseen domains via adversarial data augmentation
AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery
Chemical-reaction-aware molecule representation learning
SMILES-BERT: large scale unsupervised pre-training for molecular property prediction
A survey of zero-shot learning: Settings, methods, and applications
PubChem: a public information system for analyzing bioactivities of small molecules
Robust probabilistic modeling with Bayesian data reweighting
Heterogeneous domain generalization via domain mixup
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
MoleculeNet: a benchmark for molecular machine learning
Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism
How powerful are graph neural networks?
Adversarial domain adaptation with domain mixup
Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery
Improve unsupervised domain adaptation with mixup training
Analyzing learned molecular representations for property prediction
Concepts of artificial intelligence for computer-assisted drug discovery
Improving out-of-distribution robustness via selective augmentation
OoD-Bench: Benchmarking and understanding out-of-distribution generalization datasets and algorithms
Do transformers really perform bad for graph representation? arXiv preprint
How does disagreement help generalization against label corruption? In ICML
Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data
QSAR modeling and prediction of drug-drug interactions
Target identification among known drugs by deep learning from heterogeneous networks
In silico prediction of drug induced liver toxicity using substructure pattern recognition method
mixup: Beyond empirical risk minimization
Coping with label shift via distributionally robust optimisation
Maximum-entropy adversarial data augmentation for improved generalization and robustness
Predicting retrosynthetic reactions using self-corrected transformer neural networks
Examining and combating spurious features under distribution shift
Domain generalization with optimal transport and metric learning
Domain generalization via optimal transport with metric similarity learning
Deep domain-adversarial image generation for domain generalisation
A comprehensive survey on transfer learning

Table 10: List of the 36 LBAP datasets in DrugOOD. Pos # and Neg # denote the numbers of positive and negative data points, respectively; D # denotes the number of domains and C # the number of data points. Columns report Pos #, Neg #, Train, ID Val, ID Test, OOD Val and OOD Test.

Table 11: List of the 60 SBAP datasets in DrugOOD (e.g., DrugOOD-sbap-core-ic50-protein-family, DrugOOD-sbap-refined-ki-scaffold, DrugOOD-sbap-general-potency-size). Pos # and Neg # denote the numbers of positive and negative data points, respectively; D # denotes the number of domains and C # the number of data points. Columns report Pos #, Neg #, Train, ID Val, ID Test, OOD Val and OOD Test.

Table 12: In-distribution (ID) vs. out-of-distribution (OOD) performance on the DrugOOD LBAP datasets trained with empirical risk minimization. The ID test data are drawn from the same distribution as the training data, while the OOD test data are distinct from the training data.

Table 13: In-distribution (ID) vs. out-of-distribution (OOD) performance on the DrugOOD SBAP datasets trained with empirical risk minimization. The ID test data are drawn from the same distribution as the training data, while the OOD test data are distinct from the training data.