Domain Extrapolation via Regret Minimization
Wengong Jin, Regina Barzilay, Tommi Jaakkola
June 6, 2020

Many real prediction tasks such as molecular property prediction require the ability to extrapolate to unseen domains. Success in these tasks typically hinges on finding a good representation. In this paper, we extend invariant risk minimization (IRM) by recasting the simultaneous optimality condition in terms of regret, finding instead a representation that enables the predictor to be optimal against an oracle with hindsight access to held-out environments. The change refocuses the principle on generalization and does not collapse even with strong predictors that can perfectly fit all the training data. Our regret minimization (RGM) approach can be further combined with adaptive domain perturbations to handle combinatorially defined environments. We evaluate our method on two real-world applications, molecule property prediction and protein homology detection, and show that RGM significantly outperforms previous state-of-the-art domain generalization techniques.

Training data in many emerging applications is necessarily limited, fragmented, or otherwise heterogeneous. It is therefore important to ensure that model predictions derived from such data generalize substantially beyond where the training samples lie. For instance, in molecule property prediction [32], models are often evaluated under a scaffold split, which introduces structural separation between the chemical spaces of training and test compounds. In protein homology detection [29], models are evaluated under a protein superfamily split, where entire evolutionary groups are held out from the training set, forcing models to generalize across larger evolutionary gaps. The key technological challenge is to estimate models that can extrapolate beyond their training data.

The ability to extrapolate implies a notion of invariance to the differences between the available training data and where predictions are sought. A recently proposed approach known as invariant risk minimization (IRM) [4] seeks to find predictors that are simultaneously optimal across different such scenarios (called environments). Indeed, one can apply IRM with environments corresponding to molecules sharing the same scaffold [6] or proteins from the same family [12] (see Figure 1). However, this is challenging since, for example, scaffolds are substructure descriptors (combinatorially defined) and can often uniquely identify each example in the training set. Another difficulty is that IRM collapses to empirical risk minimization (ERM) if the model can achieve zero training error across the environments, a scenario typical with over-parameterized models [35].

To address these difficulties we propose a new method called regret minimization (RGM). This new approach seeks to find a feature-predictor combination that generalizes well to unseen environments. We quantify generalization in terms of a regret that guides the feature extractor φ, encouraging it to focus on information that enables generalization. The setup is easily simulated by using part of the training set as held-out environments E_e. Specifically, our regret measures how well the feature extractor φ enables a predictor f_{-e} ∘ φ, trained without E_e, to perform on E_e in comparison to an oracle f_e ∘ φ with hindsight access to E_e.
Since our regret measures the ability to predict, it need not collapse even with powerful models. To handle combinatorial environments, we appeal to domain perturbation and introduce two additional, dynamically defined environments. The perturbed environments operate over the same set of training examples, but differ in terms of their associated representations φ(·). The idea is to explicitly highlight to the predictor domain variability that it should not rely on.

Figure 1: Examples of combinatorial domains. Left: For molecules, each domain is defined by a scaffold (subgraph of a molecular graph). Middle: Proteins are hierarchically split into domains called protein superfamilies (figure adapted from [8]). Right: In a molecule property prediction task [32], there are over 1000 scaffold domains, 75% of which have a single example.

Our method is evaluated on both synthetic and real datasets such as molecule property prediction and protein classification. We illustrate on the synthetic dataset how RGM overcomes some of the IRM challenges. On the real datasets, we compare RGM with various domain generalization techniques including CrossGrad [30] and MLDG [21] as well as IRM. Our method significantly outperforms all these baselines, with a wide margin on molecule property prediction (COVID dataset: 0.654 versus 0.402 AUC; BACE dataset: 0.590 versus 0.530 AUC).

Domain generalization (DG) Unlike domain adaptation [10, 7], DG assumes samples from target domains are not available during training. DG has been widely studied in computer vision [17, 27, 20, 23, 24], where domain shift is typically caused by different image styles or dataset bias [19]. As a result, each domain contains a fair amount of data and the number of distinct domains is relatively small (e.g., the commonly adopted PACS and VLCS benchmarks [13, 20] contain only four domains). We study domain generalization in combinatorially defined domains, where the number of domains is much larger. For instance, in a protein homology detection benchmark [15, 29], there are over 1000 domains defined by protein families. Our method is related to prior DG methods in two aspects:

• Simulated domain shift: Meta-learning based DG methods [21, 5, 22, 25, 11] simulate domain shift during training, for example by dividing the training domains into meta-training and meta-testing splits; our regret is likewise measured on held-out training environments.

• Domain-guided perturbation: CrossGrad [30] augments the training data with domain-guided perturbations of the inputs; our domain perturbation (§4) similarly relies on a domain classifier, but perturbs learned representations and measures regret on the perturbed environments.

Learning invariant representation One way of domain extrapolation is to enforce an appropriate invariance constraint over learned representations [28, 16, 4]. Various strategies for invariant feature learning have been proposed. They can be roughly divided into three categories:

• Domain adversarial training (DANN) [16] enforces the latent representation Z = φ(X) to have the same distribution across different domains (environments) E. If we denote by P(X|E) the data distribution in environment E, then we require P(φ(X)|E_i) = P(φ(X)|E_j) for all i, j. With some abuse of notation, we can write this condition as Z ⊥ E. A single predictor is learned based on Z = φ(X), i.e., all the domains share the same predictor. As a result, the predicted label distribution P(Y) will also be the same across the domains. This can be problematic when the training and test domains have very different label distributions [36].

• Conditional domain adaptation (CDAN) [26] instead conditions the invariance criterion on the label, i.e., P(φ(X), Y|E) = P(φ(X), Y) for all E. The formulation allows the label distribution to vary between domains, but the constraint becomes too restrictive when domains are combinatorially defined and many domains E have only one example (x_E, y_E) (Figure 1).
In this case, P(φ(X), Y|E) degenerates to a Dirac distribution δ(φ(x_E), y_E), and the constraint P(φ(X), Y) = δ(φ(x_E), y_E) would require the representation φ to map all x_E to the same vector within each class. As a result, CDAN (as well as DANN) requires each domain to have a fair number of examples in practice. Combes et al. [9] also analyze the limitations of CDAN in general, non-combinatorial domains.

• Invariant risk minimization (IRM) [4] requires that the predictor f operating on Z = φ(X) is simultaneously optimal across different environments. The associated conditional independence criterion is Y ⊥ E | Z. In other words, knowing the environment should not provide any additional information about Y beyond the features Z = φ(X). However, IRM tends to collapse to ERM when the model is over-parameterized and perfectly fits the training set (see §3). Moreover, when most of the domains E can uniquely specify X in the training set, E acts similarly to X and the IRM principle reduces to Y ⊥ X | Z, which is not a useful criterion for domain extrapolation. We propose to handle this issue via domain perturbation (see §4).

The IRM principle provides a useful way to think about domain extrapolation, but it does not work well with strong predictors. Indeed, zero training error reduces the IRM criterion to standard risk minimization, or ERM. The main reason for this collapse is that the simultaneous optimality condition in IRM is not applied in a predictive sense (as regret). To see this, consider a training set divided into environments E = {E_1, ..., E_n}, and let

$$L_e(f \circ \phi) = \sum_{(x,y) \in E_e} \ell\big(y, f(\phi(x))\big)$$

be the empirical loss of predictor f operating on feature representation φ, i.e., f ∘ φ, in environment E_e. The specific form of the loss ℓ depends on the task. IRM finds f and φ as the solution to the constrained optimization problem

$$\min_{\phi, f} \; \sum_{e} L_e(f \circ \phi) \quad \text{s.t.} \quad f \in \arg\min_{\bar f} L_e(\bar f \circ \phi) \;\; \forall e.$$

The key simultaneous optimality constraint can be satisfied trivially if the model achieves zero training error across the environments, i.e., ∀e: L_e(f ∘ φ) = 0. This setting is not uncommon with over-parameterized neural networks, even if the labels were set at random [35].

We can recast the simultaneous optimality constraint in IRM in terms of a predictive regret. This is analogous to one-step regret in online learning but cast here in terms of held-out environments. We calculate this regret for each held-out environment by comparing the losses of two auxiliary predictors that are trained with and without access to E_e. Specifically, we define the regret as

$$R_e(\phi) = L_e(f_{-e} \circ \phi) - L_e(f_e \circ \phi),$$

where the two auxiliary predictors are obtained from

$$f_{-e} = \arg\min_{f} \sum_{k \neq e} L_k(f \circ \phi), \qquad f_e = \arg\min_{f} L_e(f \circ \phi).$$

Note that the oracle predictor f_e is trained and tested on the same environment E_e, while f_{-e} is estimated from all the environments except E_e but evaluated on E_e. The regret is always non-negative since f_{-e} cannot beat the oracle. Note that, unlike in IRM, even when f_{-e} and φ are strong enough to ensure zero training loss on the environments they are trained on, i.e., ∀k ≠ e: L_k(f_{-e} ∘ φ) = 0, the combination may still generalize poorly to a held-out environment E_e, giving L_e(f_{-e} ∘ φ) > 0. In fact, the regret expresses a stronger requirement: f_{-e} should be nearly as good as the best predictor with hindsight, analogously to online regret. Note that R_e(φ) does not depend on the predictor f we are seeking to estimate; it is a function of the representation φ as well as the auxiliary pair of predictors f_{-e} and f_e. For notational simplicity, we suppress the dependence on f_{-e} and f_e.
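To make the regret concrete, the following minimal PyTorch-style sketch (our illustration, not the authors' released code) shows how R_e(φ) could be estimated on a batch from a held-out environment; `phi`, `f_minus_e` and `f_e` are assumed to be torch modules, `loss_fn` plays the role of ℓ, and the auxiliary predictors are assumed to have been fit as described above.

```python
import torch
import torch.nn as nn

def environment_loss(predictor: nn.Module, phi: nn.Module, batch, loss_fn):
    """Empirical loss L_e(predictor ∘ phi) on a batch from environment E_e."""
    x, y = batch
    return loss_fn(predictor(phi(x)), y)

def regret(phi: nn.Module, f_minus_e: nn.Module, f_e: nn.Module, batch_e, loss_fn):
    """R_e(phi) = L_e(f_{-e} ∘ phi) - L_e(f_e ∘ phi).

    f_minus_e was trained on all environments except E_e, while f_e is the
    oracle trained on E_e itself, so the difference measures how well the
    representation phi supports extrapolation to the held-out environment.
    """
    held_out_loss = environment_loss(f_minus_e, phi, batch_e, loss_fn)
    oracle_loss = environment_loss(f_e, phi, batch_e, loss_fn)
    return held_out_loss - oracle_loss
```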
The overall regret R(φ) = Σ_e R_e(φ) expresses our stated goal of finding a representation φ that facilitates extrapolation to each held-out training environment. Our RGM objective then balances the ERM loss against the predictive regret: the representation φ and predictor f are found by minimizing

$$\min_{\phi, f} \; \sum_{e} L_e(f \circ \phi) + \lambda \sum_{e} R_e(\phi). \qquad (4)$$

Algorithm 1 RGM training
1: for each training step do
2:   Randomly choose an environment e
3:   Sample a batch B_e from E_e
4:   Sample a batch B_{-e} from E \ {E_e}
5:   Compute L_e(f ∘ φ) on B_e and L_{-e}(f ∘ φ) on B_{-e}
6:   Compute L_{-e}(f_{-e} ∘ φ) on B_{-e} and L_e(f_{-e} ∘ φ) on B_e
7:   Compute L_e(f_e ∘ φ) on B_e
8:   Compute regret R_e(φ)
9:   Back-propagate gradients
10: end for

Figure 2: In the backward pass, the gradient of L_e(f_e ∘ φ) goes through a gradient reversal layer [16], which negates the gradient during back-propagation. Predictor f_{-e} is updated by a separate objective L_{-e}(f_{-e} ∘ φ) and its gradient does not impact φ. Predictor f is trained on all environments (omitted in the right figure due to space limit).

Optimization Our regret minimization (RGM) can be thought of as finding a stationary point of a multi-player game with several players: f and φ, as well as the auxiliary predictors {f_{-e}} and {f_e}. Our predictor f and representation φ find their best-response strategies by minimizing Σ_e L_e(f ∘ φ) + λ Σ_e R_e(φ), assuming that {f_{-e}} and {f_e} remain fixed. The auxiliary predictors minimize their own objectives, L_{-e}(f_{-e} ∘ φ) = Σ_{k ≠ e} L_k(f_{-e} ∘ φ) and L_e(f_e ∘ φ). The auxiliary objectives depend on the representation φ, but this dependence is not exposed to φ, reflecting an inherent asymmetry in the multi-player game formulation. The RGM game objective is solved via stochastic gradient descent. In each step, we randomly choose an environment E_e ∈ E and sample a batch B_e = {(x_i, y_i)} from E_e. We also sample an associated batch B_{-e} from the other environments E \ {E_e}. The regret terms for φ and f_e are implemented with a gradient reversal layer [16]. The setup allows us to optimize all the players in a single forward-backward pass operating on the two batches (see Figure 2).

Both our proposed regret minimization and IRM assume that the set of environments is given as input, provided by the user. The environments exemplify nuisance variation that needs to be discounted, so they play a critical role in determining whether the approach is successful. The setting becomes challenging when the natural environments are combinatorially defined. For example, in molecule property prediction, each environment is defined by a scaffold, which is a subgraph of a molecule (see Figure 1). Since scaffolds are combinatorial descriptors, they often uniquely identify each molecule in the training set. It is not helpful to create single-example environments, as the model would then treat any variation from one example to another as nuisance, unable to associate the nuisance primarily with scaffold variation. A straightforward approach to combinatorial or large numbers of environments is to cluster them into fewer, coarser sets and apply RGM over the coarse environments. For simplicity, we cluster the training environments E into just two coarse environments E_0, E_1. The advantage is that we only need to realize two auxiliary predictors {f_{-e}}, {f_e}, e ∈ {0, 1}, instead of |E| predictors. The construction of the coarse environments depends on the application (see §5).

Figure 3: a) The scaffold classifier g predicts the scaffold (subgraph of a molecule); b) Domain perturbation. The perturbed environment Ẽ_3 contains less scaffold information; c) RGM with domain perturbation. We introduce additional oracle predictors f_{e,δ} for the perturbed environments.
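Before turning to domain perturbation, the sketch below illustrates one way (under our own simplifying assumptions, not the released implementation) to realize the gradient reversal layer and a single RGM update on the two batches B_e and B_{-e}: the oracle loss L_e(f_e ∘ φ) trains f_e normally while φ receives the negated gradient, the held-out loss L_e(f_{-e} ∘ φ) shapes φ but not f_{-e}, and f_{-e} itself is updated only through L_{-e}(f_{-e} ∘ φ) with the representation detached.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grad_reverse(x):
    return GradReverse.apply(x)

def rgm_step(phi, f, f_minus_e, f_e, batch_e, batch_neg, loss_fn, lam, optimizer):
    """One simplified RGM update; optimizer holds the parameters of phi, f, f_{-e}, f_e."""
    (x_e, y_e), (x_n, y_n) = batch_e, batch_neg
    z_e, z_n = phi(x_e), phi(x_n)

    erm = loss_fn(f(z_e), y_e) + loss_fn(f(z_n), y_n)   # predictor f trains on all environments
    aux = loss_fn(f_minus_e(z_n.detach()), y_n)         # f_{-e} trains on E \ {E_e}; detach keeps phi out

    # Held-out loss L_e(f_{-e} ∘ phi): should shape phi but not update f_{-e}.
    for p in f_minus_e.parameters():
        p.requires_grad_(False)
    held_out = loss_fn(f_minus_e(z_e), y_e)
    for p in f_minus_e.parameters():
        p.requires_grad_(True)

    # Oracle loss L_e(f_e ∘ phi): f_e minimizes it; phi sees the reversed gradient,
    # so phi effectively minimizes the regret held_out - oracle.
    oracle = loss_fn(f_e(grad_reverse(z_e)), y_e)

    total = erm + lam * (held_out + oracle) + aux
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```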
Domain perturbation While using coarse environments is computationally beneficial, it clearly loses the ability to highlight the finer nuisance variation arising from scaffolds or protein families. To counter this, we introduce and measure regret on additional environments {Ẽ_e} created specifically to highlight fine-grained variation of scaffolds or protein families in an efficient manner. We define these additional environments via perturbations, as discussed in detail below. Both E_e and its associated perturbed environment Ẽ_e serve as held-out environments for the predictor f_{-e}. These give rise to regret terms relative to oracles that can fit specifically to each environment, now including Ẽ_e. These additional regret terms further drive the feature representation φ. The goal is to learn to generalize well to finer-grained variations of scaffolds or protein families that we may encounter at test time.

We propose additional environments through gradient-based domain perturbations. Specifically, for each coarse environment E_e, e ∈ {0, 1}, we construct another environment whose representations are perturbed:

$$\tilde{E}_e = \{(\phi(x) + \delta(x),\, y) : (x, y) \in E_e\}.$$

Note that E_e and Ẽ_e are defined over the same set of examples but differ in the representation that the predictors operate on when calculating the regret. The perturbation δ(x) is defined through a parametric scaffold (or protein family) classifier g(φ(x)). The associated classification loss is ℓ(s(x), g(φ(x))), where s(x) is the scaffold (or protein family) label of x (see Figure 3a). We define the perturbation δ(x) in terms of the gradient

$$\delta(x) = \alpha \, \nabla_{\phi(x)} \, \ell\big(s(x), g(\phi(x))\big),$$

where α is a step size parameter. This direction of perturbation creates a modified representation z̃ = φ(x) + δ(x) that contains less information about the scaffold (or protein family) than the original representation z = φ(x). The impact on the domain classifier output is illustrated in Figure 3b. Note that the variation between z and z̃ highlights how finer scaffold information remains in the representation; the associated regret terms then require that this variation does not affect the quality of prediction.

Integration with RGM We augment the RGM objective in Eq.(4) with two additional terms. First, the scaffold (or protein family) classifier g is trained together with the feature mapping φ to minimize the domain classification loss L_g(g ∘ φ) = Σ_x ℓ(s(x), g(φ(x))). Second, we add regret terms R_e(φ + δ) specific to the perturbed environments to encourage the model to extrapolate to them as well. The new objective for the main players f, φ, and g then becomes

$$\min_{\phi, f, g} \; \sum_{e} L_e(f \circ \phi) + \lambda \sum_{e} \big[R_e(\phi) + R_e(\phi + \delta)\big] + \lambda_g\, L_g(g \circ \phi),$$

where we have introduced a new oracle predictor f_{e,δ} = arg min_f L_e(f ∘ (φ + δ)) for the perturbed environment Ẽ_e, in addition to f_e for the original environment E_e (see Figure 3c). Note that f_{-e} minimizes a separate objective L_{-e}(f_{-e} ∘ φ) = Σ_{(x,y) ∈ E_{1-e}} ℓ(y, f_{-e}(φ(x))), which does not include the perturbed examples. Perturbations represent additional simulated test scenarios that we wish to generalize to. The training procedure is shown in Algorithm 2.

Remark While the perturbation δ is defined on the basis of φ as well as the classifier g, we do not include this dependence during back-propagation. We verified that incorporating this higher-order gradient would not improve our empirical results. Another subtlety in the objective is that φ is adjusted to also help the classifier g. In other words, the representation φ is in part optimized to retain information about molecular scaffolds or protein families. This encourages the perturbation to be meaningful and relevant to downstream tasks.
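As a concrete illustration of the gradient-based perturbation, here is a minimal sketch (ours, not the released code); the step size α and the use of a cross-entropy domain loss over class labels (as in the protein superfamily classifier) are assumptions. It takes one ascent step on the domain classification loss and returns z̃ = φ(x) + δ(x), with δ treated as a constant in line with the remark above on higher-order gradients.

```python
import torch
import torch.nn.functional as F

def domain_perturbation(z: torch.Tensor, g, s_label: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Return z_tilde = z + alpha * grad_z loss(g(z), s_label).

    Moving along the gradient of the domain-classification loss increases that
    loss, so z_tilde carries less scaffold / protein-family information than z.
    """
    z_det = z.detach().requires_grad_(True)       # compute the gradient around the current representation
    dom_loss = F.cross_entropy(g(z_det), s_label)  # ℓ(s(x), g(φ(x)))
    delta = alpha * torch.autograd.grad(dom_loss, z_det)[0]
    # delta is detached (no higher-order gradient); gradients from losses on
    # z + delta still reach phi through z itself.
    return z + delta.detach()
```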
Algorithm 2 RGM-DP training
1: for each training step do
2:   Randomly choose a coarse environment e ∈ {0, 1}
3:   Sample a mini-batch B_e = {(x_1, y_1), ..., (x_m, y_m)} from coarse environment E_e
4:   Sample another mini-batch B_{1-e} from environment E_{1-e}
5:   Compute the scaffold (or protein family) classification loss L_g(g ∘ φ) = Σ_i ℓ(s(x_i), g(φ(x_i)))
6:   Construct the gradient perturbation δ(x_i) = α ∇_{φ(x_i)} ℓ(s(x_i), g(φ(x_i)))
7:   Compute L_e(f ∘ φ) on B_e and L_{1-e}(f ∘ φ) on B_{1-e}
8:   Compute L_{1-e}(f_{-e} ∘ φ) on B_{1-e}, and L_e(f_{-e} ∘ φ) and L_e(f_{-e} ∘ (φ + δ)) on B_e
9:   Compute the oracle predictor losses L_e(f_e ∘ φ) and L_e(f_{e,δ} ∘ (φ + δ)) on B_e
10:  Compute the regrets for the coarse and perturbed environments: R_e(φ) and R_e(φ + δ)
11:  Back-propagate gradients and update model parameters
12: end for

We evaluate our method on three tasks. We first construct a synthetic task to verify the weakness of IRM and study the behaviour of RGM. Then we test our method on protein classification and molecule property prediction tasks where the environments are combinatorially defined. In both tasks, we test our method under two settings: 1) RGM combined with domain perturbation (named RGM-DP); 2) standard RGM trained on the coarse environments E_0, E_1 used in RGM-DP.

Baselines On the synthetic dataset, we mainly compare with IRM [4]. For the other two tasks, we compare our method with ERM (environments aggregated) and further domain extrapolation methods:

• DANN [16], CDAN [26] and IRM [4] seek to learn domain-invariant features. As mentioned in section 2, these methods require each domain to have a fair amount of data. Thus, they are trained on the coarse environments E_0, E_1 used in RGM-DP instead of the original combinatorial environments (i.e., molecular scaffolds and protein superfamilies).

• MLDG [21] simulates domain shift by dividing domains into meta-training and meta-testing sets.

• CrossGrad [30] augments the dataset with domain-guided perturbations of the inputs. Since it requires the input to be continuous, we perform the domain perturbation on the learned features φ(x) instead.

DANN, CDAN and IRM are trained on the coarse environments and are comparable to standard RGM. MLDG and CrossGrad are trained on the combinatorial environments and are comparable to RGM-DP.

Data We first compare the behavior of IRM and RGM on an inter-twinning moons problem [16], where the domain shift is caused by rotation (see Figure 4). The training set contains two environments, E_0 and E_20. For E_0, we generate a lower moon and an upper moon labeled 0 and 1 respectively, each containing 1000 examples.

Setup The feature extractor φ and predictor f are two-layer MLPs with hidden dimension 300 and ReLU activation. For RGM, we set λ = 0.5. For IRM, we use the official implementation from Arjovsky et al. [4] based on a gradient penalty. Both methods are optimized by Adam with ℓ_2 regularization weight λ_2 = 0.01, with which IRM performs the best on the OOD validation set.

Our results are shown in Figure 5. Our method significantly outperforms IRM (84.6% vs 74.7% with OOD validation). IRM test accuracy is close to ERM under the in-domain validation setting. IRM is able to outperform ERM under the OOD validation setting because OOD validation provides an additional extrapolation signal. For the ablation study, we train models with different ℓ_2 regularization weights and OOD validation so that the models reach zero training error. As shown in Figure 5 (right), IRM's test accuracy becomes similar to ERM's when λ_2 ≤ 10^{-3}, as its training accuracy reaches 100%. This shows that IRM collapses to ERM when the model perfectly fits the training set. In contrast, RGM test accuracy stays around 80.0% even when λ_2 = 10^{-4} and the training error is zero.
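For reference, the rotated two-moons environments can be generated along the following lines (a sketch assuming scikit-learn's make_moons; the noise level and random seed are our choices, and the test rotation angles are not specified here).

```python
import numpy as np
from sklearn.datasets import make_moons

def moons_environment(angle_deg: float, n: int = 2000, noise: float = 0.1, seed: int = 0):
    """Two-moons data rotated by angle_deg; the two moons carry labels 0 and 1."""
    X, y = make_moons(n_samples=n, noise=noise, random_state=seed)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return X @ rot.T, y

# Training environments: the original moons and a rotated copy (e.g., E_0 and E_20).
E0_x, E0_y = moons_environment(0.0)
E20_x, E20_y = moons_environment(20.0)
```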
Data The training data is a collection of pairs {(x_i, y_i)}, where x_i is a molecular graph and y_i is its property label (binary). The environment of each compound x_i is defined as its Murcko scaffold [6], which is a subgraph of x_i with side chains removed. We consider the following four datasets:

• Tox21, BACE and BBBP are three classification datasets from the MoleculeNet benchmark [32], which contain 7.8K, 1.5K and 2K molecules respectively. Following [14], we split each dataset based on molecular weight (MW). This setup is much harder than the commonly used random split, as it requires models to extrapolate to a new chemical space. The training set consists of simple molecules with MW < τ. The test set molecules are more complex, with MW > τ + 100. The validation set contains molecules with τ ≤ MW ≤ τ + 100. We set τ = 400 for Tox21 and BBBP, and τ = 500 for BACE (as BACE compounds have a larger molecular weight on average).

• COVID-19: During the recent pandemic, many research groups released experimental data on antiviral activities against COVID-19. However, these datasets are heterogeneous due to different experimental conditions. This requires our model to ignore spurious correlations caused by dataset bias in order to generalize. We consider three antiviral datasets from PubChem [1], Diamond Light Source [2] and Jeon et al. [18]. The training set contains 10K molecules from PubChem and 700 compounds from Diamond. The validation set contains 180 compounds from Diamond. The test set consists of 50 compounds from Jeon et al. [18], a different data source from the training set.

Model The feature extractor φ is a graph convolutional network [34] which translates a molecular graph into a continuous vector. The predictor f is an MLP that takes φ(x) as input and predicts the label. Since a scaffold is a combinatorial object with a large number of possible values, we train the environment classifier by negative sampling: for a given molecule x_i with scaffold s_i, we randomly sample n other molecules and take their associated scaffolds {s_k} as negative examples. Details of the model architecture are discussed in the appendix.

RGM setup For RGM-DP, we construct the two coarse environments E_0, E_1 as follows. On the COVID-19 dataset, E_1 consists of the 700 compounds from the Diamond dataset and E_0 is the PubChem dataset. The two coarse groups are created to highlight the dataset bias. For the other datasets, E_1 consists of molecules with τ − 50 < MW < τ and E_0 = E \ E_1. We set λ_g = 0.1 on all datasets.

Results Following standard practice, we report the AUROC score averaged across five independent runs. As shown in Table 1, our methods significantly outperform the other baselines (e.g., 0.654 vs 0.402 on COVID). On the COVID dataset, the difference between RGM and RGM-DP is small because the domain shift is mostly caused by dataset bias rather than scaffold changes. In contrast, RGM-DP shows a clear improvement over standard RGM on the BACE dataset (0.590 vs 0.532), since there the domain shift is caused by scaffold changes (i.e., complex molecules usually have much larger scaffolds).
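To illustrate the negative-sampling training of the scaffold classifier g mentioned above (spelled out in the appendix), here is a minimal sketch; the in-batch negatives, the two-layer projection and the 300-dimensional hidden size follow the appendix description, while the class and method names and the use of a batch-softmax cross-entropy are our own framing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaffoldClassifier(nn.Module):
    """Scores a compound embedding against candidate scaffold embeddings.

    g maps both the compound representation phi(x_i) and each scaffold
    representation phi(s_k) into a shared space; the probability of the
    correct scaffold is a softmax over inner products within the batch.
    """
    def __init__(self, hidden_dim: int = 300):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))

    def loss(self, mol_repr: torch.Tensor, scaf_repr: torch.Tensor) -> torch.Tensor:
        # mol_repr: [B, d] = phi(x_i); scaf_repr: [B, d] = phi(s_i), aligned row-wise.
        logits = self.g(mol_repr) @ self.g(scaf_repr).t()   # [B, B] scores g(phi(x_i))^T g(phi(s_k))
        targets = torch.arange(mol_repr.size(0), device=mol_repr.device)
        return F.cross_entropy(logits, targets)             # equals -sum_i log p(s_i | x_i, B) / B
```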
Data We evaluate our method on a remote homology classification benchmark used in Rao et al. [29]. The dataset consists of pairs {(x, y)}, where x is a protein sequence and y is its fold class. It is split by [29] into 12K sequences for training, 736 for validation and 718 for testing. Importantly, the provided split ensures that no protein superfamily appears in both training and testing. Each superfamily represents an evolutionary group, i.e., proteins from different superfamilies are structurally different. This requires models to generalize across large evolutionary gaps. In total, the dataset contains 1823 environments defined by protein superfamilies.

Model The protein encoder φ(x) = MLP(φ_1(x), φ_2(x)) contains two modules: φ_1(x) is a TAPE protein embedding learned by a pre-trained transformer network [29]; φ_2(x) is an LSTM network that embeds the associated protein secondary structures and other features. The predictor f is a feed-forward network that takes φ(x) as input and predicts its fold label. The environment classifier g also takes φ(x) as input and predicts the superfamily label of x (out of 1823 classes).

RGM setup For RGM-DP, we construct the two coarse environments E_0, E_1 as follows. E_1 contains all protein superfamilies that have fewer than 10 proteins, and E_0 = E \ E_1. The coarse environments are divided based on the size of the superfamilies because the validation set mostly contains protein superfamilies of small size. We set λ_g = 1 and λ = 0.1.

Results Following [29], we report the top-1 and top-5 accuracy in Table 1. For reference, the top-1 and top-5 accuracy of the TAPE transformer [29] are 21.0% and 37.0%. Our ERM baseline achieves better results as we incorporate additional features. The proposed RGM-DP outperforms all the baselines in both top-1 and top-5 accuracy. The vanilla RGM operating on coarse environments also outperforms the other baselines in top-1 accuracy. RGM-DP performs better than RGM because it operates on protein superfamilies and therefore receives a stronger extrapolation signal.

Conclusion In this paper, we propose regret minimization for domain extrapolation, which seeks to find a predictor that generalizes as well as an oracle that would have hindsight access to unseen domains. Our method significantly outperforms all baselines on both synthetic and real-world tasks.

Appendix In section 3, we discussed that IRM may collapse to ERM when the predictor is powerful enough to perfectly fit the training set. Under some additional assumptions, we can further show that the ERM-optimal predictor f is optimal for IRM even if the model has non-zero training error. In particular, we assume that the conditional distribution p(y|x) is environment-dependent and that the environment e can be inferred from x alone, i.e.,

$$p(x, y, e) = p(e)\, p(x|e)\, p(y|x, e), \qquad p(y|x, e) = p(y|x, e(x)).$$

For molecules and proteins, the second assumption is valid because the environment labels (scaffold, protein family) can be inferred from x. Under this assumption, the IRM objective can be rephrased so that the ERM-optimal predictor already satisfies the simultaneous optimality constraint.

Model hyperparameters (synthetic task) Both the feature extractor φ and the predictor are two-layer MLPs with hidden dimension 300. For ERM, IRM and RGM, we consider λ_2 ∈ {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}}. For IRM, the gradient penalty weight is set to 10000 as in [4]. For RGM, we consider λ ∈ {0.1, 0.5, 1.0}; λ = 0.5 works the best on the OOD validation set.

Data (molecule property prediction) The four property prediction datasets are provided in the supplementary material, along with the train/val/test splits. The sizes of the training, validation and test sets are listed in Table 2.
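For concreteness, the molecular-weight split used in the experiments (train: MW < τ, validation: τ ≤ MW ≤ τ + 100, test: MW > τ + 100) can be reproduced along these lines (a sketch assuming RDKit; the handling of unparsable SMILES is our choice).

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def mw_split(smiles_list, labels, tau=400.0):
    """Split (SMILES, label) pairs by molecular weight:
    train < tau, val in [tau, tau + 100], test > tau + 100."""
    train, val, test = [], [], []
    for smi, y in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:          # skip unparsable SMILES (our choice)
            continue
        mw = Descriptors.MolWt(mol)
        if mw < tau:
            train.append((smi, y))
        elif mw <= tau + 100:
            val.append((smi, y))
        else:
            test.append((smi, y))
    return train, val, test
```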
Model hyperparameters (molecule property prediction) For the feature extractor φ, we adopt the GCN implementation from Yang et al. [34] and use their default hyperparameters across all the datasets and baselines. Specifically, the GCN contains three convolution layers with hidden dimension 300. The predictor f is a two-layer MLP with hidden dimension 300 and ReLU activation. The model is trained with the Adam optimizer for 30 epochs, with batch size 50 and a learning rate linearly annealed from 10^{-3} to 10^{-4}. The environment classifier g is an MLP that maps a compound or its scaffold to a feature vector. It is trained by negative sampling since the scaffold is a combinatorial object: for a given molecule x_i in a mini-batch B, we take the scaffolds {s_k} of the other molecules in the batch as negative examples. The probability that x_i is mapped to its correct scaffold s_i = s(x_i) is then defined as

$$p(s_i \mid x_i, B) = \frac{\exp\{g(\phi(x_i))^\top g(\phi(s_i))\}}{\sum_{k \in B} \exp\{g(\phi(x_i))^\top g(\phi(s_k))\}},$$

and the environment classification loss for a mini-batch B is $-\sum_i \log p(s_i \mid x_i, B)$. The classifier g is a two-layer MLP with hidden dimension 300 and ReLU activation. For RGM and RGM-DP, we consider λ ∈ {0.05, 0.1, 0.2} and λ_g ∈ {0.1, 1} and select the best hyper-parameters for each dataset; λ_g = 0.1 consistently works the best across all the datasets.

Data (protein homology detection) The protein homology dataset is downloaded from Rao et al. [29]. Each protein x is represented by a sequence of amino acids, along with predicted secondary structure labels, predicted solvent accessibility labels, and alignment-based features. For RGM-DP, the two coarse groups E_0, E_1 have 8594 and 3718 examples respectively.

The protein encoder is φ(x) = ReLU(W_1 φ_1(x) + W_2 φ_2(x)), where φ_1(x) is a 768-dimensional TAPE embedding given by a pre-trained transformer [29] and φ_2(x) is produced by a bidirectional LSTM that embeds the secondary structures, solvent accessibility and alignment-based features. The LSTM has one recurrent layer with hidden dimension 300. The predictor f is a linear layer, which worked better than an MLP under ERM. The environment classifier is a two-layer MLP whose hidden size is 300 and whose output size is 1823 (the number of protein superfamilies). The model is trained with the Adam optimizer for 10 epochs, with batch size 32 and a learning rate linearly annealed from 10^{-3} to 10^{-4}. For RGM and RGM-DP, we consider λ ∈ {0.01, 0.1} and λ_g ∈ {0.1, 1}; λ = 0.1 and λ_g = 1 work the best on the validation set.
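A sketch of the protein encoder φ(x) = ReLU(W_1 φ_1(x) + W_2 φ_2(x)) described above (ours; the per-residue feature dimension and the mean-pooling of the BiLSTM outputs are assumptions, while the 768-dimensional TAPE embedding and the single-layer, 300-unit BiLSTM follow the text).

```python
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    """phi(x) = ReLU(W1 * phi1(x) + W2 * phi2(x)).

    phi1(x): 768-d pre-computed TAPE embedding of the sequence.
    phi2(x): pooled output of a single-layer BiLSTM over per-residue features
             (secondary structure, solvent accessibility, alignment features).
    """
    def __init__(self, residue_feat_dim: int, hidden_dim: int = 300, tape_dim: int = 768):
        super().__init__()
        self.lstm = nn.LSTM(residue_feat_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.w1 = nn.Linear(tape_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)

    def forward(self, tape_emb: torch.Tensor, residue_feats: torch.Tensor) -> torch.Tensor:
        # tape_emb: [B, 768]; residue_feats: [B, L, residue_feat_dim]
        out, _ = self.lstm(residue_feats)
        phi2 = out.mean(dim=1)          # pool over residues (pooling choice is ours)
        return torch.relu(self.w1(tape_emb) + self.w2(phi2))
```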
References

• National Center for Biotechnology Information. PubChem database. Source: The Scripps Research Institute Molecular Screening Center.
• SARS-CoV-2 main protease structure and XChem fragment screen. Diamond Light Source.
• Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games.
• MetaReg: Towards domain generalization using meta-regularization.
• The properties of known drugs. 1. Molecular frameworks.
• A theory of learning from different domains.
• Learning protein sequence embeddings using information from structure.
• Domain adaptation with conditional distribution matching and generalized label shift.
• Learning from multiple sources.
• Domain generalization via model-agnostic learning of semantic features.
• The Pfam protein families database in 2019.
• Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias.
• Step change improvement in ADMET prediction with PotentialNet deep featurization.
• SCOPe: Structural classification of proteins-extended, integrating SCOP and ASTRAL data and classification of new structures.
• Domain-adversarial training of neural networks.
• Domain generalization for object recognition with multi-task autoencoders.
• Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs. bioRxiv.
• Undoing the damage of dataset bias.
• Deeper, broader and artier domain generalization.
• Learning to generalize: Meta-learning for domain generalization.
• Episodic training for domain generalization.
• Domain generalization with adversarial feature learning.
• Deep domain generalization via conditional invariant adversarial networks.
• Feature-critic networks for heterogeneous domain generalization.
• Conditional adversarial domain adaptation.
• Unified deep supervised domain adaptation and generalization.
• Domain generalization via invariant feature representation.
• Evaluating protein transfer learning with TAPE.
• Generalizing across domains via cross-gradient training.
• Generalizing to unseen domains via adversarial data augmentation.
• MoleculeNet: a benchmark for molecular machine learning.
• Analyzing learned molecular representations for property prediction.
• Understanding deep learning requires rethinking generalization.
• On learning invariant representation for domain adaptation.

Acknowledgments The authors would like to thank Tianxiao Shen, Adam Yala, Benson Chen, Octavian Ganea, Yujia Bao and Rachel Wu for their insightful comments. This work was supported by MLPDS, the DARPA AMD project and J-Clinic.

Broader Impact Among many benefits, the proposed algorithm advances the state of the art in drug discovery. As the current COVID pandemic illustrates, the lack of quality training data hinders the utilization of ML algorithms in the search for antivirals. This data issue is not specific to COVID, and is common in many therapeutic areas. The proposed approach enables us to effectively utilize readily available, heterogeneous data to model bioactivity, reducing the prohibitive cost and time associated with the traditional drug discovery workflow. Currently, the method is utilized for virtual screening of COVID antivirals. We cannot see negative consequences from this research: at worst, it will degenerate to the performance of the base algorithm that the model aims to improve upon. In terms of bias, the algorithm is explicitly designed to minimize the impact of nuisance variations on model prediction capacity.