Deep Extrapolation for Attribute-Enhanced Generation
Alvin Chan, Ali Madani, Ben Krause, Nikhil Naik
2021-07-07

Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation, focusing on natural language and proteins, and propose GENhance, a generative framework that enhances attributes through a learned latent space. Trained on movie reviews and a computed protein stability dataset, GENhance can generate strongly-positive text reviews and highly stable protein sequences without being exposed to similar data during training. We release our benchmark tasks and models to contribute to the study of generative modeling extrapolation and data-driven design in biology and chemistry.

1 Introduction
Deep generative neural networks can generate realistic data across data types, from sequences to images to time-series data, with applications in domains such as natural language processing (NLP), computer vision, and speech. Beyond these canonical domains, the scientific application of synthetic design of proteins, molecules, and materials can be cast as generative modeling of sequences, graphs, or images (Anand & Huang, 2018; De Cao & Kipf, 2018; Madani et al., 2020, 2021). Most often, the goal is to design or generate a sample that improves upon the attribute label of interest (Fig. 1, left), which we term attribute-enhanced generation. Examples include generating a protein sequence with higher binding affinity or a nanomaterial structure with a more energetically favorable state, as compared to all of the samples in the training distribution. In these scientific fields, traditional methods for synthetic object design with improved attributes are iterative and expensive, relying on labor- or compute-intensive methods (Bepler & Berger, 2021; Wu et al., 2021; Hie & Yang, 2021). Hence, deep generative models that can design new proteins, molecules, and materials with improved attributes have the potential to dramatically accelerate design research. Beyond scientific applications, extrapolation in generation has potential applications in NLP, such as reducing toxicity or operating in low-resource settings. It is, however, a well-known challenge for deep neural networks to generate samples beyond the training distribution (Arora et al., 2017; Radford et al., 2019; Xu et al., 2020).

In this work, we develop a method for extrapolation, particularly for sequences. Our approach, called GENhance, is designed to generate an enhanced sequence using a learned latent space. GENhance consists of a generator (sampler) and a discriminator (ranker) that are jointly trained to minimize generation and discrimination losses, regularized by latent vector smoothing and a cycle-consistency loss. We evaluate GENhance in two data domains. First, we use the Stanford Sentiment Treebank (SST), a natural language benchmark containing movie reviews with five discrete sentiment attributes (Socher et al., 2013), to show that GENhance generates strongly-positive reviews after training with no strongly-positive examples. Second, we use a computed protein stability dataset of ACE2 variants to show that GENhance generates protein sequences that are more stable than those in its training set.

2 Related Work
Generalization to Low Data Regimes: Previous approaches aim to generalize classification and regression to low-data settings.
Imbalanced classification methods upsample or downsample classes (Chawla et al., 2002; García & Herrera, 2009) or reweight the training cost function (Huang et al., 2016; Cao et al., 2019; Cui et al., 2019). Other approaches improve the generalization of regression models in extrapolation and interpolation of the continuous data domain by smoothing both the labels and features of the training data. Unlike prior work in this area, GENhance aims to generate samples in low/no data settings. Methods that can better generalize discriminators to these regions are complementary and orthogonal to our work.

Data-Driven Design: Data-driven design aims to learn a distribution over a high-dimensional input space that is optimized for a fitness function corresponding to a desirable property. Design methods often iterate between sampling from a generator and updating the generator to assign a higher probability to inputs that a discriminator predicts to have higher fitness (Bedbrook et al., 2019; Biswas et al., 2021; Mansouri Tehrani et al., 2018). Auto-focused oracles (Fannjiang & Listgarten, 2020) also adapt discriminators throughout this optimization process, re-weighting the training examples in the cost function to make the discriminator more reliable in the regions where the generator is more likely to generate. CbAS (Brookes et al., 2019) and DbAS (Brookes & Listgarten, 2018) use a fixed discriminator/oracle model and iteratively learn the distribution of inputs conditioned on a desirable property using importance sampling; CbAS is an improved version of DbAS that also re-weights samples based on how close they are to the original training data. We view these techniques as complementary, since GENhance proposes a model-specific architecture for optimizing attributes. Das et al. (2021) train a VAE to learn a latent space and use latent-space classifiers to sample latent vectors through rejection sampling, decoding them into sequences expected to have the target attribute/label. Hawkins-Hooker et al. (2021) also decode generations from a VAE by conditioning on latent vectors that correspond to the target attribute/label. Hoffman et al. (2020) seek to optimize molecular designs by using zeroth-order optimization on query-based predictions of candidate molecules' properties. Gómez-Bombarelli et al. (2018) build a Gaussian Process (GP) regression model trained on latent vectors to predict their inputs' labels and use gradient-based optimization on the GP to find sequences with target attributes. Compared with these previous works, the core difference in our approach is the combination of a cycle-consistency and a contrastive discriminatory objective to train the generator and discriminator as one model.

Our work is also related to controllable text generation, which aims to generate text that corresponds to a user-specified attribute (Kikuchi et al., 2016; Ficler & Goldberg, 2017). CTRL (Keskar et al., 2019) generates controlled, fluent texts through the use of control codes, which are meta-data prepended to the text during generation. Krause et al. (2020) use a generative discriminator, obtained by contrasting predictions from opposing control codes, to guide generation. CoCon (Chan et al., 2021) performs zero-shot controllable text generation without attribute labels. Ziegler et al. (2019) optimize language generation for desirable attributes via human-in-the-loop reinforcement learning.
Similarly to GENhance, PPLM (Dathathri et al., 2020) applies a discriminator on top of the latent space of a generative model to guide generation; however, GENhance uses an autoencoder rather than a language model. Lastly, text style transfer methods have used autoencoders with disentangled style latent representations (Shen et al., 2017; Hu et al., 2017; Yang et al., 2018). Unlike text style transfer and previous approaches toward controllable text generation, GENhance differs, aside from its model formulation, in that its goal is to optimize and extrapolate a particular attribute beyond the training distribution.

Our goal is to generate sequences with target attribute values that are better than those of the training data. Formally, assume that there is a ground-truth oracle ($O$) that maps each sample ($x \in \mathbb{R}^d$) to the target attribute value ($y \in \mathbb{R}$), i.e., $y = O(x)$. Given a dataset of oracle-labeled samples ($\mathcal{D}$), we aim to generate new sequences $\tilde{x}$ whose ground-truth attribute value is better than that of this dataset, i.e., $O(\tilde{x}) > \max_{(x, y) \in \mathcal{D}} y$. To generate samples that satisfy this criterion with high probability, we develop a sampling-ranking framework that consists of a sampler $S$ that proposes a pool of candidate sequences and a ranker $R$ that infers the relative scores of these candidates. First, we describe two baseline generation techniques that are natural choices for this task and build on them to develop our GENhance model.

The first baseline, Gen-Disc, is a rejection sampling approach that uses a generator model as the sampler $S$ and a separate discriminator model as the ranker $R$. The generator is trained to model the training distribution $p(x)$ through a language modeling objective, where it learns to auto-regressively (Manning et al., 1999; Bengio et al., 2003) construct the training samples, $\mathcal{L}_{\mathrm{LM}} = -\sum_{t=1}^{l} \log p(x_t \mid x_{<t})$, where $l$ is the length of the training sequence. The discriminator model is trained with an objective to predict the relative ranking of sequences from a pair of samples, based on their attribute. Given two training samples $(x_a, y_a)$ and $(x_b, y_b)$, a pairwise contrastive loss trains the discriminator to assign a higher score to the sample with the better attribute value, where $f_{\mathrm{disc}}$ denotes the discriminator, which outputs a scalar score value for each input sequence. We employ the contrastive loss term for this objective since it can be applied to both continuous- and discrete-labeled samples. After training, we sample candidate sequences from the generator model in an auto-regressive fashion and use the discriminator to rank the sequences according to the discriminator's output score values.

Figure 2: GENhance is an encoder-decoder framework with a latent space between the two. GENhance is trained to extrapolate beyond the training distribution of attributes by learning the latent space using a combination of contrastive, smoothing, and cycle-consistency losses, in addition to the reconstruction loss for autoregressive generation.

Traditional methods for data-driven design rely on iterative optimization of candidates with better attributes. To mimic this process, we design a second baseline that generates candidates using Metropolis-Hastings MCMC sampling from a population of better candidates. We start with an initial population of sequences sampled from the training set. In the sampling step, new candidates are proposed by making edits to samples from this initial population, scored with the ranker $R$, and compared with the previous population of candidates. The probability that new generations are kept in the population depends on the score predicted by $R$. The cycle of sampling and ranking repeats until terminated. The ranker $R$ takes the form of a neural network, identical to the discriminator model in the Gen-Disc setup.
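Both baselines, and later GENhance's encoder, train their ranker with a pairwise contrastive objective over scalar scores. Below is a minimal sketch, assuming a logistic pairwise form for the ranking loss and hypothetical batched score tensors; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_a: torch.Tensor, score_b: torch.Tensor,
                          a_is_better: torch.Tensor) -> torch.Tensor:
    """Logistic pairwise ranking loss for a scalar-scoring discriminator.

    score_a, score_b: shape (batch,), scores f_disc(x_a) and f_disc(x_b).
    a_is_better:      shape (batch,), 1.0 where y_a is the better attribute value, else 0.0.
    Implements -log sigmoid(s_better - s_worse) for each pair.
    """
    sign = a_is_better * 2.0 - 1.0                      # +1 if x_a should rank higher, -1 otherwise
    return F.softplus(-sign * (score_a - score_b)).mean()

# Hypothetical usage:
# loss = pairwise_ranking_loss(f_disc(x_a), f_disc(x_b), (y_a > y_b).float())
```

Because the loss only depends on score differences, the same sketch applies to both the discrete SST-5 labels and the continuous ddG labels used later.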
In the Gen-Disc framework, since the generator is trained only to model the training distribution, there is no direct way to steer the generation distribution in a particular direction beyond the training data. In the MCMC framework, it can be challenging to find desirable candidates with stochastic mutation operations, since the search space can be large and high-dimensional for many design problems (Kumar & Levine, 2019). To overcome these limitations, we propose using a learned latent space (Kingma & Welling, 2013) to control the attributes of generated sequences.

Architecture: GENhance is an encoder-decoder framework with a latent space between its encoder ($ENC$) and decoder ($DEC$) modules (Figure 2). The latent vector ($z \in \mathbb{R}^{d_z}$) of a particular input sequence ($x \in \mathbb{R}^{d_x}$) is the output of the encoder module, i.e., $z = ENC(x)$. In our experiments, $z$ is the hidden state at the location of a <cls> token (Devlin et al., 2018) that is prepended to the input sequence. Within the latent vector $z$, the representations relevant and irrelevant to the attribute of interest are stored in $z_{\|}$ and $z_{\perp}$ respectively, i.e., $z = [z_{\|}; z_{\perp}]$. To train the encoder to store information about the target attribute in $z_{\|}$, we train it with a contrastive objective that aims to learn which of two samples has the better value of the attribute, where $(x_a, y_a)$ and $(x_b, y_b)$ are a pair of training samples, each containing an input sequence $x$ and its label $y$, and $f_{\|}$ is an operation that maps $z_{\|}$ to a scalar value. Here we use a $z_{\|}$ of dimension 1, and $f_{\|}$ is simply an identity operation.

We train GENhance to generate sequences using an objective in which the decoder autoregressively reconstructs a sequence while conditioned on the latent vector $z$. For an input sequence $x$ of length $l$, parameterizing the $ENC$ with $\theta$ and the $DEC$ with $\psi$, we get the reconstruction loss $\mathcal{L}_{\mathrm{recon}} = -\sum_{t=1}^{l} \log p_{\psi}\big(x_t \mid x_{<t}, ENC_{\theta}(x)\big)$. To ensure that a perturbed latent vector still results in plausible generations, we include a smoothing objective, the deterministic Wasserstein autoencoder-maximum mean discrepancy (WAE-MMD) objective (Tolstikhin et al., 2017), to train the latent space, as it has been shown to be effective for discrete sequences. The WAE-MMD term (defined in Supplement A.1) penalizes divergence of the latent vectors $z$ from a target prior distribution $P_z$, which is a unit Gaussian in our case.

To help learn a better latent space and a stronger discriminator within GENhance, we propose a cycle-consistency learning objective ($\mathcal{L}_{\mathrm{cyc\text{-}con}}$) to train $ENC$ to correctly predict the relative rank between two reconstructed inputs. The intuition behind this objective is two-fold. First, since the discriminator ($ENC$) is used to rank generated sequences during inference, we can improve its performance on these synthetic sequences by also training the discriminator ($ENC$) on generated sequences ($\tilde{x}$) during the training phase. Second, by backpropagating the $\mathcal{L}_{\mathrm{cyc\text{-}con}}$ term through GENhance, the model can learn a latent space that generates sequences which are easy for the discriminator to rank accurately.
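A minimal sketch of how the cycle-consistency term could be computed is shown below, assuming hypothetical enc/dec callables that return the split latent and a decoded sequence. For simplicity the sketch stops gradients at the discrete decoding step, whereas the paper backpropagates the loss through GENhance, which requires a differentiable pathway not shown here.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(enc, dec, x_a, x_b, a_is_better):
    """L_cyc-con sketch: rank *reconstructed* sequences with the encoder.

    enc(x) -> (z_par, z_perp) and dec(z) -> reconstructed token sequence are
    hypothetical stand-ins for GENhance's ENC/DEC modules.
    """
    with torch.no_grad():
        x_a_rec = dec(torch.cat(enc(x_a), dim=-1))   # reconstruct x_a from its latent
        x_b_rec = dec(torch.cat(enc(x_b), dim=-1))   # reconstruct x_b from its latent
    s_a = enc(x_a_rec)[0].squeeze(-1)                # score = z_par of the reconstruction
    s_b = enc(x_b_rec)[0].squeeze(-1)
    sign = a_is_better * 2.0 - 1.0
    return F.softplus(-sign * (s_a - s_b)).mean()    # same pairwise ranking loss as before
```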
Combining all the training objectives, we can optimize with stochastic gradient descent to approximate the optimal parameters for GENhance: $\theta^*, \psi^* = \arg\min_{\theta, \psi} \big(\lambda_{\mathrm{contrast}} \mathcal{L}_{\mathrm{contrast}} + \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{smooth}} \mathcal{L}_{\mathrm{smooth}} + \lambda_{\mathrm{cyc\text{-}con}} \mathcal{L}_{\mathrm{cyc\text{-}con}}\big)$. To the best of our knowledge, this is the first instance of using cycle-consistency with a contrastive loss to train a generative model.

After training, we can sample candidates from GENhance's latent space and rank the generated samples with the scalar scores output by GENhance's $ENC$. First, we encode a training sample with $ENC$ to obtain a latent vector $z$. To obtain the latent vector encoding for a new candidate sequence with an improved attribute, we make a perturbation ($\Delta z_{\|}$) to the target attribute-aligned latent component $z_{\|}$. At the final step, GENhance's $DEC$ conditions on this perturbed latent vector ($z'$) to generate the improved candidate $\tilde{x}$, i.e., $\tilde{x} = DEC(z')$ with $z' = [z_{\|} + \Delta z_{\|}; z_{\perp}]$. The perturbation $\Delta z_{\|}$ is determined as the direction that increases $f_{\|}$'s score output, i.e., $\partial f_{\|} / \partial z_{\|}$. For a linear layer $f_{\|}$, this term is the weight of the layer, while in our case, where $f_{\|}$ is an identity operator, $\Delta z_{\|}$ is simply a scalar. After generating a pool of candidates with GENhance, we rank and filter out the top-scoring candidates with the GENhance $ENC$'s predicted score.

4 Experiments and Results

Dataset: The Stanford Sentiment Treebank-5 (SST-5) (Socher et al., 2013) contains movie reviews from Rotten Tomatoes, which are labeled with one of five ordinally increasing sentiment labels: 'Strong-Negative', 'Weak-Negative', 'Neutral', 'Weak-Positive', and 'Strong-Positive'.

Training: For both the Gen-Disc and MCMC models, we train the discriminator model by finetuning a publicly available (Wolf et al., 2019) pretrained T5-base encoder (Raffel et al., 2019). The generator modules of both Gen-Disc and GENhance are trained by finetuning the whole pretrained T5-base encoder-decoder model. The Gen-Disc generator is trained with a language modeling objective by feeding in an empty string as the encoder's input and minimizing the cross-entropy loss between the decoder's output tokens and the training sample's tokens through teacher forcing. For GENhance, the training samples are both fed in as the T5 encoder's input and used as the labels for the decoder's output for the reconstruction objective. Further details on training settings (on four NVIDIA A100 GPUs) are found in Supplement A.1 to A.3.

Evaluation: We generate 25,000 candidate sequences from each model and use their respective discriminator modules to rank the sequences into pools of top-100, top-1000, and top-10000 sequences. The percentage of candidates with target attributes ('Strong-Positive' & 'Weak-Positive') is computed using a ground-truth oracle model. In our experiments, we use a pretrained BERT-large (Devlin et al., 2018) model that is finetuned on the full SST-5 training set (including 'Strong-Positive' & 'Weak-Positive' samples) with a classification objective. This oracle model is trained with a batch size of 32 for 30 epochs and achieves an accuracy of 92.5% for strong-positive vs. neutral/negative classification. 'Neutral'-labeled SST-5 sequences are used as the initial sequences for the MCMC baselines and as the input sequences for the GENhance models. $\Delta z_{\|}$ perturbations of magnitude equal to 5% of the standard deviation of the training samples' $z_{\|}$ are used for all GENhance generations.
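A minimal sketch of the perturb-and-decode inference step described above, assuming hypothetical enc/dec callables and a scalar perturbation magnitude such as the 5% of the training $z_{\|}$ standard deviation used here:

```python
import torch

@torch.no_grad()
def perturb_and_generate(enc, dec, x, delta):
    """GENhance inference sketch: encode, nudge z_par upward, decode, score.

    enc/dec are hypothetical stand-ins for GENhance's ENC/DEC. With a 1-d z_par and an
    identity f_par, increasing z_par is the direction that increases the predicted score.
    """
    z_par, z_perp = enc(x)                             # z = [z_par ; z_perp]
    z_prime = torch.cat([z_par + delta, z_perp], dim=-1)
    x_tilde = dec(z_prime)                             # improved candidate
    score = enc(x_tilde)[0].squeeze(-1)                # re-encode to get the ranking score
    return x_tilde, score

# Hypothetical candidate pool and top-100 filtering:
# candidates = [perturb_and_generate(enc, dec, x, delta) for x in neutral_inputs]
# top_100 = sorted(candidates, key=lambda c: float(c[1]), reverse=True)[:100]
```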
We develop an additional performance metric, E[%SP], the expected percentage of 'Strong-Positive' generated sequences. The metric was developed to (i) have a statistically relevant measure with an expectation value and (ii) use the 'Strong-Positive' labels alone to maximize oracle label fidelity, as 'Strong-Positive' labels are almost perfectly distinguishable from the training labels of 'Neutral' and lower. It is computed with the following steps: a) randomly sample 1,000 of the 25,000 generations, b) filter out the top-100 candidates based on the discriminator's ranking, c) compute the percentage of 'Strong-Positive' sequences in the top-100 with the ground-truth oracle model, and d) repeat steps a) to c) for 100 rounds and average the percentage of 'Strong-Positive' sequences. As a proxy for text quality, we compute the perplexity value of each generation using a pretrained GPT-2 large model (Radford et al., 2019) and average these values across the top-K pools.
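The E[%SP] computation in steps a) to d) above can be sketched as follows; disc_score and oracle_is_strong_pos are hypothetical callables standing in for the model's own ranker and the BERT-large oracle:

```python
import random

def expected_pct_strong_positive(generations, disc_score, oracle_is_strong_pos,
                                 n_rounds=100, sample_size=1000, top_k=100):
    """E[%SP] sketch: bootstrap estimate of the % of 'Strong-Positive' top-ranked generations.

    generations:          list of generated sequences (25,000 in this setup).
    disc_score:           seq -> float, the model's ranker score (hypothetical callable).
    oracle_is_strong_pos: seq -> bool, ground-truth oracle decision (hypothetical callable).
    """
    percentages = []
    for _ in range(n_rounds):                                             # step d) repeat
        pool = random.sample(generations, sample_size)                    # step a)
        top = sorted(pool, key=disc_score, reverse=True)[:top_k]          # step b)
        pct = 100.0 * sum(oracle_is_strong_pos(s) for s in top) / top_k   # step c)
        percentages.append(pct)
    return sum(percentages) / len(percentages)
```

The protein-task metric E[min] introduced later follows the same bootstrap pattern, replacing the oracle percentage in step c) with the minimum FoldX ddG among the top-ranked candidates.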
To guide the MCMC substitution step (MCMC-T5), we use a pretrained T5-base model, since random token substitution would degrade fluency. During mutation, a span of 1 or 2 tokens is masked and the masked sequence is fed into the T5 model to generate a replacement. Finally, to further evaluate the generated text aside from the oracle model's scores, we conducted a human evaluation study to examine the positiveness and fluency of our text generations.

Results: GENhance outperforms all baselines and ablation variants on all % positive metrics (Tables 1 and 2). 49.7% of GENhance's generated sequences belong to the more challenging 'Strong-Positive' class, almost twice the percentage achieved by Gen-Disc, the strongest baseline. All models see performance drops in the % positive metrics for the No-Pos training setup as compared to the 200-Pos setup, except for MCMC-Random, which is significantly worse in both cases. This reflects the greater challenge of generating desirable candidates when there are no positive samples in the training data. GENhance also outperforms the other baselines in the top-1000 and top-10000 pools of candidates (see Supplement A.5). Figure 3 shows that the baselines and GENhance can generate more positive sequences than the training data, with GENhance showing the largest distribution shift towards candidates with enhanced positiveness. GENhance generations also have lower perplexity values (i.e., better text quality) than the baseline and ablation methods, except for Gen-Disc, which explicitly models the training distribution (more in Supplement A.5). In fact, the average GENhance perplexity value (118) is close to the perplexity of SST-5 test samples (101.3). According to the human evaluation (Table 4), GENhance succeeds in generating positive, fluent text that outperforms the baselines. In the setup where the models were exposed to 200 positive training samples (200-Pos), GENhance outperformed all baselines, including the 'Neutral' and 'Weak-Positive' training samples, in the positiveness of its text generations. For the setting where the models were not exposed to any positive samples (No-Pos), GENhance is comparable to the positive training samples in positiveness and outperforms all the other baselines. Likewise, the fluency of GENhance's generations either matches or outperforms that of the training samples and baselines.

Table 3: GENhance enhances 'Neutral' sequences from SST-5 to be 'Strongly-Positive', as seen in these generated samples. Positive words in blue and negative words in red.
Original Text (Attribute: 'Neutral') | Generated Text (Attribute: 'Strongly-Positive')
"A melancholy, emotional film." | "A melodramatic film, this is a powerful story."
"An ambitious and moving but bleak film." | "An ambitious and light-hearted film, it is a strong and moving story."
"Some stunning visuals -- and some staggeringly boring cinema." | "An engaging collection of fantastic visuals -- and yet at the same time stunningly striking."
"You'll laugh for not quite an hour and a half, but come out feeling strangely unsatisfied." | "You will laugh very well and laugh so much you will end up feeling great afterwards."
"A dark, dull thriller with a parting shot that misfires." | "A dark and compelling thriller that ends with a bitter, compelling punch."

Ablation Study: Both the latent smoothing and cycle-consistency objectives contribute to generating sequences with improved attributes and text quality. Without the cycle-consistency objective, we observe a drop in performance across all metrics, indicating that this objective is vital in helping the latent space and encoder generalize to sequences outside the training distribution. When the latent smoothing objective is removed, especially for the more challenging No-Pos setup, the generation quality drops, as indicated by the large increase in perplexity. This indicates that the smoothing objective is important in learning a latent space that is amenable to perturbations that control its attributes while maintaining generation quality.

Dataset: Designing a protein with an optimized property (e.g., stability) is of immense interest to synthetic biology and drug discovery. Here, we create a new synthetic dataset of stability for mutations of the human angiotensin-converting enzyme 2 (ACE2) protein. Since the SARS-CoV-2 virus binds to ACE2 to gain entry into human organs, ACE2 has emerged as a promising target for COVID-19 therapeutic protein design (Chan et al., 2020). Our optimization problem is to generate an ACE2-like protein sequence that is more stable than the samples in the training set. As a proxy for the experimentally measured stability of a protein sequence, we use the free energy calculation via FoldX (Schymkowitz et al., 2005), which provides an automated, computational oracle for testing extrapolation methods in silico. In particular, we measure the change in free energy from wild-type, ddG or ∆∆G, between the folded and unfolded states of a protein sequence with the known ACE2 structure. A lower ddG value indicates a more stable sequence. We mutate the N-terminus subdomain of ACE2. The protein is represented as a sequence of 83 amino acids starting from the N-terminus side, with a vocabulary of 20 amino acids. We curate 250K ACE2 variants by mutating the wild-type (natural) ACE2 subdomain through substitutions.

Figure 4: GENhance-generated ACE2-like sequences show the largest shift in ddG distribution (i.e., stability improvement) from the training set. Top-100 ranked generated sequences are shown. ddG values are binned with an interval of 1. Triangles denote the distributions' mean values.

Evaluation: For evaluation, we generate 250,000 sequences from each model while eliminating generations without the constant region (NTNITEEN) or with a different sequence length from the wild-type sequence. We then use the methods' respective discriminator modules to rank the candidates into pools of top-10, top-100, and top-1000 sequences. The top-K sequences' ddG values are then computed with the FoldX software, taking the average over five simulation runs. The top-5% most stable sequences are used as the initial sequences for MCMC and as the input sequences for GENhance. $\Delta z_{\|}$ perturbations of magnitude equal to 25% of the standard deviation of the training samples' $z_{\|}$ are used for all the GENhance models.
Following Fannjiang & Listgarten (2020), we also measure the percent chance of improvement ($\mathrm{PCI}_{y_\tau}$) over the best label (most negative ddG) in the training data. To have a statistically relevant metric, we developed the expected minimum ddG value (E[min]), which is computed by the following steps: a) randomly sample 10,000 of the 250,000 generations, b) filter out the top-10 candidates based on the discriminator's ranking, c) compute the ddG values of these top-10 candidates with the FoldX oracle software and take the minimum, and d) repeat steps a) to c) for 100 rounds and average the minimum ddG values across these rounds.

In addition to Gen-Disc and MCMC, we include CbAS (Brookes et al., 2019) and DbAS (Brookes & Listgarten, 2018) as baselines for comparison in this task. Both CbAS and DbAS use the same model architecture as the baseline Gen-Disc. We first sample generations from the baseline Gen-Disc's generator, then retrain the generator on the generations re-weighted by the discriminator's scores. We used the initial hyperparameters from CbAS and conducted a grid search over a) M, the number of generations per iteration (50, 100, 200), b) Q, the percentile threshold (75, 90), and c) the temperature of a sigmoid score-based weight computation (0.1, 1, 10), and report the best results for CbAS. The DbAS hyperparameter values mirror those of CbAS in our experiments.

Results: GENhance outperforms all baselines on all metrics (Table 5) in designing more stable sequences. GENhance sequences have the lowest mean ddG value, with a significant fraction of generations more stable than the most stable sequence in the training set, as indicated by the higher $\mathrm{PCI}_{y_\tau}$ values. GENhance also has the lowest E[min] value, indicating that it may be well-suited to finding stable protein candidates in laboratory experiments where only small numbers of candidates can be evaluated due to cost. Even though MCMC and CbAS fare better than the simpler baseline Gen-Disc setup, we observe that GENhance outperforms both baselines on all three metrics measured. The distribution of samples generated by GENhance shows the largest shift towards more stable sequences as compared to the original training distribution (Figure 4).

Ablation Study: Similar to the SST-5 experiments, GENhance outperforms its ablation variants on all metrics. Rather surprisingly, we observe a drop in performance when the latent smoothing objective is added, which we speculate is due to the tension between GENhance's reconstruction objective and its encoder's contrastive training objective. With the cycle-consistency objective, we see a boost to GENhance that outperforms the vanilla variant, indicating that this objective aids in stabilizing the convergence of these two training objectives. To further study their contribution to GENhance's performance, we use GENhance's encoder to rank sequences generated by the generator in the baseline Gen-Disc setup and observe a boost in performance (Supplement Table 12). This suggests that GENhance's superior performance is due to both more accurate ranking by its encoder and better generation by its decoder.

Discussion: There are two main features of GENhance that may contribute to its ability to generate attribute-enhanced sequences.
First, compared to the baselines, which mainly rely on the discriminator's predictions to filter out promising candidates, GENhance can additionally use its latent space to steer the overall distribution of generated candidates towards a target region (e.g., more stable or more positive sequences). Second, unlike the discriminators in the baselines, which were trained only on training samples, the encoder used in GENhance was also trained on GENhance-generated sequences through the cycle-consistency loss. This may contribute to GENhance's better ranking performance, increasing the fraction of desirable candidates.

In conclusion, we formalize the task of attribute-enhanced generation, which aims to create improved samples with target attributes beyond the training distribution. Scientific applications can include the design of proteins, materials, and molecules without expensive, iterative procedures for discovery. To achieve this, we proposed GENhance, a generative model with a trained latent space, which generates sequences that outperform both the training data and baseline methods in natural language and protein engineering tasks. In the future, we aim to expand GENhance to other data types beyond sequences and to study generation in scenarios where new data samples can be actively acquired. We also open-source our curated benchmark datasets with computational oracles, along with all models and evaluation metrics/scripts, to enable further research in extrapolation: https://github.com/salesforce/genhance.

Broader Impact: Extrapolation involves designing samples, whether text, proteins, molecules, or materials, that have attributes unseen in training. If our technique or a future iteration thereof is adopted broadly, care should be taken in terms of the end use-cases of these designed/optimized samples and their downstream effects to ensure safe, non-nefarious, and ethical applications. For projects in any domain, active oversight during the project initiation, experimental optimization, and deployment phases should be put in place to ensure safe usage and limitation of unintended harmful effects.

Supplement

In the WAE-MMD objective, $\{z_1, \ldots, z_n\} \sim P_z$ are samples from the target prior distribution $P_z$ and $\{\tilde{z}_1, \ldots, \tilde{z}_n\} \sim Q_z$ are the encoded latent vectors in GENhance (the $\mathrm{MMD}_k$ estimator is given at the end of this Supplement). For $k$, we use a random-feature Gaussian kernel with $\sigma = 14$ and a random feature dimension of 500.

The discriminator used in the MCMC algorithm is trained with a linear learning rate scheduler and a max learning rate of 1e-4. For SST-5, the discriminator is trained on the full training set, with a random 10% used as the validation set; it is trained with a learning rate of 1e-4 and a batch size of 16 for 25 epochs with the contrastive objective. For the ACE2 experiments, we train the discriminator model by finetuning a pretrained protein T5-base encoder on the full set of 250K sequences, with a random 10% used as the validation set. This discriminator is trained with a learning rate of 5e-6 and a batch size of 32 for 10 epochs with the contrastive objective.

The MCMC algorithm used in our experiments largely follows the setup in Biswas et al. (2021). The steps of the MCMC algorithm are:
1. Initialize: set the initial sequence as the state sequence $s$.
2. Propose a new sequence $s^*$ by editing $s$ (e.g., token/span substitution).
3. Compute the acceptance probability: $\min\big(1, \exp\big((\tilde{y}^* - \tilde{y})/T\big)\big)$, where $\tilde{y}^*$ and $\tilde{y}$ are the fitness scores of the proposed sequence $s^*$ and the state sequence $s$ respectively, and $T$ is the temperature. In our experiments, the fitness score is the predicted score from the discriminator model, i.e., $\tilde{y} = f_{\mathrm{disc}}(s)$.
4. If accepted, set the proposed sequence as the new state sequence: $s \leftarrow s^*$.
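A minimal sketch of this sampler is given below, assuming hypothetical propose and fitness callables and omitting the edit-distance constraints described next:

```python
import math
import random

def mh_mcmc(initial_seq, propose, fitness, n_iters=100, temperature=0.1):
    """Metropolis-Hastings MCMC sketch following steps 1-4 above.

    propose(s): returns a mutated copy of s (token/span substitution) -- hypothetical callable.
    fitness(s): discriminator score f_disc(s); higher is better.
    """
    s = initial_seq                                                # step 1: initialize state
    y = fitness(s)
    for _ in range(n_iters):
        s_new = propose(s)                                         # step 2: propose an edit
        y_new = fitness(s_new)
        accept_p = min(1.0, math.exp((y_new - y) / temperature))   # step 3: acceptance probability
        if random.random() < accept_p:                             # step 4: accept the proposal
            s, y = s_new, y_new
    return s
```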
For SST-5, we use a temperature T = 0.1 and 100 iterations. Any sequence candidate with a Levenshtein distance larger than 30% of the length of the original text sequence is not accepted into the population pool. For the ACE2 experiments, we use a temperature T = 0.1 and 1000 iterations. Any sequence candidate with an edit distance larger than 18 (vs. the wild-type sequence) is not accepted into the population pool. The MCMC mutation rate, determined by the value of the temperature, relates to how often the initial sequence is replaced by a fitter mutant sequence. A higher temperature produces more explorative trajectories, which may get trapped in low-fitness regions, while a lower temperature results in more exploitative trajectories, which may not advance beyond local optima. The number of MCMC iterations determines the length of the mutation trajectory. A large number of iterations may result in sequences that are highly mutated, lying far outside the distribution of the training data and resulting in poor discriminator ranking performance. Conversely, a small number of iterations results in smaller edits that may not be enough to reach the global optima of the fitness landscape. We refer readers to Biswas et al. (2021) for more details on the MCMC method. We conducted a grid search for the best MCMC parameters: mutation rate/temperature (4 orders of magnitude), MCMC iterations (3 orders of magnitude), and edit distance constraints (3 values), and report the best results in this paper.

The discriminator used in the baseline Gen-Disc algorithm is trained with a linear learning rate scheduler and a max learning rate of 1e-4. For SST-5, the discriminator is trained on the full training set, with a random 10% used as the validation set; it is trained with a learning rate of 1e-4 and a batch size of 16 for 25 epochs with the contrastive objective. For the ACE2 experiments, the discriminator is trained with a learning rate of 5e-6 and a batch size of 32 for 10 epochs with the contrastive objective. The baseline generator is trained with a language modeling objective by feeding in an empty string as the encoder's input and minimizing the cross-entropy loss between the decoder's output tokens and the training sample's tokens through teacher forcing. The generator is trained with a batch size of 16, a linear learning rate scheduler, and a max learning rate of 1e-4. The generator is trained for 25 epochs for the SST-5 experiments and for 12 epochs for the ACE2 experiments. It is trained on the full training set for the SST-5 experiments and on the top 50% most stable protein sequences for the ACE2 experiments.
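A minimal sketch of this empty-input language-modeling objective, assuming the HuggingFace Transformers T5 interface; batching, the learning-rate schedule, and multi-GPU details are omitted:

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# The encoder sees an empty string; the decoder is teacher-forced to reproduce a training sample.
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def generator_step(text: str) -> float:
    enc_ids = tokenizer("", return_tensors="pt").input_ids     # empty encoder input
    labels = tokenizer(text, return_tensors="pt").input_ids    # target = the training sample
    loss = model(input_ids=enc_ids, labels=labels).loss        # cross-entropy with teacher forcing
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```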
Although energy calculations with any software have limitations in accuracy, FoldX provides a computational technique to evaluate generation quality in high-throughputmaking it an attractive testbed for rapid experimentation of novel ML methods for attribute-enhanced generation in proteins. The protein used within this study was human ACE2, PDB:6LZG. We particularly focused on a subdomain that interacts with the RBD domain of SARS-CoV-2 S1 spike-spanning residues S19 to Q102 inclusive. FoldX 5.0 was used with the RepairPDB and BuildModel modules to extract the change in free energy with respective to wild-type for the given crystallographic structure. The oracle evaluation model's output was an average ddG value over 5 runs. We refer readers to Usmanova et al. (2018) for best practices. Generative modeling for protein structures Generalization and equilibrium in generative adversarial nets (gans) Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics A neural probabilistic language model Learning the protein language: Evolution, structure, and function Low-n protein engineering with data-efficient deep learning Conditioning by adaptive sampling for robust design Design by adaptive sampling Learning imbalanced datasets with label-distribution-aware margin loss Cocon: A self-supervised approach for controlled text generation Engineering human ace2 to optimize binding to the spike protein of sars coronavirus 2 Smote: synthetic minority over-sampling technique Class-balanced loss based on effective number of samples Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations Plug and play language models: A simple approach to controlled text generation Molgan: An implicit generative model for small molecular graphs Pre-training of deep bidirectional transformers for language understanding Autofocused oracles for model-based design Controlling linguistic style aspects in neural language generation Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy Automatic chemical design using a data-driven continuous representation of molecules Generating functional protein variants with variational autoencoders Adaptive machine learning for protein engineering Optimizing molecules using efficient queries from property evaluations Toward controlled generation of text Learning deep representation for imbalanced classification Ctrl: A conditional transformer language model for controllable generation Controlling output length in neural encoder-decoders Auto-encoding variational bayes Gedi: Generative discriminator guided sequence generation Model inversion networks for model-based optimization A stable variational autoencoder for text modelling Exploring the limits of transfer learning with a unified text-to-text transformer The foldx web server: an online force field Style transfer from non-parallel text by cross-alignment Recursive deep models for semantic compositionality over a sentiment treebank Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation Huggingface's transformers: State-of-the-art natural language processing Protein sequence design with deep generative models How neural networks extrapolate: From feedforward to graph neural networks Delving into deep imbalanced regression Unsupervised text style transfer using language 
models as discriminators Fine-tuning language models from human preferences Perplexity score for for GENhance model ablation variants and baseline techniques in the 200-Pos training setup, with varying top-K values. (↓ better) for all values below Figure 5 : Comparison between baseline candidate sampling techniques and GENhance. (left) In a baseline generator-discriminator setup, a generator is trained to model the distribution of training sequences while a discriminator is trained to predict input sequences' attribute values (∆∆G). When sampling candidates from the generator, the discriminator model ranks generated sequences which can then be evaluated by ground-truth oracle for evaluation. (center) In MH-MCMC, instead of a generator model, candidate sequences are sampled through an iterative process where highestranked generations are kept in the population and randomly mutated to generate offspring candidate sequences. (right) GENhance acts as a generator that conditions on an input sequence. Instead of a separately trained discriminator, the GENhance encoder is used to rank the candidate sequences. For GENhance, the training sample are both fed in as the MT5 encoder's input and used as the label for the decoder's output for the reconstruction objective. We use only the 125K most stable sequences as the training set to bias the generation towards more stable candidates. We use a warmup phase for GENhance where the WAE-MMD and cycle-consistency objectives are turned on in the midpoint of the training phase for better training stability . λ is set to 1 for all secondary objectives. For both 200-Pos and No-Pos, 10% of the training samples are randomly selected to make up the validation set. All GENhance models are trained with a linear learning rate scheduler and max learning rate of 1e-4. For SST-5, all GENhance variants are trained for 25 epochs, with batch size of 16. For the ACE2 task, GENhance w/o smoothing & CC (cycle-consistency) is trained for 12 epochs while the GENhance w/o CC (cycle-consistency) and full version are trained for 24 epochs for full convergence. The WAE-MMD objective used in our experiments largely follows the setting in Tolstikhin et al. (2017) . For a positive-definite reproducing kernel k : Z × Z → R, the maximum mean discrepancy (MMD k ) is:where H k is the RKHS of real-valued functions mapping Z to R. If k is characteristic then MMD k defines a metric and can be used as a divergence measure. Since MMD has an unbiased U-statistic estimator, it can be used together with stochastic gradient descent (SGD) during training in the following form:MMD k (P z , Q z ) = 1 n(n − 1) l =j k(z l , z j ) + 1 n(n − 1) l =j k(z l ,z j ) − 2 n 2 l,j k(z l ,z j ), {z 1 , . . . , z n } ∼ P z , {z 1 , . . . ,z n } ∼ Q z
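A minimal PyTorch sketch of this unbiased estimator; the plain RBF kernel below stands in for the random-feature Gaussian kernel (σ = 14, feature dimension 500) used in the experiments:

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 14.0) -> torch.Tensor:
    # Plain RBF kernel; the experiments use a random-feature approximation instead.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2.0 * sigma ** 2))

def mmd_unbiased(z_prior: torch.Tensor, z_enc: torch.Tensor, kernel=rbf_kernel) -> torch.Tensor:
    """Unbiased U-statistic estimator of MMD_k(P_z, Q_z), matching the equation above.

    z_prior: (n, d) samples from the prior P_z (a unit Gaussian here).
    z_enc:   (n, d) latent vectors produced by the encoder (~ Q_z).
    """
    n = z_prior.size(0)
    off_diag = 1.0 - torch.eye(n, device=z_prior.device)
    term_pp = (kernel(z_prior, z_prior) * off_diag).sum() / (n * (n - 1))
    term_qq = (kernel(z_enc, z_enc) * off_diag).sum() / (n * (n - 1))
    term_pq = 2.0 * kernel(z_prior, z_enc).sum() / (n * n)
    return term_pp + term_qq - term_pq
```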