key: cord-0261372-wcpq7zwn
authors: Wang, Jue; Lisanza, Sidney; Juergens, David; Tischer, Doug; Anishchenko, Ivan; Baek, Minkyung; Watson, Joseph L.; Chun, Jung Ho; Milles, Lukas F.; Dauparas, Justas; Expòsit, Marc; Yang, Wei; Saragovi, Amijai; Ovchinnikov, Sergey; Baker, David
title: Deep learning methods for designing proteins scaffolding functional sites
date: 2021-11-12
journal: bioRxiv
DOI: 10.1101/2021.11.10.468128
sha: 4448c2a5c5beccfb2a554a7a2a175d78dcf8d599
doc_id: 261372
cord_uid: wcpq7zwn

Current approaches to de novo design of proteins harboring a desired binding or catalytic motif require pre-specification of an overall fold or secondary structure composition, and hence considerable trial and error can be required to identify protein structures capable of scaffolding an arbitrary functional site. Here we describe two complementary approaches to the general functional site design problem that employ the RoseTTAFold and AlphaFold neural networks, which map input sequences to predicted structures. In the first “constrained hallucination” approach, we carry out gradient descent in sequence space to optimize a loss function that simultaneously rewards recapitulation of the desired functional site and the ideality of the surrounding scaffold, supplemented with problem-specific interaction terms. We use it to design candidate immunogens presenting epitopes recognized by neutralizing antibodies, receptor traps for escape-resistant viral inhibition, metalloproteins and enzymes, and target binding proteins with designed interfaces expanding around known binding motifs. In the second “missing information recovery” approach, we start from the desired functional site and jointly fill in the missing sequence and structure information needed to complete the protein in a single forward pass through an updated RoseTTAFold trained to recover sequence from structure in addition to structure from sequence. We show that the two approaches have considerable synergy, and AlphaFold2 structure prediction calculations suggest that the approaches can accurately generate proteins containing a very wide array of functional sites.

sequences to predicted structures, could be adapted for this purpose. Completely new proteins can be designed using trRosetta by starting from a random amino acid sequence and carrying out Monte Carlo sampling in sequence space, maximizing the probability that the sequence folds to some (unspecified) three-dimensional structure (12). We refer to this process as "hallucination" because it produces solutions that the network considers ideal proteins but that do not correspond to any actual natural protein (Fig. 1A); crystal and NMR structures confirm that the hallucinated sequences fold to the hallucinated structures (12). trRosetta can also be used to design sequences that fold into a target backbone structure by carrying out sequence optimization using a structure recapitulation loss function that rewards similarity of the predicted structure to the target structure (13). We sought to extend this approach to scaffold functional sites using trRosetta by sampling in sequence space with a combination of the hallucination loss, to favor folding to a unique structure, and a structure recapitulation loss, to favor formation of the desired functional site (rather than the entire structure as in (13); Fig. 1B; Methods). While we succeeded in generating structures with segments that closely recapitulated functional sites, Rosetta structure predictions suggested that the sequences poorly encoded the structures, and hence we used Rosetta design calculations to generate more optimal sequences (14). Several designs targeting PD-L1 generated by constrained hallucination with binding motifs derived from PD-1, followed by Rosetta design, were found to have binding affinities in the mid-nanomolar range (Fig. S1). While this experimental validation is encouraging, the requirement for sequence design using Rosetta is at odds with property (3) above: the joint design of sequence and structure.
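The sampling procedure described above can be sketched as a Metropolis Monte Carlo loop over single-residue mutations. This is only an illustrative sketch: `toy_loss` is a hypothetical stand-in for the network-derived hallucination and motif recapitulation losses (it does not call trRosetta), and the names `mcmc_design`, `w_motif`, and `temperature` are assumptions introduced here rather than parameters taken from the Methods.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_loss(seq, motif, motif_start, w_motif=1.0):
    """Hypothetical stand-in for the composite loss.

    In the paper both terms are derived from a structure prediction
    network. Here the 'hallucination' term merely rewards sequence
    complexity and the 'motif' term counts mismatches to a fixed motif,
    so that the loop below is runnable on its own.
    """
    halluc_term = -len(set(seq)) / len(AMINO_ACIDS)
    window = seq[motif_start:motif_start + len(motif)]
    motif_term = sum(a != b for a, b in zip(window, motif)) / len(motif)
    return halluc_term + w_motif * motif_term

def mcmc_design(length, motif, motif_start, steps=2000, temperature=0.05, seed=0):
    """Metropolis sampling in sequence space from a random start."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    loss = toy_loss(seq, motif, motif_start)
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        new_loss = toy_loss(seq, motif, motif_start)
        # Accept improvements always, worse moves with Boltzmann probability.
        if new_loss <= loss or rng.random() < math.exp((loss - new_loss) / temperature):
            loss = new_loss
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(seq), loss

designed, final_loss = mcmc_design(length=30, motif="HEHEH", motif_start=10)
```

Swapping `toy_loss` for a function that runs a structure prediction network and scores its output would give the general shape of the procedure; the weighting between the two terms is left as a free parameter here.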

Following the development of RoseTTAFold (15), we found that using it rather than trRosetta to guide motif-constrained hallucination produced designed protein sequences that more strongly encoded their structures (Fig. S2), likely reflecting the better overall modeling of protein sequence-structure relationships evidenced by the superior structure prediction performance (15). Constrained hallucination with RoseTTAFold has further advantages: because 3D coordinates are explicitly modeled (trRosetta only generates residue-residue distances and orientations), motif recapitulation can be assessed at the coordinate level, and additional problem-specific loss terms that assess interactions with a protein target can be implemented in coordinate space (Fig. 1B, 1D).

In the following sections, we explore the use of the constrained RoseTTAFold hallucination method to design proteins containing a wide range of functionally diverse motifs (Fig. 2-4, Table S1). It is impractical to experimentally validate many designs for many different applications; we instead evaluate these designs using the AlphaFold (AF) protein structure prediction network (16), which has very high accuracy on de novo designed proteins (17). Although RoseTTAFold (RF) was inspired by AF, the two models were developed and trained independently, and hence AF predictions can be regarded as an orthogonal in silico test of whether RF designed sequences fold into the intended structures, analogous to traditional ab initio folding benchmarks (13, 18).

For almost all problems, we obtained designs that are closely recapitulated by AF, with overall and motif RMSD typically <2 Å and <1 Å, respectively, and model confidence pLDDT > 80 (Table S2). While solving current challenges in protein design clearly requires making and characterizing proteins in the lab, this in silico AF test is well suited for testing the performance of design methods on a wide range of problems, and is quite stringent, as discussed below.
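The RMSD and confidence criteria above can be illustrated with a short sketch. It assumes that RMSD is computed after optimal rigid-body superposition (the Kabsch algorithm) of matched atoms; the function names `kabsch_rmsd` and `passes_af_filter` and the default thresholds (taken from the typical values quoted above) are illustrative and are not the authors' evaluation code.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between matched (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)  # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc  # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))

def passes_af_filter(overall_rmsd, motif_rmsd, plddt,
                     max_overall=2.0, max_motif=1.0, min_plddt=80.0):
    """True if a design meets the in silico criteria quoted in the text."""
    return overall_rmsd < max_overall and motif_rmsd < max_motif and plddt > min_plddt

# Usage: a rotated and translated copy of a point set has RMSD near zero.
rng = np.random.default_rng(0)
design = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
prediction = design @ Rz.T + np.array([1.0, -2.0, 3.0])
rmsd = kabsch_rmsd(design, prediction)
```

In practice the overall RMSD would be computed between the design model and the AF prediction, and the motif RMSD between the AF-predicted motif and the native motif.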

We first applied the constrained hallucination method to the problem of antigen presentation for immunogen design, where the goal is to scaffold a native epitope recognized by a neutralizing antibody as accurately as possible (and thus elicit antibodies binding the target protein upon immunization). Additional interactions with the target antibody are undesirable because the goal is to elicit antibodies recognizing the original antigen, and hence we incorporate an additional repulsive term, assessed on the 3D coordinates of the complex, in the composite loss function to penalize interactions with the antibody beyond those present in the epitope being scaffolded (Fig. 1D, S3). As a test case, we focused on respiratory syncytial virus, a leading cause of infant mortality whose F protein (RSV-F) contains antigenic epitopes for which structures with neutralizing antibodies have been determined (7, 9, 10). We sought to scaffold RSV-F site II, a contiguous helix-turn-helix motif that had previously been grafted successfully onto a 3-helix bundle architecture (7), as well as RSV-F site V, a helix-turn-strand motif that has not yet been scaffolded successfully (19). We were able to hallucinate designs for both epitopes with a variety of folds and with motifs recapitulated to sub-angstrom Cα RMSD in the AF predicted structure of the designed sequence (Fig. 2A, Fig. S8, S11; for these and all designs below, full amino acid sequences and PDB files are in the SM, and comparisons of the design models to AF predictions are in Fig. S8-10; since they are virtually identical, to save space we show only one of these in the main text figures).
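A repulsive term of the kind described above could take many forms; the sketch below is one generic possibility, not the functional form used in the paper (which is defined in the Methods and Fig. S3). It assumes a soft quadratic penalty on any non-motif scaffold atom that comes within a cutoff distance of the binding partner; `repulsive_loss`, `motif_mask`, and the 5 Å `cutoff` are hypothetical names and values.

```python
import numpy as np

def repulsive_loss(scaffold_xyz, partner_xyz, motif_mask, cutoff=5.0):
    """Soft penalty on contacts between non-motif scaffold atoms and a partner.

    scaffold_xyz: (N, 3) coordinates of the designed protein
    partner_xyz:  (M, 3) coordinates of the antibody or other partner
    motif_mask:   length-N booleans, True for motif positions (exempt)
    """
    scaffold_xyz = np.asarray(scaffold_xyz, dtype=float).reshape(-1, 3)
    partner_xyz = np.asarray(partner_xyz, dtype=float).reshape(-1, 3)
    motif_mask = np.asarray(motif_mask, dtype=bool)
    non_motif = scaffold_xyz[~motif_mask]
    if non_motif.size == 0 or partner_xyz.size == 0:
        return 0.0
    # Pairwise distances between non-motif scaffold atoms and partner atoms.
    dists = np.linalg.norm(non_motif[:, None, :] - partner_xyz[None, :, :], axis=-1)
    # Quadratic penalty that is zero beyond the cutoff.
    return float(np.sum(np.clip(cutoff - dists, 0.0, None) ** 2))

# Usage: the motif atom may touch the partner; the nearby scaffold atom is penalized.
scaffold = [[0.0, 0.0, 1.0], [3.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
partner = [[0.0, 0.0, 0.0]]
loss = repulsive_loss(scaffold, partner, motif_mask=[True, False, False])
```

Because the penalty excludes motif positions, contacts already present in the native epitope are left unpenalized while any new contacts made by the scaffold raise the loss.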

We next applied the hallucination method to the design of receptor traps, which neutralize viruses by mimicking their natural binding targets and thus are inherently robust against mutational escape. We again augmented the loss function with an explicit penalty on interactions beyond those present in the receptor to avoid opportunities for viral escape. As a test case, we scaffolded the interfacial helix of human angiotensin-converting enzyme 2 (hACE2) that interacts with the receptor-binding domain (RBD) of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein (20). The hallucinated hACE2 mimetics have a diverse set of helical topologies, and AF2 structure predictions recapitulate the binding interface with sub-Å accuracy (Fig. 2B, S8, S10).

We next explored the scaffolding of functional sites involved in metal binding and catalysis. We designed scaffolds around a di-iron binding site, which is important in biological systems for iron storage (21) and also potentially harnessable for catalysis (22, 23). The motif, composed of four roughly parallel helical segments from E. coli bacterioferritin (cytochrome b1), was recapitulated with sub-angstrom RMSDs (Fig. 3A), in scaffolds with quite different helix connectivities than the parent (Fig. S9). For the calcium-binding EF-hand motif (24), composed of a 12-residue loop flanked by helices, the hallucination method readily generates a variety of scaffolds recapitulating either 1 or 2 EF-hand motifs within 0.5 Å RMSD of the calcium binding motif (Fig. 3C). When tasked with scaffolding one EF-hand motif, the method chooses to buttress the loop with a helix, avoiding the need for another long loop.

We next sought to hallucinate enzyme active sites. Carbonic anhydrase II, which catalyzes the interconversion of carbon dioxide and bicarbonate, enables CO2 transport in humans (25), plays a key role in photosynthesis (26), and is emerging as a tool for CO2 sequestration (27). The active site contains 3 Zn2+-coordinating histidines (PDB ID 5yui: His94, His96, His119) on two strands, and a hydrophobic loop containing Thr199 which sequesters and orients the CO2. Despite the complexity of the irregular, discontinuous, 3-segment site, the method generated designs with sub-angstrom motif RMSDs and correct His placement for Zn2+ coordination (Fig. 3E, S9); these are less than 100 residues, significantly smaller than the 261-residue native protein.

To enable specification of sidechain geometry, we carried out iterative gradient descent using gradient information obtained by backpropagation through the AF neural network rather than RF, which currently does not explicitly model side chains (see Methods). As a test, we used the catalytic sidechain geometry of in mammals (28). In initial experiments, we were only able to obtain designs that fully recapitulated the catalytic sidechain geometry when optimization was over a multiple sequence alignment rather than a single sequence; the landscape may be too rugged with the high-resolution sidechain-based loss in the single-sequence case. To overcome this problem, we developed a two-stage approach, with a first stage using both AF and trRosetta (to reduce the structure-prediction resolution and thus smooth the loss landscape) and a description of the active site at the backbone level, followed by a second all-atom AF-only stage once the overall backbone was roughly in place. This two-stage approach led to multiple plausible solutions with predicted structures having a nearly exact match to the catalytic sidechain geometry (Fig. 3G, S9); however, we cannot use AF as an independent test of design accuracy in this case (given the very large number of model parameters, direct optimization against the output of a neural network has the potential to identify false optima, and hence independent in silico validation is important).

We next sought to design binding proteins which extend beyond an input binding motif to make additional favorable interactions with the target by explicitly including the sequence and structure of the target in the hallucination process (Fig. S6; Methods). We designed binders of the receptor for the anti-inflammatory cytokine interleukin-10 (IL-10) that incorporate one of the two discontinuous binding sites in the domain-swapped IL-10 dimer in a single chain; the resulting scaffolds recapitulate the IL-10 binding region within 0.5 Å (Fig. 4A, S10). Starting from the complement cascade protein C3d, which enhances immune responses to covalently attached antigens (29), we designed binders to complement receptor 2 (CR2), present on B cells and dendritic cells (30). The designs are much smaller (<100 AAs) than native C3d (306 AAs) and recapitulate the binding interface with sub-angstrom accuracy (Fig. 4B, S6C).

As a test of building around beta strand motifs, we sought to design binders of the immune checkpoint protein CTLA-4 starting from B7-2, which binds CTLA-4 through four beta strands. Starting from a single five-residue strand, hallucination in the presence of CTLA-4 generated designs having both alpha-beta and all-beta topologies with novel binding modes and comparable interface contacts to native B7-2 (Fig. 4C, S10). As expected, designs hallucinated in the presence of the target had considerably better Rosetta protein-protein interface metrics (4) (binding free energy, etc.) than those designed without the receptor (Fig. S6).

While quite powerful and general, the constrained hallucination approach is compute intensive, as a forward and backward pass through the network is required for each gradient descent step in sequence optimization. In the original training of RoseTTAFold for structure prediction, a small fraction (15%) of tokens in the MSA are masked, and the network learns to recover this missing sequence information in addition to predicting structure. We reasoned that this ability to recover sequence information along with structural information could provide a second solution to the functional site scaffolding problem: given a functional site description, a forward pass through the network could potentially be used to complete, or "inpaint", both protein sequence and structure (Fig. 1C; Methods). Here, the design challenge is formulated as an information recovery problem, analogous to the completion of a sentence given its first few words using language models (31) and completion of corrupted images using inpainting methods (32). As illustrated in Fig. 1E, a wide variety of protein structure prediction and design challenges can be similarly formulated as missing information recovery problems. We began from a RoseTTAFold model trained for structure prediction (15) and carried out further training on both fixed-backbone sequence design and fixed-sequence structure prediction tasks (Methods; Fig. S13; Algorithm S1). After training, the mean amino acid sequence recovery of the resulting model, denoted RF joint, on a de novo protein test set was 33% (Fig. 5A; this is similar to Rosetta fixed-backbone design performance), and there was also a slight increase in structure prediction accuracy (Fig. 5B). Thus, the model can both recover missing structure information given sequence and missing sequence information given structure.
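Two quantities in this paragraph, random masking of 15% of tokens and amino acid sequence recovery, can be made concrete with a small sketch. It operates on a single sequence rather than a full MSA, and the mask character `?` and the function names `mask_tokens` and `sequence_recovery` are assumptions introduced for illustration rather than details of the RoseTTAFold training code.

```python
import random

MASK = "?"  # hypothetical mask token

def mask_tokens(seq, fraction=0.15, seed=0):
    """Replace a random fraction of positions in seq with the mask token."""
    if not seq:
        return seq, []
    rng = random.Random(seed)
    n_mask = max(1, round(fraction * len(seq)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chosen = set(positions)
    masked = "".join(MASK if i in chosen else aa for i, aa in enumerate(seq))
    return masked, positions

def sequence_recovery(native, designed):
    """Fraction of positions at which the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(native, designed)) / len(native)

# Usage: mask 15% of a 20-residue sequence, then score a hypothetical design.
native = "ACDEFGHIKLMNPQRSTVWY"
masked, positions = mask_tokens(native, fraction=0.15, seed=1)
recovery = sequence_recovery("ACDE", "ACDF")
```

On this scale, the 33% recovery reported above corresponds to a value of about 0.33 averaged over the test set.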

We next considered design challenges where both sequence and structure information were missing for a portion of the protein. For smaller masked regions, the sequences and structures recovered by RF joint are close to those of the input native structure, and as the size of the masked regions increases the divergence of both sequence and structure increases as expected (Fig. S14) . The extent of variation in the resulting designs can be controlled by the amount of input sequence and structure information provided (Fig. S18C ). Since the calculations require a single forward pass (including recycling outputs back as input) through the network, only 1-10 seconds on an NVIDIA RTX2080 GPU (Methods) are required to generate both sequence and structure.

Encouraged by the excellent performance of RF joint on simultaneous sequence and structure recovery despite being only trained on recovery of one or the other, we sought to improve this further by explicitly training on joint sequence/structure recovery tasks. Sequence and structure diversity is useful when designing proteins containing functional motifs, as subtle variations in the structure of the motif can drastically affect function (33), and hence we trained this new model to predict the sequence and structure of masked regions between two provided residue coordinates, in the absence of structural and sequence information of the residues flanking the two residue coordinates (to force the model to place structural elements based more on larger protein context than the local structure of the immediately connected chain segments). With this second model, which we call RF joint2, the two residue coordinates can be varied at inference time, enabling the rapid generation of further sequence/structure diversity (Fig. 5D; a similar problem has been explored using Rosetta (33)). Of note, the degree of diversification in the inpainted region can be controlled by varying the distance by which the two residue coordinates are translated (Fig. 5D, left panel), while the structure of the templated (unmasked) protein remains remarkably stable.
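The perturbation of the two provided residue coordinates can be sketched as a random translation of each point within a sphere of a chosen radius; the Fig. 5D legend mentions translations of up to 8 Å in any direction. The sketch assumes sampling uniformly within the volume of the sphere, which may differ from the scheme actually used, and `random_translation` and `perturb_anchors` are hypothetical names.

```python
import numpy as np

def random_translation(max_dist, rng):
    """Draw a displacement uniformly from the ball of radius max_dist."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)  # uniform direction on the sphere
    # Cube-root scaling makes the radius uniform in volume, not in length.
    radius = max_dist * rng.uniform() ** (1.0 / 3.0)
    return direction * radius

def perturb_anchors(anchor_xyz, max_dist=8.0, seed=0):
    """Translate each anchor coordinate independently by up to max_dist."""
    rng = np.random.default_rng(seed)
    anchor_xyz = np.asarray(anchor_xyz, dtype=float).reshape(-1, 3)
    shifts = np.stack([random_translation(max_dist, rng) for _ in anchor_xyz])
    return anchor_xyz + shifts

# Usage: perturb the two coordinates flanking a masked region.
anchors = np.array([[0.0, 0.0, 0.0], [12.0, 3.0, -4.0]])
moved = perturb_anchors(anchors, max_dist=8.0, seed=42)
displacements = np.linalg.norm(moved - anchors, axis=1)
```

Increasing `max_dist` widens the range of anchor placements, which is the knob the text describes for tuning how far the rebuilt segment diverges.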

We next explored the use of RF joint and RF joint2 to generate complete protein structures around the functional sites described in Figs 2-4, and found that success depended on the size and context of the input functional region. With the RF joint model, we found that the best results were obtained for the more minimalist functional sites by first building up extended versions using the constrained hallucination approach. Many alternative structure and sequence completions can then be generated by RF joint in a network forward pass (Fig. 6A, Fig. S18). Almost all designs shown have sub-angstrom RMSD from the AF prediction to the native motif, <2 Å RMSD between design model and AF prediction (Fig. 6A, Fig. S19), and pLDDT > 80. Diverse ensembles of such solutions to a specific design challenge can be very rapidly generated by varying the input sequence and structure information (Fig. S18). While RF joint struggled to generate well-predicted proteins from native/minimalist motifs, we found that RF joint2 was able to generate complete and confidently predicted (by AF2) protein models from smaller regions, such as a single EF-hand motif (Fig. S18B). Further, RF joint2 could simultaneously scaffold two motifs while retaining good (<1 Å RMSD) alignment to both (Fig. 6B, top row). Remarkably, in some cases, RF joint2 was able to generate well-predicted scaffolds for complex, multi-chain motifs taken directly from a native crystal structure (Fig. 6B, middle and bottom row), as well as translationally symmetric proteins (Fig. S20), provided little more than the desired motif, in a single forward pass through the network.

Tests on the full range of challenges described here suggest that the two function design approaches are complementary: the constrained hallucination approach can build protein structures harboring minimalist functional sites but is quite compute and memory intensive since it requires a forward and backward pass (to generate gradient information to guide sequence optimization) through the neural network at each step of sequence optimization (Methods), while the missing information recovery method in most but not all cases requires extended functional site description but is much less compute intensive, and generally outperforms the hallucination method when more starting information is provided, as illustrated by the lower RMSDs on constrained regions (Fig. S15) . This difference in performance can be understood by considering the manifold in sequence-structure space corresponding to folded proteins; the space of all possible sequence-structure pairs is far larger than the set of sequence-structure pairs of folded proteins, and hence this manifold occupies a tiny fraction of the overall space.

The missing information recovery approach can be viewed as projecting an incomplete or corrupted input sequence-structure pair onto the subset of this manifold (as represented by RoseTTAFold) containing the functional site. If insufficient starting information is provided, this projection is not necessarily well determined, but with sufficient information it readily produces protein-like solutions, updating sequence and structure information simultaneously. The loss function used in the hallucination approach is constructed with the goal that minima lie in the protein manifold, but there will likely not be a perfect correspondence, and hence stochastic optimization of the loss function in sequence space may not produce solutions as protein-like as those of the inpainting approach. On the other hand, since stochastic search can be initiated from any starting point, the hallucination approach can start from minimalist functional site descriptions, or, as in the fully unconstrained case (12), no sequence and structural information at all.

New protein design methods have traditionally been evaluated by experimental testing, and for actual applications it is essential to make and characterize proteins in the lab. The high structure prediction accuracy of AF2 now enables evaluation of new design methodology in silico, which has the considerable advantage that a much wider variety of design challenges can be evaluated. In the work described here, AF2 was not used for any of the design calculations except for the sidechain active site design case of Fig. 3G, and hence provides an independent test of design accuracy. Both the backbone design challenge (generating a plausible protein backbone with a geometry capable of hosting a desired site) and the sequence design challenge (generating a sequence which strongly encodes this backbone) are quite formidable.

For the backbone design problem, the very large set of structures predicted for naturally occurring proteins using AF and recently made available (34) provides an excellent point of comparison: for the RSV-F site V immunogen design challenge described above, the frequency of non-homologous proteins in the AF proteomes database and the Protein Data Bank (PDB) (35) matching the functional site with equal or lower RMSDs than our designs was 3.9 × 10⁻⁶ (Fig. S17; Supplementary Text); similarly low frequencies of suitable natural scaffolds in the PDB were observed for other targets (Table S3). For the sequence design problem, the accuracy of native protein structure prediction based on single amino acid sequences provides a point of comparison; as shown in Fig. S16, our designs are predicted more confidently from sequence than the vast majority of native proteins with known crystal structures, and on par with structurally validated de novo designed proteins. This success in designing sequences confidently predicted to fold to structures harboring a wide range of functional sites derives in part from a key advance over classical protein design pipelines, which treat backbone generation and sequence design as two separate problems: our methods simultaneously generate both sequence and structure, taking advantage of the ability of RoseTTAFold to reason over and jointly optimize both data types.

The deep learning methods presented here are quite general, requiring no inputs other than the structure and sequence of the desired functional site. Unlike current non-deep-learning methods, they do not require specification of the secondary structure or topology of the scaffold, and they simultaneously generate both sequence and structure. Despite a recent surge of interest in using machine learning to design protein sequences (36-43), the design of protein structure is relatively underexplored, likely due to the difficulty of efficiently representing and learning structure (44). Generative adversarial networks (GANs) and variational autoencoders (VAEs) trained on specific fold families have been used to design biophysically plausible protein backbones (45, 46), but not ones containing functional sites. RoseTTAFold and AlphaFold have been trained on the entire PDB, and thus generalize from a very wide range of known protein structures. Our "activation maximization" hallucination approach enables the use of arbitrary loss functions tailored to specific problems, for any sequence length, without retraining.

Complementary to this, the ability of our "missing information recovery" inpainting approach to expand from a given functional site to generate a coherent sequence-structure pair should find wide application in protein design because of its speed and generality. The combination of the two approaches is more powerful than either one alone, as ensembles of solutions to a given functional design problem can be generated very rapidly using the second approach starting from extended site descriptions identified in the first. The hallucination approach could, in theory, also be used to refine the more extensive designs generated by inpainting. The two approaches individually, and the combination of the two, should increase in power as more and more accurate protein structure, interface, and small molecule binding prediction networks are developed moving forward.

Kemp elimination catalysts by computational enzyme design

Computational Design of an Enzyme Catalyst for a Stereoselective Bimolecular Diels-Alder Reaction

Robust de novo design of protein binding proteins from target structural information alone

Massively parallel de novo protein design for targeted therapeutics

A Computationally Designed Inhibitor of an Epstein-Barr Viral Bcl-2 Protein Induces Apoptosis in Infected Cells

Proof of principle for epitope-focused vaccine design

De novo design of potent and selective mimics of IL-2 and IL-15

De novo protein design enables the precise induction of RSV-neutralizing antibodies

Bottom-up de novo design of functional proteins with complex structural features

Improved protein structure prediction using predicted interresidue orientations

Protein sequence design by conformational landscape optimization

Accurate prediction of protein structures and interactions using a three-track neural network

Highly accurate protein structure prediction with AlphaFold

Single-sequence protein structure prediction using language models from deep learning

Ab initio protein structure prediction of CASP III targets using ROSETTA

A novel pre-fusion conformation-specific neutralizing epitope on the respiratory syncytial virus fusion protein

Structure of a unique twofold symmetric haem-binding site

De Novo Design of Four-Helix Bundle Metalloproteins: One Scaffold, Diverse Reactivities

Artificial diiron proteins: From structure to function

Calcium Signaling

Tracking solvent and protein movement during CO2 release in carbonic anhydrase II crystals

The Role of Carbonic Anhydrase in Photosynthesis

Investigating the Application of Enzyme Carbonic Anhydrase for CO2 Sequestration Purposes

Crystal Structure of Δ5-3-Ketosteroid Isomerase from Pseudomonas testosteroni in Complex with Equilenin Settles the Correct Hydrogen Bonding Scheme for Transition State Stabilization

C3d of Complement as a Molecular Adjuvant: Bridging Innate and Acquired Immunity

C3d enhancement of antibodies to hemagglutinin accelerates protection against influenza virus challenge

Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]

Semantic Image Inpainting with Deep Generative Models. arXiv:1607.07539 [cs]

Computational Protein Design Quantifies Structural Constraints on Amino Acid Covariation

Highly accurate protein structure prediction for the human proteome

The Protein Data Bank

Generative models for graph-based protein design

Fast and Flexible Protein Design Using Deep Graph Neural Networks

Low-N protein engineering with data-efficient deep learning

Expanding functional protein sequence space using generative adversarial networks

Protein design and variant prediction using autoregressive generative models

Protein sequence design with deep generative models

Protein sequence design with a learned potential

Structure-based protein design with deep learning

Fully differentiable full-atom protein backbone generation

Categorical Reparameterization with Gumbel-Softmax. arXiv:1611.01144 [cs, stat]

A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation

Fast differentiable DNA and protein sequence optimization for molecular design

Adam: A Method for Stochastic Optimization

Protein Folding Neural Networks Are Not Robust

Adversarial Examples Are Not Bugs, They Are Features. arXiv:1905.02175 [cs, stat]

Perceiver: General Perception with Iterative Attention

A solution for the best rotation to relate two sets of vectors

The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design

De novo design of protein homo-oligomers with modular hydrogen-bond network-mediated specificity

Improved protein structure refinement guided by deep learning based accuracy estimation

Computational Design of Ligand Binding Proteins

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Exploring the repeat protein universe through computational protein design

Comparison of multiple Amber force fields and development of improved protein backbone parameters

Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules

Structure and Dynamics of PD-L1 and an Ultra-High-Affinity PD-1 Receptor Mutant

Structural basis of respiratory syncytial virus neutralization by motavizumab

Structural basis of receptor recognition by SARS-CoV-2

A Closed Compact Structure of Native Ca2+-Calmodulin

Structure of complement receptor 2 in complex with its C3d ligand

Structural basis for co-stimulation by the human CTLA-4/B7-2 complex

We would like to thank Luki Goldschmidt for maintaining the computational resource in the IPD; 

Authors declare that they have no competing interests.

All code will be made publicly available upon publication.

Supplementary materials
- Materials and Methods
- Supplementary Text
- Figures S1-S21
- Tables S1-S3
- Algorithm S1
- Data S1

(A) Free hallucination. At each iteration, a sequence is passed to the trRosetta or RoseTTAFold neural network, which predicts 3D coordinates and residue-residue distances and orientations (Fig. S3), which are scored by a loss function that rewards certainty of the predicted structure. The sequence is updated either by backpropagating the gradient of the loss to the inputs or by MCMC, and passed back into the network for the next iteration. (B) Constrained hallucination. Same approach as in (A), but the loss function rewards motif recapitulation and other task-specific functions in addition to structural certainty. (C) Missing information recovery. Partial sequence and/or structural information is input into the network, and complete sequence and structure are output. (D) Design problems that can be addressed by constrained hallucination, and the corresponding loss functions (Fig. S3; Methods). (E) Protein design challenges formulated as missing information recovery problems. Colors in all panels: native functional motif (orange); hallucinated scaffold (gray); constrained motif (purple); binding partner (blue); non-masked region (green); masked region (light gray, dotted lines).

(A) Design of proteins scaffolding immunogenic epitopes on RSV protein F (site II: PDB 3IXT chain P residues 254-277; site V: 5TPN chain A residues 163-181). Comparisons of the RF hallucinated models to unbiased AF2 structure predictions from the design sequence are in Fig.  S8 ; here because of space constraints we show only the AF2 model; the two are very close in all cases. Here and in the following figures, we assess the extent of success in designing sequences which fold to structures harboring the desired motif through two metrics computed on the AF2 predictions: prediction confidence (AF pLDDT), and the accuracy of recapitulation of the original scaffolded motif (motif RMSD AF versus native). For RSV-F designs, these metrics are rsvf_ii_141 (85.0, 0.53 Å), rsvf_ii_158 (82.9, 0.51 Å), rsvf_ii_171 (88.4, 0.69 Å); rsvf-v_854 (81.5, 0.75 Å); rsvf-v_870 (80.4, 0.76 Å). (B) Design of COVID-19 receptor trap based on ACE2 interface helix (6VW1 chain A residues 24-42). Design metrics: ace2_76 (89.1, 0.55 Å); ace2_1157 (80.4, 0.47 Å); ace2_1007 (83.3, 0.57 Å). Colors: native protein scaffold (light yellow); native functional motif (orange); hallucinated scaffold (gray); hallucinated motif (purple); binding partner (blue). See Table S2 for additional metrics on each design. 

Colors: native protein scaffold (light yellow); native functional motif (orange); hallucinated scaffold (gray); hallucinated motif (purple); bound metal (blue). Active site residues shown for boxed designs in panel B, D, F, and H for di-iron, EF-hand, carbonic anhydrase II, and

(C) Given a template sequence and structure (green) with regions of both sequence and structure masked (gray), RF joint can recover the missing sequence and structure in a single forward pass. The sequence and structure in contiguous regions of test set protein 2KL8 were both masked prior to input into RF joint. Top row: alpha helix. Middle row: four-strand beta sheet. Bottom row: a 10-residue loop. (D) RF joint2 builds sequence/structure between two given residue coordinates, which enables tunable diversification of rebuilt segments. The depicted gray region was masked from 2KL8, and the two coordinates shown in red were randomly translated up to 8 Å in any direction (within the illustrated red spheres). RF joint2 is able to build back an ensemble of helical inpainted regions (right panel, AF2 predictions, AF2 pLDDT > 0.8 for all designs shown). Increasing structural diversity could be achieved in the central inpainted region (in both the RF inpainted structure models and the AF2 structure predictions of the inpainted sequences) by increasing the distance by which the red coordinates could be translated (left graph, gray points) without substantial disruption to the remainder of the template structure (left graph, green points, n=5000 structures/point).

Design of proteins harboring functional motifs via information recovery using RF joint and RF joint2 . All structures of designs shown are the AF2 prediction of that design. In all cases, template inputs (sequence and structure) that are functional and their corresponding outputs are colored in purple, template inputs that are not directly related to function are in green, along with their corresponding outputs. Functional template inputs derived from a native structure are in orange, with corresponding outputs in purple. Depicted in gray are the regions of sequence and structure masked from the original protein (input column) or that were generated via RF joint /RF joint2 (output column). 
