key: cord-0767727-rrtncmph authors: Yu, Alvin; Pak, Alexander J.; He, Peng; Monje-Galvan, Viviana; Casalino, Lorenzo; Gaieb, Zied; Dommer, Abigail C.; Amaro, Rommie E.; Voth, Gregory A. title: A Multiscale Coarse-grained Model of the SARS-CoV-2 Virion date: 2020-11-28 journal: Biophys J DOI: 10.1016/j.bpj.2020.10.048 sha: 3a3d7b5041978287ea0b5ee219af0c1ae2c41999 doc_id: 767727 cord_uid: rrtncmph The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the COVID-19 pandemic. Computer simulations of complete viral particles can provide theoretical insights into large-scale viral processes including assembly, budding, egress, entry, and fusion. Detailed atomistic simulations, however, are constrained to shorter timescales and require billion-atom simulations for these processes. Here, we report the current status and on-going development of a largely “bottom-up” coarse-grained (CG) model of the SARS-CoV-2 virion. Structural data from a combination of cryo-electron microscopy (cryo-EM), x-ray crystallography, and computational predictions were used to build molecular models of structural SARS-CoV-2 proteins, which were then assembled into a complete virion model. We describe how CG molecular interactions can be derived from all-atom simulations, how viral behavior difficult to capture in atomistic simulations can be incorporated into the CG models, and how the CG models can be iteratively improved as new data becomes publicly available. Our initial CG model and the detailed methods presented are intended to serve as a resource for researchers working on COVID-19 who are interested in performing multiscale simulations of the SARS-CoV-2 virion. The onset of the global COVID-19 pandemic has brought intense investigation into the molecular components of SARS-CoV-2 encoded by the virus's 30 kilobase (kb) genome. Structural biology efforts using cryo-electron microscopy (cryo-EM) and x-ray crystallographic techniques are currently reporting new structures of viral proteins every week (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) , and computational structure prediction efforts are targeting unresolved sections of the genome using a variety of protein folding algorithms. Computational and experimental studies are underway to find new molecular therapeutics that can inhibit viral activity or further elucidate the mechanisms of action of SARS-CoV-2 proteins (13) (14) (15) (16) . The computer simulation of large-scale SARS-CoV-2 processes, such as virion assembly, budding, entry, and fusion, will remain intrinsically challenging to investigate using all-atom (AA) molecular dynamics (MD), owing to the computational cost of meaningfully simulating the hundred of millions to billions of atoms involved. A holistic model of the SARS-CoV-2 virion can provide insight into the mechanisms of large-scale viral processes and the collective behavior of macromolecules involved in viral replication and infectivity. SARS-CoV-2 virions contain four main structural proteins: the spike (S), membrane (M), nucleocapsid (N), and envelope (E) proteins (17) . S proteins are glycosylated trimers that mediate fusion and entry, in part by attaching enclosed fusion peptide sequences into the membranes of host cells (18) . M proteins appear as dimeric complexes embedded within the virion envelope, and are believed to anchor ribonucleoprotein complexes to the envelope (19, 20) . N proteins associate with and organize RNA into ribonucleoprotein structures found in the interior of virions (21, 22) . Lastly, E proteins are thought to form pentameric ion channels that are found at the lipid bilayers of virion membranes and contribute to viral budding (23) . In this paper, we construct a largely "bottom-up" coarse-grained (CG) model of the SARS-CoV-2 virion from the currently available structural and atomistic simulation data on SARS-CoV-2 proteins. In general, this model serves as a resource for researchers working on COVID- 19 , and as a platform to incorporate computational and experimental data. This model also enables new multiscale studies of SARS-CoV-2 processes to possibly help find treatment and J o u r n a l P r e -p r o o f prevention strategies against COVID-19. Atomistic trajectory and experimental structural data deposited in the NSF Molecular Sciences Software Institute (MolSSI) will be incorporated, as they become publicly available (24) . In this work, we detail several of our CG methods used to iteratively develop a CG model for the full SARS-CoV-2 virion, in which molecular interactions between CG particles are derived using a combination of phenomenological, experimental, and atomistic simulation approaches. Building models from structural data We first constructed atomic models of the structural proteins of the SARS-CoV-2 virion, ( Figure 1 ). AA models of the open and closed state of the S protein were built based on the cryo-EM structures of the spike ectodomain (PDB ID: 6VYB, 6VXX) (5), respectively, and atomic models of the N protein were constructed based on the x-ray crystallographic structure of the nucleocapsid NTD (PDB ID: 6M3M) (25) . Glycosylation sites were modeled using Glycan Reader & Modeler in CHARMM-GUI (26) and the site-specific glycoprofile derived from mass spectrometry and cryo-EM analysis (27, 28) . Homology models for the S protein stalk, including the HR2 and TM domains were assembled as α-helical trimeric bundles using MODELLER (29) on the basis of secondary structure assignments in JPred4 (30) . Homology models for the SARS-CoV-2 N protein CTD were created from the x-ray crystallographic structure of the SARS-CoV N protein CTD (PDB ID: 2CJR) (31) . Missing amino acid backbones in loop regions were built in MODELLER, and side chain rotamers were built using SCWRL4 (32) . We used atomic models for the M protein dimer (33) and the pentameric E ion channel (34) that were developed by homology. AA protein models (see discussion below) were subsequently simulated and coarsegrained to generate the CG models ( Figure 2 and see sections below). A previously developed CG model for lipids was used, consisting of three CG beads per lipid and distinct bead types for lipid head groups and hydrophobic tails (35) . A single component CG lipid bilayer was generated in a spherical configuration and equilibrated using CG molecular dynamics simulations under constant NVT conditions in LAMMPS (36) . We note that in the future more complex CG lipid J o u r n a l P r e -p r o o f models (37) can be added. Transmembrane segments of component membrane proteins were visually identified and assigned based on secondary structure motifs. Individual lipids on the outer leaflet of the spherical bilayer were randomly selected and used as initial positions for embedding spike, membrane, and envelope proteins. For each initial position, the center-of-mass of the transmembrane domain was aligned with the center of the lipid bilayer, and the principal axis of the protein was aligned with the vector normal to the lipid bilayer. Transmembrane regions were then substituted for the overlapping CG lipids to embed the proteins. The procedure was iterated until a spike, membrane, and envelope protein density on the virion surface was achieved that was approximately consistent with current available experimental estimates of ~25, 1000, and 20 per virion, respectively, from cryo-EM and biochemical data (38) (39) (40) . The diameter of the membrane envelope is approximately 100 nm and 120 nm including the S proteins on the virion surface. As higher-resolution experimental data are released, the overall structure of this model can be refined. Two glycosylated models of the open and closed spike were inserted into a symmetric 225 Å × 225 Å lipid bilayer mimicking the composition of the endoplasmic reticulum-Golgi intermediate compartment (41, 42) . The lipid patch was built using CHARMM-GUI. The complete proteinmembrane system was solvated with using the TIP3P water model (43) , and neutralized with chloride and sodium ions to maintain a 150 mM concentration. Each system contained ~1.7 million atoms. Minimization and equilibration was performed using the CHARMM36 forcefield (44, 45) and NAMD 2.14 (46) . Production runs were performed in the NPT ensemble using a Langevin thermostat at 310K and Nosé-Hoover Langevin barostat at 1 atm. All production runs used a 2-fs The CG model of the glycosylated S protein (Figure 2 A) was parameterized from the AA MD simulations described above (47) . Reference statistics used conformations sampled equally from both open and closed states with AA trajectories spanning the 3.0 and 1.8 µs, respectively. First, the protein was mapped to CG beads using essential dynamics coarse-graining (EDCG) (48) . We used 60 and 50 CG beads for the S 1 and S 2 domains, respectively, and the 22 N-linked glycans were each mapped to a single bead. Intra-protein interactions were represented as a heteroelastic network model (hENM) with bond energies − where is the spring constant of a particular CG bond and is the equilibrium bond length. These parameters were optimized using the hENM method (49) . Inter-protein interactions within the S-trimers were composed of excluded volume, attractive, and screened electrostatic terms. For excluded volume interactions, a phenomenological soft cosine potential, 1 + cos , was used, where = 25 kcal/mol and is the onset for excluded volume. Attractive, non-bonded interactions between inter-protein contacts were modeled as the sum of two Gaussian potentials, where and + are the mean and standard deviation determined by a fit to the pair correlation between CG sites i and j through least-squares regression. The constants. and , were optimized using relative-entropy minimization (REM). Screened electrostatics were modeled using Yukawa potentials, was developed by linearly mapping the amino acid sequence to particles at a resolution of 1 CG bead per 5 amino acids ( Figure 2B ). Several studies suggest that the C-terminal domain (CTD) of the N protein assembles into a helix that contains two RNA binding grooves (21, 51) . Based on these studies, we constructed atomic and CG models of the viral ribonucleoprotein complex (vRNP) by iterating between CG and AA simulations. We first constructed an atomic model of the N protein CTD helix with two RNA binding grooves by stacking 3 copies of the CTD octamer structure (PDB: 2CJR), which is composed of 4 CTD dimers and homology modeled from the X-ray crystallographic structure of the SARS-CoV N protein CTD (31) . The CTD helix was simulated in the CHARMM36m force field for 400 ns. We then constructed the CTD helix model using EDCG combined with hENM followed by manually placing CG RNA beads into the groove of the helix ( Figure 2D ). The positions of the CG beads were used as restraints to build an atomic model of the vRNP complex. Finally, the vRNP model was relaxed and simulated in the CHARMM36m force field for 400 ns. It is important to note that recent cryo-EM studies have found granule-like densities within the virion for the vRNP complex (22) . Structural detail into how CTD oligomers (including the previously proposed helical model) and RNA fit into these densities will likely require higher-resolution images. Several computational approaches have been developed to build or refine CG models using data from AA or fine-grained simulations. Our approach to coarse-graining the SARS-CoV-2 virion is to couple several CG methods in a hierarchical fashion. CG sites or "beads" are mapped from atomic structures using EDCG, a method designed to preserve the principal modes of motion sampled during atomic-level simulations (48) . In EDCG, a given CG mapping operator, ; < = : ? @ → < = , that relates the configurations of the atomistic trajectory (? @ to that of the CG model (< = ), is variationally optimized using simulated annealing. Typically, the mapping is constructed so that J o u r n a l P r e -p r o o f contiguous segments of a protein's primary amino acid sequence are mapped to distinct CG sites. For a fixed number of CG beads, B, the sets of atoms that are mapped to CG sites are adjusted to minimize the target residual: where, P = 1, … , B is the CG site index, the brackets, 〈•〉 K , denote a time-averaged quantity, the sum over 5, O is a sum over all unique pairs in the set of atoms belonging to the CG site, P, and where, c is the Hessian: where \ is the number of iterations and s is a parameter that controls the magnitude of the adjustment for each iteration. For the intermolecular interactions between proteins, non-bonded CG interactions are determined either using force matching (aka multiscale coarse-graining, or MS-CG) (52, 53) or REM approaches (54, 55) . In MS-CG, the CG potential is constructed from a linearly independent basis set where the functional forms for the basis potentials, w z (e.g., B-splines, Lennard-Jones, etc), and the number of them, B z , are determined by the user. The coefficients, ~y z •, are variationally optimized such that the following residual is minimized: J o u r n a l P r e -p r o o f where ƒ ‚ x @ = −"w x @ is the CG force and • ‚ H @ is the atomistic force on the CG site P. Similarly, in the REM approach, the objective function for minimization is the Kullback-Liebler divergence, which provides a metric for the differences between the atomistic and CG probability distributions where X vv H Š = ' vv % " %"• --H Š and X tu H Š = ' tu % " %"• -˜ H Š in the canonical ensemble, and ' is the configurational partition function. Furthermore, the relative entropy can be expressed as a difference between the potential energy and free energy of the atomistic and CG ensembles: where = − š Z log '. Minimization of the relative entropy is performed using iterative Newton-Raphson techniques. It is important to note, however, that the quality and fidelity of such CG models is determined by the molecular behavior sampled in the underlying AA simulations. Macromolecular complexes, such as virions, undergo a wide range of behavior including physical and chemical transitions that will be difficult to capture through AA simulations alone, as well as experimental techniques. This is especially true for processes that involve large conformational changes that are not sampled effectively in AA simulations, whether due to the long timescales required, free energy barriers, or inherent limitations of the simulation force field. For instance, the S protein of SARS-CoV-2 exhibits two proteolytic cleavage sites (at the S1/S2 and S2' locations) as well as binding to the host cell receptor, angiotensin-converting enzyme 2 (ACE-2). Cleavage To address these issues, one can use CG molecular simulation techniques that allow CG particles to adaptively switch discrete "states" and interactions, such as "Ultra-Coarse-Graining" (UCG) (57) (58) (59) . In the limit of infrequent internal state switches, UCG implements microscopically reversible state changes that are coupled to a Metropolis Hastings like criterion: where › 2→3 is the instantaneous switching rate from state 5 to O , w 3 − w 2 is the CG effective potential energy difference between state O and 5 , and the rates, 2→3 and 3→2 are model parameters either treated as input or calculated from atomistic simulations. This approach is similar to hybrid kinetic Monte-Carlo and MD methods, but with a spatial kinematic component, and it can be used to examine the transitions of the spike (its "states") that lead to the fusion of SARS-CoV-2 with host cells. Experiments can probe longer timescales than are available from AA MD simulations. In recent cryo-EM images of SARS-CoV-2 particles, the S1 domain of the S protein was found to transiently open and close in order to bind the ACE-2 receptor (3, 5) , which are subtle conformational changes that are difficult to sample in atomistic simulations. For these conformational changes -in the case that they cannot be treated as discrete state switchesplastic network models (60) or multi-configuration coarse graining (MCCG) methods (61) can be used to construct a CG model that continuously transitions from one state to the next. For plastic network models, two known experimental configurations of the protein are used to build a multibasin ENM that represents deviations away from each of the individual conformational minima. A phenomenological interaction Hamiltonian is constructed that couples and mixes the ENMs between the two structural endpoints. In MCCG, the primary difference is that the coupling terms in the Hamiltonian are constructed from a two-state mixing approach, derived on the basis of a mapped potential of mean force which is explicitly computed from AA simulation data along collective variables that distinguish between the two (or more) conformational states at a CG level. An alternative (and sometimes necessary "top-down") route to deriving CG models is to construct a model Hamiltonian and then analyze the model's resulting behavior in the context of the assumed interactions. Typically, parameterization of such models is designed to fit or reproduce particular observables measured in experimental data and perhaps particular sets of AA simulation data. These can be performed based on variational optimization of some systemspecific functional that depends on the experimental observable. Model Hamiltonian approaches have the advantage that physical intuition is clearer, but are not systematic, since each new problem requires a different treatment for the set of interactions involved. Furthermore, these approaches often require orthogonal experiments to validate the underlying model. Such coarsegraining methods are, however, especially useful in cases where atomistic simulation is difficult or infeasible to obtain on the system or if the bottom-up methods described above are not expected to yield converged results for the effective CG potential, owing to limited atomistic sampling. Here we present results from the first CG simulations of the SARS-CoV-2 virion (Figure 3 ). It should be noted that these are early results and we can thus expect additional simulations to become available from this model as more experimental data and AA simulations become available for the various virion components. In addition, the overall CG methodology and modeling of the virion will continue to evolve, and are works in progress. A CGMD simulation was performed on the complete CG virion model using LAMMPS for 10 r 10 • CG timesteps (see Supplementary Movie 1). The system was energy minimized using conjugate gradient descent. A temperature of 300K was maintained with a Langevin thermostat, with a damping constant (j žŸg = 10 ps and 100 fs timestep. Statistics were collected every 100 CG timesteps. Several radial distributions functions (RDFs) or pair correlations between CG particles were computed for the MD trajectories of individual S proteins and compared to the mapped AA reference statistics from which the models were derived (Figure 4 A) . In general, the CG model captured the positions and peaks in the pair correlation functions; however, error in the J o u r n a l P r e -p r o o f fine structure of the peaks was also present, indicating that refinement involving the addition of more expressive basis CG potentials (e.g., splines) may be necessary. We performed principal component analysis (PCA) on a subset of the CG particles to examine collective modes of motion of the virion ( Figure 4B ). The Cartesian coordinates of 1 particle for every 15 CG lipids, 1 for every M and E protein, and 1 for every 3 S particles were extracted from the trajectory data and used to compute the covariance matrix, This work provides an initial CG molecular model of the SARS-CoV-2 virion and details a bottomup CG approach capable of further refining the model as new atomistic and experimental data becomes available. Currently, the lipid envelope is described using a particle-based The Protein Data Bank Announcing the worldwide Protein Data Bank Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2 Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Structural basis of receptor recognition by SARS-CoV-2 Structure of replicating SARS-CoV-2 polymerase Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid nanodiscs Structural model of the SARS coronavirus E channel in LMPG micelles Molecular Basis for ADP-Ribose Binding to the Mac1 Domain of SARS-CoV-2 nsp3 Activity profiling and structures of inhibitor-bound SARS-CoV-2-PLpro protease provides a framework for anti-COVID-19 drug design Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors Elucidation of cryptic and allosteric pockets within the SARS-CoV-2 protease 2020. Targeting SARS-CoV-2 with AI-and HPC-enabled Lead Generation: A First Data Release Developing a Fully Glycosylated Full-Length SARS-CoV-2 Spike Protein Model in a Viral Membrane The Molecular Biology of Coronaviruses Distinct conformational states of SARS-CoV-2 spike protein Supramolecular Architecture of Severe Acute Respiratory Syndrome Coronavirus Revealed by Electron Cryomicroscopy The M, E, and N structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles The SARS coronavirus nucleocapsid protein--forms and functions Molecular architecture of the SARS-CoV-2 virus. Cell. S0092867420311594 Coronavirus envelope protein: current knowledge A Community Letter Regarding Sharing Biomolecular Simulation Data for COVID-19 Crystal structure of SARS-CoV-2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites CHARMM-GUI: A web-based graphical user interface for CHARMM Site-specific glycan analysis of the SARS-CoV-2 spike Deducing the N-and Oglycosylation profile of the spike protein of novel coronavirus SARS-CoV-2 Comparative Protein Structure Modeling Using MODELLER JPred4: a protein secondary structure prediction server Structure of the SARS Coronavirus Nucleocapsid Protein RNA-binding Dimerization Domain Suggests a Mechanism for Helical Packaging of Viral RNA Improved prediction of protein side-chain conformations with SCWRL4 Modeling of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Proteins by Machine Learning and Physics-Based Refinement Structural Genomics of SARS-CoV-2 Indicates Evolutionary Conserved Functional Regions of Viral Proteins Efficient Simulation of Tunable Lipid Assemblies Across Scales and Resolutions Fast Parallel Algorithms for Short-Range Molecular Dynamics Systematic Coarse-Grained Lipid Force Fields with Semiexplicit Solvation via Virtual Sites Structures and distributions of SARS-CoV-2 spike proteins on intact virions In situ structural analysis of SARS-CoV-2 spike reveals flexibility mediated by three hinges SARS-CoV-2 (COVID-19) by the numbers. eLife Membrane lipids: where they are and how they behave Membrane Lipid Composition: Effect on Membrane and Organelle Structure, Function and Compartmentalization and Therapeutic Avenues Comparison of simple potential functions for simulating liquid water Optimization of the Additive CHARMM All-Atom Protein Force Field Targeting Improved Sampling of the Backbone ϕ, ψ and Side-Chain χ1 and χ2 Dihedral Angles CHARMM36m: an improved force field for folded and intrinsically disordered proteins Scalable molecular dynamics with NAMD Beyond Shielding: The Roles of Glycans in the SARS-CoV-2 Spike Protein A Systematic Methodology for Defining Coarse-Grained Sites in Large Biomolecules Systematic Multiscale Parameterization of Heterogeneous Elastic Network Models of Proteins On the Dielectric "Constant" of Proteins: Smooth Dielectric Function for Macromolecular Modeling and Its Implementation in DelPhi SARS-CoV-2 structure and replication characterized by in situ cryo-electron tomography A Multiscale Coarse-Graining Method for Biomolecular Systems The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models The relative entropy is fundamental to multiscale and inverse thermodynamic problems Coarse-graining errors and numerical optimization using a relative entropy framework Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor The Theory of Ultra-Coarse-Graining. 1. General Principles The Theory of Ultra-Coarse-Graining. 2. Numerical Implementation Insights into the Cooperative Nature of ATP Hydrolysis in Actin Filaments Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase Multiconfigurational Coarse-Grained Molecular Dynamics Coarse-grained simulation reveals key features of HIV-1 capsid selfassembly 2020. TRIM5α self-assembly and compartmentalization of the HIV-1 viral capsid Off-Pathway Assembly: A Broad-Spectrum Mechanism of Action for Drugs That Undermine Controlled HIV-1 Viral Capsid Formation Discovery of a Novel Binding Trench in HIV Integrase Structural basis for modulation of a G-protein-coupled receptor by allosteric drugs Glutamate and Glycine Binding to the NMDA Receptor Energetics of Glutamate Binding to an Ionotropic Glutamate Receptor Neurotransmitter Funneling Optimizes Glutamate Receptor Kinetics Molecular lock regulates binding of glycine to a primitive NMDA receptor Atomic-scale characterization of mature HIV-1 capsid stabilization by inositol hexakisphosphate (IP6) Research. The CG virion model is available at: https://doi.datacite.org/dois/10.34974%2Fq8ya-wh69 and https://github.com/alvinyu33/sars-cov-2-public. The model will be periodically updated with new versions as data is added and the model refined,