key: cord-283491-y6t64pux
authors: Brzezinski, Dariusz; Kowiel, Marcin; Cooper, David R.; Cymborowski, Marcin; Grabowski, Marek; Wlodawer, Alexander; Dauter, Zbigniew; Shabalin, Ivan G.; Gilski, Miroslaw; Rupp, Bernhard; Jaskolski, Mariusz; Minor, Wladek
title: Covid‐19.bioreproducibility.org: A web resource for SARS‐CoV‐2‐related structural models
date: 2020-09-27
journal: Protein Sci
DOI: 10.1002/pro.3959
sha:
doc_id: 283491
cord_uid: y6t64pux

The COVID‐19 pandemic has triggered numerous scientific activities aimed at understanding the SARS‐CoV‐2 virus and ultimately developing treatments. Structural biologists have already determined hundreds of experimental X‐ray, cryo‐EM, and NMR structures of proteins and nucleic acids related to this coronavirus, and this number is still growing. To help biomedical researchers, who may not necessarily be experts in structural biology, navigate through the flood of structural models, we have created an online resource, covid-19.bioreproducibility.org, that aggregates expert‐verified information about SARS‐CoV‐2‐related macromolecular models. In this paper, we describe this web resource along with the suite of tools and methodologies used for assessing the structures presented therein.

This article is protected by copyright. All rights reserved.

The spread of the novel coronavirus around the world has triggered an unprecedented response from the scientific community. Six months into the pandemic, PubMed already listed over 23,000 scientific papers with the terms COVID-19 or SARS-CoV-2 in the title, and tens of analyses are reported daily in mass media around the globe. Understandably, first-line research findings, including molecular structure determinations, depositions in the Protein Data Bank (PDB), 1 and related results, are often made public on bioRxiv 2 or medRxiv 3 before formal peer review.
This approach delivers the latest results without delay to scientists who develop treatments and vaccines, but at the cost of an elevated risk of mistakes and errors, which can mislead scientists performing follow-up research and misinform the general public. The World Health Organization has even coined the portmanteau 'infodemic' to describe the phenomenon of potentially misleading information overload. 4

As of July 27, 2020, the PDB has amassed 298 structural models of SARS-CoV-2-related macromolecules, including proteins and RNA fragments. Structure-based drug design depends on such molecular models, especially of complexes with candidate drugs slated for further development. However, the rapidly growing number of structures without corresponding publications and the potential mistakes associated with pandemic-driven research can create confusion among biomedical researchers and could impede, rather than accelerate, drug development. Indeed, an analysis of the "entry history" of structures deposited to the PDB between January 24 and July 27, 2020 showed that as many as 56 out of the 182 (30.8%) SARS-CoV-2 structures (excluding PanDDA 5 fragment screening deposits) required a major revision of the initial model, whereas only 360 of the other 6,328 (5.9%) structures deposited during that time period had any major revisions. For 15 of the SARS-CoV-2 structures, the revisions were significant and involved replacement of the atomic coordinates. Some of these revisions were triggered by our resource. 6 The high revision rate may in part be due to the celerity of the research, and in part to this and similar projects that requested original diffraction data, which prompted the authors of these structures to revisit their models.
An additional factor that may impede the use of molecular structures in biomedical research is that they are sometimes presented in a way geared toward modeling and theoretical chemistry rather than toward biomedical scientists who are not necessarily experts in protein crystallography. In this paper, we present covid-19.bioreproducibility.org, a web resource that organizes SARS-CoV-2-related structural information in a way that should be understandable and useful for a wider scientific community, and not only for structural biologists. The website also serves as a repository for examined and, if found to be suboptimal, corrected versions of PDB structures of SARS-CoV-2 proteins and RNA fragments, with a focus on assessing the small-molecule ligands modeled in those structures. Moreover, we strive to re-deposit the optimized structure models in the PDB, always in collaboration with the original authors. The validation tools and re-refinement protocols used in this project can serve as a template for future molecular structure assessment efforts.

Because the pandemic demanded a rapid response, the covid-19.bioreproducibility.org web resource was created in an agile, fast-prototyping manner, focusing on speedy delivery and flexibility to accommodate changes. As a result, several new features are still being bound. Ligands that may affect the protein function are called "functional ligands" (as opposed, e.g., to ligands that are artifacts of the protein purification or crystallization process and for which there are no indications that their binding may affect the way the protein functions). Users can also quickly filter for structures with or without functional ligands, protein-protein complexes, pathogen-host interactions, and fragment screening results. The website also allows text searches [ Fig. 1 Crystallography 11 or, if significant improvement might be expected, contact the authors with the request to submit the diffraction images.
Moreover, we extract quality metrics related to structures by querying a locally installed copy of the PDBj's VRPT database schema. After the data have been automatically gathered from the above-mentioned sources, they are processed by geometry checking, statistical, and validation tools, most of which were developed in-house by the laboratories collaborating on this project. Finally, the structures are evaluated by a team of expert structural biologists who use a combination of the mined data, validation reports, and manual inspection of the protein models and associated electron density to examine potential problems. Careful attention is paid to all functional ligands and inhibitors contained in the structures. If potential problems are spotted, the diffraction data are re-processed (whenever the raw data are available) and the models are re-refined. The corrected models are made publicly available on our webserver. In addition, we always attempt to contact the original authors and encourage them to jointly re-deposit the optimized models to the PDB. The details of the application of our structure evaluation tools and our structure correction protocol are discussed in the following sections.

To make an informed decision whether a structure should be re-refined or not, we use several criteria and tools to assess its quality. We check the overall geometry (Ramachandran outliers and rotamer outliers), the correlation between the model and the electron density map (especially for ligands), the presence of large peaks in the difference electron density map, the placement of the macromolecular model in the unit cell, and whether R and Rfree are reasonable for the reported resolution. Additionally, to provide a simple quantitative overview of the quality of X-ray structures, we calculate and show on our website the PQ1 metric.
7 PQ1 is the structure's quality percentile (from 0 to 100, the higher the better) based on Rfree, RSRZ score, Clashscore, Ramachandran outliers, and rotamer outliers. Being a hybrid reciprocal-space and real-space global metric, PQ1 can be easily used to sort structures and compare their overall quality. The PQ1 metric is recalculated weekly for each structure.

Equipped with the above-mentioned validation tools, expert structural biologists may decide to manually inspect each structure in Coot. 12 If diffraction data are available, the potential gains of their manual re-processing are analyzed. Using the calculated electron density maps, the main chain and side chains can be easily reviewed with Coot or Molstack. 13 Special emphasis is put on unmodeled electron density blobs. Based on such a review, a decision whether to re-refine the structure is made. Full re-refinement is a laborious process and sometimes requires contact with the deposition authors. In the case of deposits that do not have primary citations, identifying the principal investigator (PI) is not always an easy task. For that reason, the Commission on Biological Macromolecules of the International Union of Crystallography (IUCr), together with the IUCr Committee on Data, have asked the PDB to publicly disclose the e-mail address of the PI (or depositing author) of each deposit.

Once the decision to re-refine has been made, we use the following protocol to improve the model. Some aspects of this protocol are general in nature, and the exact values may be changed for a particular structure.
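As an aside on the PQ1 percentile described above, the idea of a percentile-based composite can be illustrated with a short Python sketch. This is not the published PQ1 formula (reference 7 defines that). It assumes, purely for illustration, that each of the five metrics is converted to a percentile rank within the population, that the ranks are averaged with equal weights, and that the average is then ranked again. The metric keys and the functions `percentile_rank` and `pq1_like` are hypothetical names.

```python
# Illustrative sketch only: an equal-weight percentile composite in the spirit
# of PQ1. The weighting and the metric key names are assumptions, not the
# published definition.

# All five metrics are "lower is better".
METRICS = ("rfree", "rsrz_outliers", "clashscore",
           "rama_outliers", "rotamer_outliers")


def percentile_rank(value, population):
    """Percentage of the population with a strictly worse (higher) value."""
    worse = sum(1 for other in population if other > value)
    return 100.0 * worse / len(population)


def pq1_like(structures):
    """Map each structure ID to a 0-100 composite percentile (higher is better).

    `structures` maps an ID to a dict holding every key listed in METRICS.
    """
    # Step 1: average the per-metric percentile ranks for each structure.
    composite = {}
    for sid, metrics in structures.items():
        ranks = [
            percentile_rank(metrics[name],
                            [other[name] for other in structures.values()])
            for name in METRICS
        ]
        composite[sid] = sum(ranks) / len(ranks)

    # Step 2: rank the composite itself (higher composite is better).
    values = list(composite.values())
    return {
        sid: 100.0 * sum(1 for v in values if v < score) / len(values)
        for sid, score in composite.items()
    }


if __name__ == "__main__":
    demo = {
        "A": dict(rfree=0.19, rsrz_outliers=1.0, clashscore=2.0,
                  rama_outliers=0.0, rotamer_outliers=1.0),
        "B": dict(rfree=0.24, rsrz_outliers=4.0, clashscore=6.0,
                  rama_outliers=0.5, rotamer_outliers=3.0),
        "C": dict(rfree=0.29, rsrz_outliers=9.0, clashscore=15.0,
                  rama_outliers=2.0, rotamer_outliers=8.0),
    }
    for sid, score in sorted(pq1_like(demo).items()):
        print(sid, round(score, 1))
```

Because the result is a rank within a population, the score of a given structure changes as new structures are added, which is consistent with the weekly recalculation mentioned above.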
The protocol and decisions made for each structure are based on our extensive experience in protein structure determination, 14-16 crystallographic software development, 17, 18 published guidelines on structure refinement and structure quality, 7, 19, 20 and previous campaigns of PDB structure re-refinement: 21, 22

If raw diffraction data are available, the results of automatic processing of images by HKL-3000auto are examined to verify that the structure was determined in the correct space group and at optimal resolution. In case of inconsistent results, we use the HKL-3000 program suite with the implementation of corrections for X-ray absorption, radiation decay, and anisotropic diffraction. 18

Each structure under inspection is placed in a standardized way in the reference unit cell using ACHESYM. 28 Even though crystal structures can be presented with the molecular models located in various crystallographically equivalent locations, we seek to facilitate comparisons of analogous structures for non-crystallographers by placing the models as close to the origin of the unit cell as possible. 29 The ACHESYM server 28 takes into account the equivalence of the space group symmetry positions and adjusts the location of the model in the unit cell. As a result, the atomic coordinates and electron density maps of the re-refined versions of isomorphous structures, i.e., structures in the same space group and with differences in cell parameters a/b/c within 1.5% and cell angles within 5%, are standardized to the same location. This means that the macromolecules occupy similar positions in their corresponding unit cells and both the coordinates and electron density maps of isomorphous structures can be easily viewed already superposed using any current computer.

The models are subjected to restrained maximum posterior refinement in REFMAC 30 with hydrogen atoms added in riding positions.
For all standard protein residues, the REFMAC dictionary is used as a source of ideal stereochemical targets. 31

After each round of REFMAC refinement, the atomic model is manually inspected and corrected according to the following checklist:
a. Review unmodeled electron density blobs, which might represent ligands or residues missing from the polymer model.
b. Inspect all difference electron density peaks above 4.0 rmsd (5.0 rmsd if there are too many peaks). Inspect the strongest negative density peaks.
c. Inspect rotamer outliers, which may indicate incorrect placement of side chains, as well as residues with missing atoms.
d. Review density fit graphs and inspect poorly fitting residues; verify terminal residue placement; and inspect any gaps in the sequence.
e. Inspect Ramachandran outliers.
f. Once large electron density blobs have been modeled and major issues with the protein backbone have been addressed, look for potential water molecules to add, with peaks above 1.1 rmsd in the 2mFo-DFc map and distances to protein H-bonded atoms ranging from 2.4 to 4.0 Å.

If unmodeled electron density blobs are found during manual corrections, they are considered as potential ligands. In such cases, we try to identify the ligand with the help of CheckMyBlob, 38 fit the ligand in the density, and run no fewer than 10 REFMAC cycles. If the ligand does not have a proper stereochemical description in the standard REFMAC dictionary, new geometrical restraints are generated using the Grade Web Server 39 and carefully checked before use. Since ligands originally modeled in the deposition may be incorrect, 22 they are inspected visually and, if questionable, validated using CheckMyBlob. 38 Similarly, it has been shown that a significant fraction of metal-containing structures in the PDB have incorrect metal assignment or modeling.
40 Therefore, special attention is given to metal identification by the CheckMyMetal validation server 41 and, when possible, using anomalous maps calculated with data collected above and below the X-ray absorption edge. 42 Challenging cases are discussed by at least two team members. Many structures are inspected by at least one other expert after the refinement has been completed. The revised structures are stored in the web resource described here, along with a description of the identified issues and changes made. However, if the changes are significant, the goal is to redeposit the re-refined structure in the PDB, preferably together with the original authors, using the mechanism of re-versioning. An example report for a structure re-processed from original data and re-refined according to the above protocol is presented in Figure 3 and in Figure S1. A report showing a case when the original diffraction data were not available is presented in Figures S2 and S3.

The goal of the covid-19.bioreproducibility.org web resource is to gather macromolecular structures related to the SARS-CoV-2 virus and assess them using state-of-the-art tools. Additionally, we aim to provide information that can be easily used by researchers who are not structural biologists. That is why the structures are categorized according to the experimental method, virus type, protein type, and ligand category. We also attempt to facilitate quick overall structure assessment for general users by calculating aggregated quality metrics, such as the PQ1 percentile. Finally, we make sure that isomorphous structures solved in the same space group can be easily compared, by moving them into a standardized location of the reference unit cell. Although non-uniform model placement in the unit cell may not seem to be a serious issue for trained crystallographers, for many biomedical researchers it makes comparison harder and structures may appear to be completely different, leading to confusion and misinterpretations.
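The isomorphism criterion used for standardized placement (same space group, cell lengths a/b/c within 1.5%, cell angles within 5%) can be written as a small check. The Python sketch below is a minimal illustration under stated assumptions: the differences are measured relative to the reference cell, the space group is compared as a plain string, and alternative axis settings or permutations are not considered. The `UnitCell` class and the `is_isomorphous` function are hypothetical names, not part of ACHESYM or any other tool mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCell:
    """Space group symbol plus cell lengths (Å) and angles (degrees)."""
    space_group: str
    a: float
    b: float
    c: float
    alpha: float
    beta: float
    gamma: float


def _within(reference, other, tolerance):
    """True if `other` differs from `reference` by at most `tolerance` (relative)."""
    return abs(other - reference) <= tolerance * abs(reference)


def is_isomorphous(reference, other, length_tol=0.015, angle_tol=0.05):
    """Apply the thresholds quoted in the text: 1.5% for a/b/c, 5% for angles.

    Assumption: differences are taken relative to the reference cell.
    """
    if reference.space_group != other.space_group:
        return False
    lengths_ok = all(
        _within(getattr(reference, name), getattr(other, name), length_tol)
        for name in ("a", "b", "c")
    )
    angles_ok = all(
        _within(getattr(reference, name), getattr(other, name), angle_tol)
        for name in ("alpha", "beta", "gamma")
    )
    return lengths_ok and angles_ok


if __name__ == "__main__":
    # Made-up cell parameters, for demonstration only.
    ref = UnitCell("C 1 2 1", 113.0, 53.0, 44.0, 90.0, 102.0, 90.0)
    similar = UnitCell("C 1 2 1", 114.0, 53.5, 44.5, 90.0, 101.5, 90.0)
    different = UnitCell("C 1 2 1", 116.0, 53.0, 44.0, 90.0, 102.0, 90.0)
    print(is_isomorphous(ref, similar))    # all differences inside the thresholds
    print(is_isomorphous(ref, different))  # a differs by about 2.7%
```

Structures that pass such a check can be moved to the same reference location, so that their coordinates and maps open already superposed.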
During the work on the server, we made several disturbing observations. First, in several cases, the deposited images were clearly not compatible with the diffraction data used for structure refinement. Second, some of the contacted scientists claimed that the diffraction data were deleted immediately after processing in order to save disk space. Third, several scientists did not respond to our request to provide their data, despite the IUCr recommendation 43 and an appeal from the community to make diffraction data related to SARS-CoV-2 public (http://phenix-online.org/pipermail/phenixbb/2020-March/024556). However, in one case 6 our request resulted in the original authors re-depositing an optimized structure instead of depositing the diffraction data. All of the above facts show that the struggle for reproducibility of scientific results is an uphill battle, and suggest that leading scientific journals should do more 44, 45 than run editorials about the need to improve the reproducibility of scientific results.

It is worth noting that the described web resource is not the only project established with the aim of validating, correcting, or providing additional information on COVID-19

With vaccines in late-stage development 46, 47 and the first reports of drugs increasing survival chances, 48 the COVID-19 pandemic will hopefully end soon. However, this may not necessarily be the end of the SARS-CoV-2 coronavirus, as it may evolve in yet unforeseen ways to evade vaccines and treatments. Therefore, we will keep improving the web resource presented herein, with the hope that it will remain useful to biologists for years to come and that it will set standards for any future health crises.
This structure will be re-deposited to the PDB under a new PDB ID, due to significant changes

RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education
bioRxiv: the preprint server for biology
New preprint server for medical research
Coronavirus SARS-CoV-2: filtering fact from fiction in the infodemic
A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density
Ligand-centered assessment of SARS-CoV-2 drug target models in the Protein Data Bank
On the evolution of the quality of macromolecular models in the PDB
DrugBank 5.0: a major update to the DrugBank database
Data structures for statistical computing in Python
PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan
A public database of macromolecular diffraction experiments
Current developments in Coot for macromolecular model building of electron cryo-microscopy and crystallographic data
Molstack: A platform for interactive presentations of electron density and cryo-EM maps and their interpretations
Structural, biochemical, and evolutionary characterizations of glyoxylate/hydroxypyruvate reductases show their division into two distinct subfamilies
Albumin-based transport of nonsteroidal anti-inflammatory drugs in mammalian blood plasma
Biomolecular Crystallography: Principles, Practice, and Application to Structural Biology
Conformation-dependent restraints for polynucleotides: the sugar moiety
HKL-3000: the integration of data reduction and structure solution - from diffraction images to an initial model in minutes
Refining the macromolecular model - achieving the best agreement with the data from X-ray diffraction experiment
Assessment of crystallographic structure quality and protein-ligand complex structure validation
A close look onto structural models and primary ligands of metallo-β-lactamases
Detect, correct, retract: How to manage incorrect structural models
Processing of X-ray diffraction data collected in oscillation mode
Diffraction data analysis in the presence of radiation damage
Weak data do not make a free lunch, only a cheap meal
How good are my data and what is the resolution?
Linking crystallographic model and data quality
ACHESYM: an algorithm and server for standardized placement of macromolecular models in the unit cell
On optimal placement of molecules in the unit cell
Overview of refinement procedures within REFMAC5: utilizing data from different sources
Accurate bond and angle parameters for X-ray protein structure refinement
Conformation-dependent restraints for polynucleotides: I. Clustering of the geometry of the phosphodiester group
Accurate geometrical restraints for Watson-Crick base pairs
TLSMD web server for the generation of multi-group TLS models
Significance tests on the crystallographic R factor
Features and development of Coot
MolProbity: More and better reference data for improved all-atom structure validation
Automatic recognition of ligands in electron density by machine learning
Magnesium-binding architectures in RNA crystal structures: validation, binding preferences, classification and motif detection
CheckMyMetal: a macromolecular metal-binding validation tool
Characterizing metal-binding sites in proteins with X-ray crystallography
Findable accessible Interoperable Re-usable (FAIR) diffraction data are coming to protein crystallography
No raw data, no science: another possible source of the reproducibility crisis
Faculty Opinions Recommendation of [Miyakawa T
Safety and immunogenicity of the ChAdOx1 nCoV-19 vaccine against SARS-CoV-2: a preliminary report of a phase 1/2, single-blind
Immunogenicity and safety of a recombinant adenovirus type-5-vectored COVID-19 vaccine in healthy adults aged 18 years or older: a randomised, double-blind
Effect of dexamethasone in hospitalized patients with COVID-19: Preliminary report

One of the authors (WM) notes that he has also been involved in the development of software and data management and mining tools; some of them were commercialized by HKL Research and are mentioned in the paper. WM is the co-founder of HKL Research and a member of the board. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.