key: cord-1018551-sg7xvni5 authors: Chen, Chang; Liu, Haipeng; Zabad, Shadi; Rivera, Nina; Rowin, Emily; Hassan, Maheen; Gomez De Jesus, Stephanie M; Llinás Santos, Paola S; Kravchenko, Karyna; Mikhova, Mariia; Ketterer, Sophia; Shen, Annabel; Shen, Sophia; Navas, Erin; Horan, Bryan; Raudsepp, Jaak; Jeffery, Constance title: MoonProt 3.0: an update of the moonlighting proteins database date: 2020-11-27 journal: Nucleic Acids Res DOI: 10.1093/nar/gkaa1101 sha: 7340a718f4cda639490f14fdf9c0fc8b5aa4d9d8 doc_id: 1018551 cord_uid: sg7xvni5
MoonProt 3.0 (http://moonlightingproteins.org) is an updated open-access database storing expert-curated annotations for moonlighting proteins. Moonlighting proteins have two or more physiologically relevant distinct biochemical or biophysical functions performed by a single polypeptide chain. Here, we describe an expansion of the database since our previous report in the Database Issue of Nucleic Acids Research in 2018. For this release, the number of proteins annotated has been expanded to over 500 proteins and dozens of protein annotations have been updated with additional information, including more structures in the Protein Data Bank, compared with version 2.0. The new entries include more examples from humans, plants and archaea, more proteins involved in disease and proteins with different combinations of functions. More kinds of information about the proteins and the species in which they have multiple functions have been added, including CATH and SCOP classification of structure, known and predicted disorder, predicted transmembrane helices, type of organism, relationship of the protein to disease, and relationship of organism to cause of disease. MoonProt is an online resource of information about moonlighting proteins that is manually curated by experts.
Moonlighting proteins are proteins in which more than one physiologically relevant discrete biochemical or biophysical function is performed by a single polypeptide chain (1-3). Some of the first moonlighting proteins to be identified were the taxon-specific crystallins. These crystallins are metabolic enzymes and chaperones that have been adopted to perform a second, noncatalytic function in the lens or cornea of the eyes of a few species (4). Since then, hundreds of moonlighting proteins with many different functions have been found throughout the evolutionary tree (1-12). Because a large variety of proteins have evolved to perform multiple functions, there is no shared sequence or structural feature that can be used to indicate or predict whether or not a protein is a moonlighting protein. Instead, second functions are often found through serendipity, and information about these proteins is scattered in diverse publications. Our MoonProt Database brings this information together to help researchers learn about proteins with multiple functions and provides a quick method to find out if a protein of interest is a known moonlighting protein or related to a known moonlighting protein (Figure 1). In addition, this collection of information about the sequences, structures and functions of known moonlighting proteins can aid in several current areas of research, including the analysis of protein structure and function, interpreting genome sequences and the results of proteomics studies, elucidating the evolution of [...] (13, 14).
Figure 1. Example of information included in the annotation for human angiotensin-converting enzyme 2 (ACE2), which is the target of the SARS-CoV-2 coronavirus. ACE2 is a moonlighting protein because it is both (1) an enzyme that cleaves angiotensin to produce bioactive peptides and (2) a chaperone that helps in the proper folding and plasma membrane targeting of the B0AT1 amino acid transporter. As for many moonlighting proteins, several names of the protein are included in the entry so that users searching with different names can find it. UniProtKB and PDB accession numbers are included to link to additional resources. The GO terms illustrate the protein's functions, cellular locations, and the processes it is involved in. An Enzyme Commission number provides information about the type of catalytic activity and substrates. The species of organism for which the protein has been shown to have more than one function and the amino acid sequence in FASTA format are provided to clarify which version of a protein has been shown to have two functions. One or more peer-reviewed publications describing experiments that demonstrated the protein functions are also included. For some proteins and organisms, information about connections to disease has been added. For ACE2, both of the protein's functions are involved in diseases in humans. In addition, the ACE2 protein is a 'receptor' used by the SARS-CoV-2 virus, which caused the coronavirus pandemic of 2020, to invade host cells. Binding of a pathogen's protein to a protein on a host cell is not a function that arose during evolution of the host protein but is instead a function of the pathogen's protein. ACE2 still qualifies as a moonlighting protein because it has a catalytic function and a chaperone function.
In this paper, we present MoonProt version 3.0. Since its previous update, the database has grown to include annotation for over 500 proteins, and the information about individual moonlighting proteins has been expanded and updated. For inclusion of a protein in the MoonProt Database, peer-reviewed published biochemical, biophysical, mutagenic or other data to support the presence of multiple physiologically relevant functions were required and were critically reviewed by the PI.
Proteins were not included if the 'multiple functions' described in publications are due to different RNA splice variants, the same function performed in two different cellular or tissue locations, pleiotropic effects on multiple pathways or multiple physiological processes, or a family of proteins in which the different functions are performed by different proteins. Proteins were also not included if the 'multiple functions' are simply different aspects of the same function (e.g. 'membrane protein' and 'transmembrane transporter'). As described for versions 1.0 and 2.0 (13, 14), information about each protein was manually curated from published journal articles and online resources. The entry for each protein includes a description of each function and one or more references for publications providing experimental evidence of each function. The specific organism and protein sequence that correspond to the protein that has two or more functions were identified and included because a homologous protein might or might not have both functions. Amino acid sequences were identified using the NCBI [...] (17) for structures corresponding to the amino acid sequence. In some cases, the structure of the moonlighting protein is not available, so an indication of the presence of a related protein is included with a note about the percent amino acid sequence identity. Gene ontology (GO) terms, which are defined as evidence-based statements relating a specific gene product to a specific ontology term (18), were identified from the UniProtKB (15), and Enzyme Commission (EC) numbers, which are part of a classification scheme for enzymes based on the chemical reactions they catalyze, are included in order to illustrate the different types of protein functions, their cellular locations, and the processes in which they are involved. UniProtKB and Protein Data Bank IDs are included for locating more information in these external resources.
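The contents of an entry described above can be pictured as a simple record. The following Python sketch is only an illustration under stated assumptions: the class name `MoonProtEntry`, its field names and the `is_moonlighting` check are hypothetical and do not reflect the actual MoonProt schema or curation workflow, and the sequence in the example is a toy placeholder rather than a real protein sequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MoonProtEntry:
    """Hypothetical in-memory model of one annotated protein (not the real schema)."""
    names: List[str]          # alternative names, so the entry can be found by any of them
    organism: str             # species in which multiple functions were demonstrated
    sequence: str             # amino acid sequence in one-letter code
    functions: List[str]      # one description per function
    references: List[str]     # publications providing experimental evidence
    uniprot_id: str = ""
    pdb_ids: List[str] = field(default_factory=list)
    go_terms: List[str] = field(default_factory=list)
    ec_number: str = ""

    def is_moonlighting(self) -> bool:
        # Simplified stand-in for expert review: at least two distinct
        # function descriptions and at least one supporting reference.
        return len(set(self.functions)) >= 2 and len(self.references) >= 1

    def to_fasta(self, width: int = 60) -> str:
        # FASTA: a '>' header line followed by the sequence wrapped at `width`.
        header = f">{self.uniprot_id or self.names[0]} {self.organism}"
        chunks = [self.sequence[i:i + width]
                  for i in range(0, len(self.sequence), width)]
        return "\n".join([header] + chunks)


# Toy example loosely modeled on the ACE2 entry; the sequence is a placeholder.
example = MoonProtEntry(
    names=["ACE2", "Angiotensin-converting enzyme 2"],
    organism="Homo sapiens",
    sequence="MKTAYIAKQRQ",
    functions=["enzyme that cleaves angiotensin",
               "chaperone for an amino acid transporter"],
    references=["placeholder reference"],
)
```

Keeping the organism and sequence on the record mirrors the point made in the text: a homologous protein from another species would need its own entry, because it might not have both functions.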
SCOP (19) and CATH (20) classifications of structural domains were retrieved from the corresponding Protein Data Bank entry when available. The four-number classification for CATH includes the Class, Architecture, Topology and Homology information. The Fold information was retrieved from the SCOP classification. Information about known disordered proteins and disordered regions within proteins was retrieved from the DisProt Database of Protein Disorder (21-23), which contains manually curated information about proteins that have been experimentally shown to contain disordered regions. Predicted transmembrane helices were calculated using TMHMM (24). Amino acid sequence analysis for regions of low complexity was performed using the Protein DisOrder prediction System (PrDOS) and IUPred (25, 26). These regions are likely to have higher flexibility and disorder than regions with higher sequence complexity. Information about the type of organism (mammal, plant, etc.) has been added, and, for nonhuman organisms, whether or not the organism is known to cause disease in humans or other animals or plants was also added. Connections of human proteins to disease were retrieved from the Online Mendelian Inheritance in Man database of human genes and genetic disorders (OMIM) (27).
The database is based on MySQL (http://www.mysql.com) for data storage, together with PHP 7.1 (http://www.php.net), HTML (HyperText Markup Language), and CSS (Cascading Style Sheets). A content management system (CMS), WordPress, was used to update the software. On the homepage, a Search link leads to a page with two search options, a text search and a 'BLAST' sequence similarity search. The Search box enables a text search of all the annotated information in the database, and the search returns a list of protein entries containing that term.
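As a rough illustration of how a text search over annotated entries might work, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table name `entries`, its columns and the two toy rows are assumptions made for this example only; the real MoonProt schema and PHP query code are not described in this paper.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'entries' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "id INTEGER PRIMARY KEY, name TEXT, organism TEXT, annotation TEXT)"
    )
    # Toy rows for illustration; not actual database content.
    conn.executemany(
        "INSERT INTO entries (name, organism, annotation) VALUES (?, ?, ?)",
        [
            ("ACE2", "Homo sapiens",
             "enzyme; chaperone for an amino acid transporter"),
            ("toy crystallin", "example species",
             "metabolic enzyme; lens structural protein"),
        ],
    )
    return conn


def text_search(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return names of entries whose text fields contain `term`.

    The term is passed as a bound parameter, so user input is never
    spliced into the SQL string. SQL wildcards in `term` are not escaped.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT name FROM entries "
        "WHERE name LIKE ? OR organism LIKE ? OR annotation LIKE ? "
        "ORDER BY name",
        (pattern, pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
chaperone_hits = text_search(conn, "chaperone")
enzyme_hits = text_search(conn, "enzyme")
```

A substring match with `LIKE` is the simplest choice for a sketch like this; a production system with hundreds of richly annotated entries might instead use a full-text index.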
The BLAST box enables use of the NCBI-blast-2.6.0+ algorithm (Basic Local Alignment Search Tool) (16) to search the database for proteins that share sequence similarity with a query sequence.
Additional proteins and updated annotations. The MoonProt Database version 3.0 is now available at www.moonlightingproteins.org. The database has grown by the addition of over one hundred proteins compared to version 2.0. The database now includes information for over 500 moonlighting proteins for which experimental evidence is available confirming the presence of more than one function, and more entries are in progress. As in versions 1.0 and 2.0, most of the new entries have a catalytic activity as at least one of their functions. The relatively large number of proteins that are enzymes or chaperones inside the cell and have a second function on the cell surface or when secreted to the extracellular media also continues to increase (28). Many of these proteins act as receptors for nutrients or as cell surface adhesins to enable pathogens or probiotic 'good' bacteria to bind to host cells, and others act as secreted signaling molecules to modulate the host immune system. Several dozen proteins have been added that function as transcription or translation factors and have been found to have a second function during mitosis, for example, in binding to the mitotic spindle (12). The number of known or predicted transmembrane proteins was very small in versions 1.0 and 2.0, and has increased in version 3.0. Along with adding more proteins to the database, the annotations for many of the proteins have been updated. As the number of protein structures that are available in the Protein Data Bank grows, more structures of moonlighting proteins have become available, so more PDB identification numbers (PDB IDs) have been added for proteins previously in the MoonProt Database.
For some proteins that were included in previous versions of MoonProt, additional functions and references have also been included. As the number of moonlighting proteins continues to grow, the examples in MoonProt provide information about protein folds, features or other characteristics that can enable a protein to have two or more biochemical or biophysical functions. To aid in research on how new functions evolve or how to design proteins with additional functions, version 3.0 includes more information about known or predicted structural features of the moonlighting proteins. Protein fold information has been included by adding the SCOP (19) and CATH (20) structural classification for many proteins that have structures in the Protein Data Bank. While some protein folds might enable evolution of a second function, regions of flexibility or disorder can also be important in performing multiple functions, so information about proteins experimentally shown to contain disordered regions was retrieved from the DisProt Database (21-23). Previous versions of the MoonProt Database included very few transmembrane proteins, which raised the question of whether soluble globular proteins might be more amenable to evolving additional functions. In preparing version 3.0, an emphasis was placed on finding additional examples in the literature of transmembrane moonlighting proteins. Information about the number and location of predicted transmembrane helices was included based on calculations using TMHMM (24), and this information helps identify the twenty proteins in version 3.0 of the database that contain transmembrane helices.
The MoonProt Database version 3.0 is now available at www.moonlightingproteins.org. The database provides a centralized, organized, searchable, online resource containing information about over 500 moonlighting proteins for which experimental evidence is available for more than one function.
Both the number of proteins and the amount of annotation per protein are continuing to grow as new peer-reviewed publications about moonlighting proteins become available and as new protein structures are solved and deposited in the Protein Data Bank. The wide variety of proteins that moonlight hinders identification of one general characteristic that could be used to identify all moonlighting proteins and all their functions. However, the numbers of proteins within some subsets of moonlighting proteins are increasing. Two examples are proteins that have one function in the cell cytoplasm and another function when displayed on the cell surface (28) and cytoplasmic proteins with a second function in the nucleus, for example as a transcription factor that regulates gene expression (9). Having collections of the sequences and structures for proteins in these subsets might enable identification of sequence or structural motifs or algorithms that can be used to identify additional moonlighting proteins within these subsets. The database also serves as a resource for labs interested in developing computational methods for predicting protein functions based on sequence, structure, protein-protein interactions or other characteristics. Information about X-ray crystal structures available in the Protein Data Bank has been included and updated, but a crystal structure might represent only one structure out of several for a moonlighting protein. Many moonlighting proteins undergo changes in structure when they change functions, including changes in conformation, tertiary fold, and/or multimeric assembly. For example, some proteins have one function while part of the ribosome or another multiprotein complex and a different function as a monomer or homomultimer.
A small number of proteins, called metamorphic and morpheein proteins, can undergo changes in the tertiary folds of protein subunits, and, in some cases, the change in tertiary fold is correlated with a change in function (29, 30). Some information about the different structures correlating with different functions has been included in the MoonProt Database, but adding the information is still in progress, and the structure performing each function is not always known. Nevertheless, having a database of proteins that change function could be a starting point for identifying additional proteins that undergo significant changes in structure. Changes in the structures of moonlighting proteins that correlate with changes in function could be valuable leads for the development of bioinspired nanoswitches and nanomachines.
We also note that updating the MoonProt Database has been one method by which the Jeffery lab has continued its research activities during the coronavirus pandemic of 2020. While other research opportunities for undergraduates and high school students were not available during social distancing, we developed opportunities for three UIC undergraduate students, two Summer Research Opportunity Program undergraduate students in Puerto Rico, two additional undergraduate volunteers from within the USA or internationally (Ukraine) and four high school students to be involved in lab activities and learn about protein sequence and structural analysis.
The MoonProt Database is freely available via a user-friendly graphical user interface (GUI) at the web address www.moonlightingproteins.org. The interface enables a text search for a protein name, species, or a UniProtKB or PDB identifier and a BLAST search using an amino acid sequence in the one-letter code. The user can also browse a list of all the proteins in the database.
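For readers who want to run a comparable sequence similarity search locally, the following Python sketch checks a query given in the one-letter code and assembles a `blastp` command line for the NCBI BLAST+ suite. It is a sketch under assumptions: the function names, the file name `query.fasta` and the database name `moonprot_db` are hypothetical, and restricting the query to the 20 standard residues is a simplification (BLAST itself also accepts ambiguity codes). The command is only built here, not executed, and the web interface's actual parameters are not stated in this paper.

```python
from typing import List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_query(sequence: str) -> str:
    """Remove whitespace, uppercase, and reject non-standard residue letters."""
    seq = "".join(sequence.split()).upper()
    bad = sorted(set(seq) - AMINO_ACIDS)
    if not seq or bad:
        raise ValueError(f"invalid query sequence; unexpected characters: {bad}")
    return seq


def blastp_command(query_fasta: str, db: str, evalue: float = 1e-5) -> List[str]:
    """Build an argument list for BLAST+ blastp with tabular output.

    `db` is the name of a protein database previously built with
    makeblastdb; the name used in the example below is hypothetical.
    """
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db,
        "-evalue", str(evalue),
        "-outfmt", "6",   # tab-separated hit table
    ]


query = clean_query("mkta yiak\nqrq")
command = blastp_command("query.fasta", "moonprot_db")
```

The argument list can be passed to `subprocess.run(command, capture_output=True, text=True)` on a machine where BLAST+ is installed and the database has been built.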
The database is 'read and search only' by the public, but additional information about the known moonlighting proteins and suggestions of other proteins that might also be moonlighting are welcome and can be sent to the curators for possible inclusion in the database.
Moonlighting proteins
Moonlighting proteins: old proteins learning new tricks
Moonlighting proteins-nature's Swiss army knives
Recruitment of enzymes as lens structural proteins
Essential nontranslational functions of tRNA synthetases
Bacterial virulence in the moonlight: multitasking bacterial moonlighting proteins are virulence determinants in infectious disease
MultitaskProtDB: a database of multitasking proteins
Moonlighting proteins in yeasts. Microbiol
Trigger enzymes: bifunctional proteins active in metabolism and in controlling gene expression
Extraribosomal functions of ribosomal proteins
Moonlighting in mitosis: analysis of the mitotic functions of transcription and splicing factors
MoonProt: a database for proteins that are known to moonlight
MoonProt 2.0: an expansion and update of the moonlighting proteins database
UniProt: the universal protein knowledgebase
BLAST: at the core of a powerful and diverse set of sequence analysis tools
The RCSB protein data bank: integrative view of protein, gene and 3D structural information
SCOP: a structural classification of proteins database for the investigation of sequences and structures
The CATH database: an extended protein family resource for structural and functional genomics
Natively unfolded proteins: a point where biology waits for physics
Intrinsically unstructured proteins
Intrinsic disorder and protein function
A hidden Markov model for predicting transmembrane helices in protein sequences
PrDOS: prediction of disordered protein regions from amino acid sequence
IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content
OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders
Physical features of intracellular proteins that moonlight on the cell surface
Morpheeins-a new structural paradigm for allosteric regulation
Unfolding the mysteries of protein metamorphosis