key: cord-0661984-rp0hxc0k
authors: Wei, Junkang; Xiao, Jin; Chen, Siyuan; Zong, Licheng; Gao, Xin; Li, Yu
title: ProNet DB: A proteome-wise database for protein surface property representations and RNA-binding profiles
date: 2022-05-16
journal: nan
DOI: nan
sha: 7bd7a39fcefb3e905235169f5034593fbdd48eb6
doc_id: 661984
cord_uid: rp0hxc0k

The rapid growth in the number of experimental and predicted protein structures, together with their increasing complexity, makes it challenging for computational biologists to exploit structural information and protein surface property representations. AlphaFold2 has recently released comprehensive proteomes for various species, and protein surface property representation plays a crucial role in predicting protein-molecule interactions such as protein-protein, protein-nucleic acid, and protein-compound interactions. Here, we propose the first comprehensive database, ProNet DB, which incorporates multiple protein surface representations and the RNA-binding landscape for more than 33,000 protein structures, covering proteomes from the AlphaFold Protein Structure Database (AlphaFold DB) and experimentally validated protein structures deposited in the Protein Data Bank (PDB). For each protein, we provide the original protein structure, surface property representations including hydrophobicity, charge distribution, hydrogen bond, and interacting face, and the RNA-binding landscape, namely RNA binding sites and RNA binding preference. To make the protein surface property representation and RNA binding landscape intuitive to interpret, we also integrate Mol* and Online 3D Viewer to visualize the representation on the protein surface. The pre-computed features are immediately available to users, and potential applications include molecular mechanism exploration, drug discovery, and novel therapeutics development. The server is now available at https://proj.cse.cuhk.edu.hk/pronet/ and future releases will expand the species and property coverage.

Proteins perform vital functions in a variety of cellular activities, and protein-molecule interactions underlie the complexity of organisms, for example in gene expression regulation Weirauch et al. [2013], signal transduction Alipanahi et al. [2015], and drug therapy Lu et al. [2020]. However, the detailed mechanisms of most protein-molecule interactions have not been well characterized, which hinders mechanism exploration and drug discovery. When a protein interacts with a molecule, the molecule tends to recognize properties of the protein surface, such as hydrophobicity, charge distribution, hydrogen/electron donors, and steric hindrance at the binding site. Thus, a comprehensive and efficient representation of the protein surface is essential to elucidate the mechanism of protein-molecule interaction. For example, Rudden and Degiacomi [2019] used a single volumetric descriptor representing the protein surface, including electrostatics and local dynamics, for protein docking and achieved an average success rate of 54%. Experimental assessments of protein surface properties, such as NMR-based measurement Almeida et al. [2021] and hydrophobic interaction chromatography (HIC) Lienqueo et al. [2007], are time-consuming and costly. Moreover, with the advent of the AlphaFold Protein Structure Database Varadi et al. [2022], a large number of protein structures are now determined by computational prediction, and traditional experimental approaches cannot handle surface property evaluation at this scale.

To overcome the limitations of experimental approaches, several in silico methods for computing protein surface properties have been proposed, such as MaSIF Gainza et al. [2020], FEATURE Halperin et al. [2008], and AutoDock Huey et al. [2012]. For example, AutoDock Huey et al. [2012] calculates atom-wise biochemical properties, and FEATURE Halperin et al. [2008] uses a series of concentric shells to describe the atoms of the protein within 7.5 Å of a grid point with 80 physicochemical properties.

MaSIF Gainza et al. [2020] encodes geometric features (shape index and distance-dependent curvature) and chemical features (hydropathy, continuum electrostatics, and free electrons/protons) on the surface within a geodesic radius of 9 Å or 12 Å. Although these tools are available for downstream applications, they are not ready to use: they require a complex running environment and long running times. It is also inefficient for each user to run them locally on the same protein, which leads to repetitive work. In principle, for a fixed protein structure, a given tool should always produce the same surface representation. We therefore built a database by running MaSIF to encode protein surface physicochemical properties, including hydrophobicity, charge distribution, hydrogen bond, and interacting face, for protein structures from the experimentally validated database (PDB) and the in silico database (AlphaFold DB), so that users can directly use these features in their downstream applications.

Like the physicochemical properties, the RNA binding landscape is an important part of the surface property. Direct recognition of RNA motifs by RNA-binding proteins (RBPs) can provide information about protein-nucleic acid interaction Wei et al. [2022]. For example, the Pumilio/FBF (PUF) family can govern translation through direct base-protein recognition of motifs such as UGUR on RNA transcripts Quenault et al. [2011]. Thus, the RNA binding profiles of RBPs are an important part of characterizing protein-molecule interaction. In this study, we employed the state-of-the-art deep-learning framework NucleicNet Lam et al. [2019] to predict the binding preference of RNA constituents and the binding sites on the protein surface, providing the RNA-binding landscape for protein structures from the experimentally validated database (PDB) and the in silico database (AlphaFold DB). Although the dataset is based on prediction, we are the first to provide such a ready-to-use database for downstream applications, such as CRISPR-Cas system optimization Tycko et al. [2016], RBP-targeting therapeutics discovery Gebauer et al. [2021], and aptamer-guided drug delivery system development Alshaer et al. [2018].

In summary, we propose a comprehensive database of protein surface features, ProNet DB, which contains protein surface physicochemical representations and the RNA-binding landscape for more than 33,000 protein structures covering the proteomes of Homo sapiens and Saccharomyces cerevisiae from AlphaFold DB and PDB. For each protein, we provide the original protein structure, surface property representations including hydrophobicity, charge distribution, hydrogen bond, and interacting face, and the RNA-binding landscape, namely RNA binding sites and RNA binding preference. To make the protein surface property representation and RNA binding landscape intuitive to interpret, we also integrate Mol* and Online 3D Viewer to visualize the representation on the protein surface. The server can now be accessed at https://proj.cse.cuhk.edu.hk/pronet/ and future releases will expand the species and property coverage.

We first collected 19,987 protein structures for the Homo sapiens proteome and 6,042 protein structures for the Saccharomyces cerevisiae proteome from AlphaFold DB Varadi et al. [2022]. Where corresponding experimentally validated protein structures exist in PDB Burley et al. [2017], we collected the structure with the highest resolution (Homo sapiens: 6,030, Saccharomyces cerevisiae: 1,160).
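The selection rule above (one experimental structure per protein, chosen by best resolution) can be illustrated with a minimal sketch. The record type `PdbEntry`, the function `pick_best_structures`, and all identifiers in the toy data are hypothetical and are not part of the ProNet DB pipeline; the sketch also assumes that "highest resolution" means the smallest resolution value in ångströms and that entries without a reported resolution are skipped.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PdbEntry:
    """Hypothetical record for one experimental structure mapped to a protein."""
    protein_id: str
    pdb_id: str
    resolution: Optional[float]  # in angstroms; None when no resolution is reported


def pick_best_structures(entries: Iterable[PdbEntry]) -> Dict[str, PdbEntry]:
    """Keep, for each protein, the entry with the smallest resolution value."""
    best: Dict[str, PdbEntry] = {}
    for entry in entries:
        if entry.resolution is None:
            continue  # assumption: entries without a resolution are skipped
        current = best.get(entry.protein_id)
        if current is None or entry.resolution < current.resolution:
            best[entry.protein_id] = entry
    return best


# Toy data with placeholder identifiers (not real accessions).
candidates = [
    PdbEntry("PROT_A", "XXX1", 2.8),
    PdbEntry("PROT_A", "XXX2", 1.9),
    PdbEntry("PROT_B", "XXX3", None),
    PdbEntry("PROT_B", "XXX4", 3.1),
]
selected = pick_best_structures(candidates)
for protein_id, entry in selected.items():
    print(protein_id, entry.pdb_id, entry.resolution)
```

How ties and structures without a resolution value were actually handled is not stated in the text, so both choices here are illustrative only.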

MaSIF is a general framework for encoding protein surface fingerprints Gainza et al. [2020]. For each protein, it generates a discretized molecular surface and assigns calculated physicochemical features to every vertex of that surface. In this way, the properties of the protein surface can be clearly represented. As shown in Figure 1, the user can determine which parts of the surface are hydrophilic or hydrophobic, and which parts are more likely to interact with other molecules (interacting face). We computed the surface properties with the MaSIF tool for the proteins in our database so that users can efficiently obtain the physicochemical property profile of every protein. These computed features can benefit many downstream tasks, including binding site prediction Miotto et al. [2021], protein-protein interaction prediction Gaudelet et al. [2021], and protein design.
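Because every feature is attached to a surface vertex, a region-level summary reduces to counting vertices. The following minimal sketch shows one way such proportions could be computed from per-vertex values; the function name, the toy values, and both cutoffs (0.0 for hydropathy, 0.5 for the interacting-face score) are assumptions made for illustration and are not taken from MaSIF or ProNet DB.

```python
from typing import Callable, Sequence


def fraction_where(values: Sequence[float], predicate: Callable[[float], bool]) -> float:
    """Fraction of surface vertices whose feature value satisfies `predicate`."""
    if not values:
        raise ValueError("surface has no vertices")
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical per-vertex features for a five-vertex toy surface.
hydropathy = [0.8, -0.3, 0.1, -0.9, 0.5]   # > 0.0 treated as hydrophobic (assumed cutoff)
iface_score = [0.9, 0.2, 0.7, 0.1, 0.4]    # > 0.5 treated as interacting face (assumed cutoff)

hydrophobic_fraction = fraction_where(hydropathy, lambda v: v > 0.0)
iface_fraction = fraction_where(iface_score, lambda v: v > 0.5)
print(f"hydrophobic: {hydrophobic_fraction:.2f}, interacting face: {iface_fraction:.2f}")
```

Passing the cutoff as a predicate keeps the same helper usable for charge or hydrogen-bond features, whose sign conventions differ.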

Because protein-RNA interaction is involved in multiple cellular activities, the interaction between RNAs and RBPs is important for understanding those activities. ProNet DB provides systematic mappings of the RNA-protein interaction for multiple RNA constituents. Following the deep-learning framework proposed by NucleicNet Lam et al. [2019], we obtained the binding preference and the binding sites for six RNA constituents, namely ribose (R), phosphate (P), adenine (A), guanine (G), cytosine (C), and uracil (U), for protein structures in AlphaFold DB and PDB. The protein-RNA binding profiles are further classified into multiple sub-classes for each species based on protein function. In ProNet DB, users can examine protein properties such as RNA backbone composition and the binding preference for different bases.
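To make the notion of a binding preference over the six constituents concrete, the sketch below summarizes per-site scores into a profile by taking the top-scoring constituent at each site. The input layout (one score dictionary per predicted binding site), the function names, and the toy scores are assumptions for illustration; this is not the NucleicNet output format.

```python
from typing import Dict, List

CLASSES = ("R", "P", "A", "G", "C", "U")  # the six constituents named in the text


def preferred_constituent(scores: Dict[str, float]) -> str:
    """Return the constituent with the highest score at one surface location."""
    missing = set(CLASSES) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return max(CLASSES, key=lambda c: scores[c])


def constituent_profile(sites: List[Dict[str, float]]) -> Dict[str, int]:
    """Count how often each constituent is the top-scoring one across sites."""
    counts = {c: 0 for c in CLASSES}
    for scores in sites:
        counts[preferred_constituent(scores)] += 1
    return counts


# Toy scores for three hypothetical binding sites.
sites = [
    {"R": 0.10, "P": 0.60, "A": 0.10, "G": 0.10, "C": 0.05, "U": 0.05},
    {"R": 0.10, "P": 0.10, "A": 0.10, "G": 0.20, "C": 0.10, "U": 0.40},
    {"R": 0.05, "P": 0.05, "A": 0.10, "G": 0.10, "C": 0.10, "U": 0.60},
]
profile = constituent_profile(sites)
print(profile)
```

Separating the backbone counts (R, P) from the base counts (A, G, C, U) in such a profile gives the backbone composition and base preference mentioned above.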

Multi-scale data analyses of the 33,000 protein structures were carried out across diverse protein categories. The homepage integrates all the server tools and is divided into three major components: prediction tools, database queries, and visualization tools. An overview and an interactive table present the protein name, PDB ID, UniProt ID, protein type, interacting face proportion, Hbond region proportion, positive/negative charge region proportion, and protein-RNA binding profiles (see Table 1). The web interface allows users to search for the property profiles of a user-specified protein type in multiple species. The homepage is shown in the top-right part of Figure 2.

Similar to AlphaFold DB Varadi et al. [2022], ProNet DB was constructed with standard database development techniques, making protein surface physicochemical properties and RNA-binding profiles available to the scientific community. In the first version, ProNet DB includes two primary species (Homo sapiens and Saccharomyces cerevisiae). We adopted Vue.js as the responsive framework and used BootstrapVue, which builds on Vue.js and the front-end CSS library Bootstrap v4, to implement the components and grid system. Nginx serves both as a static resource server and as a proxy for requests to the back-end servers. Python is the back-end language, and Python-Flask is the back-end framework that handles user requests. MongoDB was chosen as the database server, and PyMongo is used to access it. Visualization engines, namely Mol* and Online 3D Viewer, are incorporated into the protein information page to visualize protein surface physicochemical properties and the RNA binding landscape. To interactively visualize the PSE format file from NucleicNet and the customized PLY file from MaSIF, we used PyMOL and an offline Python script to translate these files into standard PDB files and colored PLY files, respectively. All services, including the Nginx server, the Python-Flask server, and MongoDB, run as Docker containers orchestrated by Docker-Compose; they interact internally and are independent of the local machine.
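As an illustration of the file translation step, the sketch below writes an ASCII PLY mesh with per-vertex colors derived from a scalar surface feature. It is not the authors' conversion script: the blue-white-red color ramp, the function names, and the toy triangle are assumptions, and only the standard PLY header properties (`x`, `y`, `z`, `red`, `green`, `blue`, `vertex_indices`) are used.

```python
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]


def value_to_rgb(value: float, vmin: float, vmax: float) -> Tuple[int, int, int]:
    """Map a scalar onto a blue-white-red ramp (an assumed colour scheme)."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    if t < 0.5:
        s = t / 0.5  # blue (0, 0, 255) towards white (255, 255, 255)
        return (round(255 * s), round(255 * s), 255)
    s = (t - 0.5) / 0.5  # white towards red (255, 0, 0)
    return (255, round(255 * (1 - s)), round(255 * (1 - s)))


def colored_ply(vertices: Sequence[Vertex], faces: Sequence[Face],
                values: Sequence[float], vmin: float, vmax: float) -> str:
    """Return the text of an ASCII PLY file with one RGB colour per vertex."""
    if len(vertices) != len(values):
        raise ValueError("need exactly one feature value per vertex")
    lines = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(vertices)}",
        "property float x",
        "property float y",
        "property float z",
        "property uchar red",
        "property uchar green",
        "property uchar blue",
        f"element face {len(faces)}",
        "property list uchar int vertex_indices",
        "end_header",
    ]
    for (x, y, z), value in zip(vertices, values):
        r, g, b = value_to_rgb(value, vmin, vmax)
        lines.append(f"{x} {y} {z} {r} {g} {b}")
    for face in faces:
        lines.append("3 " + " ".join(str(i) for i in face))
    return "\n".join(lines) + "\n"


# Toy triangle with a hypothetical charge-like feature in [-1, 1].
ply_text = colored_ply(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    faces=[(0, 1, 2)],
    values=[-1.0, 0.0, 1.0],
    vmin=-1.0,
    vmax=1.0,
)
print(ply_text)
```

Baking the colors into the mesh file in this way lets a generic web viewer display a surface property without knowing anything about the feature that produced it.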

As shown in Figure 2, the top-right panel illustrates the homepage of ProNet DB. Users can browse, search, analyze, and download the protein structures as well as the surface property files of interest. The page for each protein consists of two components: meta-data and visualization of the processed results, which include the protein surface physicochemical properties produced by MaSIF and the protein-RNA binding profiles predicted by NucleicNet.

Currently, the database contains over 33,000 entries for proteomes from AlphaFold DB and PDB (Homo sapiens: 19,987, 6,625; Saccharomyces cerevisiae: 6,729, 1,664). We further classify these proteins by function into sub-classes. Figure 6 demonstrates the considerable protein structure prediction performance, since 80.6% of validated proteins in Homo sapiens and 74.8% of validated proteins in Saccharomyces cerevisiae are accurately predicted (RMSD ≤ 2.0). In Figure 3 (B), we report the proportions of hydrophobic and hydrophilic vertices over the total number of vertices for AlphaFold2 Human proteins and compare them with the interacting face proportion. Figure 3 (B) also shows a clear pattern: in AlphaFold Yeast proteins, the proportion of the Hbond receptor region is statistically higher than that of the Hbond donor region. In addition, Figure 3 (D) reveals that the overall number of nucleic acids for Yeast is much higher than that for Human. The statistics of the chain numbers for both the PDB Human and Yeast data are illustrated in Figure 3 (E).
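The accuracy criterion above can be made concrete with a minimal sketch of the RMSD check. The text does not state how predicted and experimental structures were superposed or which atoms were compared, so the sketch assumes two equal-length lists of matched coordinates that are already superposed, and it treats the 2.0 cutoff as a value in ångströms; the function names and toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(predicted: Sequence[Coord], experimental: Sequence[Coord]) -> float:
    """RMSD between two matched, already superposed coordinate lists."""
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (px - ex) ** 2 + (py - ey) ** 2 + (pz - ez) ** 2
        for (px, py, pz), (ex, ey, ez) in zip(predicted, experimental)
    )
    return math.sqrt(squared / len(predicted))


def is_accurate(predicted: Sequence[Coord], experimental: Sequence[Coord],
                cutoff: float = 2.0) -> bool:
    """Apply the RMSD <= cutoff criterion (cutoff assumed to be in angstroms)."""
    return rmsd(predicted, experimental) <= cutoff


# Toy example: every atom is displaced by exactly 1.0 along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = rmsd(model, reference)
print(toy_rmsd, is_accurate(model, reference))
```

The reported percentages would then correspond to the fraction of validated proteins for which such a check returns true.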

Since the Cas9 protein plays an important role in the CRISPR-Cas system, understanding how Cas9 mediates RNA-guided DNA recognition is essential to improving gene editing systems. The crystal structure of Staphylococcus aureus Cas9 (PDB: 5AXW) was chosen for the analysis of protein surface physicochemical properties and RNA-binding profiles.

As shown in Figure 4, we highlight the SaCas9-sgRNA complex structure with its bound guide RNA, in which a central channel is formed in the middle of the structure. The interacting face (Iface) fingerprint in ProNet DB shows that the original nucleic acid binding site, unlike the non-binding regions, is located in the interacting face region. In addition, the electron donor fingerprint shows that the inner region is positively charged, indicating the interaction between the protein surface and nucleic acids in the central channel between the recognition and nuclease lobes Jinek et al. [2014], Nishimasu et al. [2014]. In Figure 4, the predicted protein-RNA binding profiles, with specific RNA binding sites and binding preference, indicate that the corresponding RNA molecule is located in the inner region of the protein structure, which is consistent with the ground truth. These results show that such an in silico approach can efficiently capture the protein physicochemical surface properties as well as the RNA binding landscape, and lay the foundation for downstream tasks such as gRNA design Doench et al. [2016] and CRISPR system optimization Hu et al. [2021].

ProNet DB has several potential applications and some of them are listed below:

(i) Develop computational approaches and machine learning frameworks for predicting the binding affinity and binding sites of protein-molecule complexes. For example, given the surface properties of a target protein, a compound-protein interaction prediction framework can be proposed for drug discovery.

(ii) Validate the effect of mutations on the binding affinity of a complex. A mutation at a protein site with a characteristic surface property may indicate disruption of protein function. For example, mutations that introduce hydrophilic amino acids at a hydrophobic site can be considered potential disease-causing mutations.

(iii) Design RNA aptamers and antibodies for targeted therapy. For example, based on the surface properties of human ACE2, an ACE2 mutant decoy could be designed for COVID-19 therapy. Based on the RNA binding landscape of a target protein, an RNA aptamer could be designed for a drug delivery system.

(iv) Explore the molecular mechanisms of protein-molecule interactions at the amino acid level. For example, β-catenin is one of the key factors of the Wnt/β-catenin signaling pathway in embryonic development. β-catenin is expected to bind to co-factors (e.g. SOX2) to exert its cis-regulatory role in the genome. Different co-factors may lead to different embryonic differentiation statuses, so it is important to reveal the amino acid-level mechanism of the interaction between β-catenin and its co-factors.
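The mutation screen described in application (ii) can be prototyped at the residue level. The sketch below flags substitutions that replace a hydrophobic residue with a hydrophilic one using the Kyte-Doolittle hydropathy scale; this residue-level scale is used here as a stand-in for the per-vertex surface hydropathy stored in the database, and the function name and the cutoff of 0.0 are assumptions. A flag from such a rule is only a candidate for follow-up and is not evidence of pathogenicity.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def is_hydrophobic_to_hydrophilic(wild_type: str, mutant: str, cutoff: float = 0.0) -> bool:
    """True when a hydrophobic residue is replaced by a hydrophilic one.

    `cutoff` splits the scale into hydrophobic (> cutoff) and hydrophilic
    (< cutoff); the value 0.0 is an assumption chosen for illustration.
    """
    try:
        before = KYTE_DOOLITTLE[wild_type.upper()]
        after = KYTE_DOOLITTLE[mutant.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown amino acid code: {exc.args[0]}") from None
    return before > cutoff and after < cutoff


# Toy substitutions written as (wild type, mutant).
flags = {
    ("L", "D"): is_hydrophobic_to_hydrophilic("L", "D"),
    ("L", "I"): is_hydrophobic_to_hydrophilic("L", "I"),
    ("K", "E"): is_hydrophobic_to_hydrophilic("K", "E"),
}
print(flags)
```

In practice such a rule would be combined with the surface annotation, so that only substitutions at residues exposed in a hydrophobic or interacting-face region are considered.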

In the first version of ProNet DB, we have processed 33,219 entries obtained from two major databases, AlphaFold DB and PDB Berman et al. [2003], covering Homo sapiens and Saccharomyces cerevisiae, with two protein surface property annotation and prediction algorithms (MaSIF and NucleicNet). During the period of this study, AlphaFold DB contained over 360,000 predicted structures from 21 species and PDB contained 201,060 structures in total; future releases will expand the species and property coverage. To expand the set of protein properties, we will apply more in silico methods, such as DeepSurf Mylonas et al. [2021] and HOLOPROT Somnath et al. [2021]. A 3D structure preview in the web browser facilitates research because it conveys information intuitively. Future releases will therefore also cover more visualization details, including meta information for each entry and 2D displays of the input protein. In addition, we aim to provide a one-stop database for RNA-binding protein and RNA representations, including protein language model representations [Yu et al., 2021, Hong et al., 2021], RNA foundation model and structure [Chen et al., 2020] representations. Such a database will greatly accelerate related studies.

All metadata and processed data are now available, and several approaches are provided to access the data. 

Evaluation of methods for modeling transcription factor sequence specificity

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

Recent advances in the development of protein-protein interactions modulators: mechanisms and clinical trials

Protein docking using a single representation for protein surface, electrostatics, and local dynamics

Protein surface interactions-theoretical and experimental studies

Current insights on protein behaviour in hydrophobic interaction chromatography

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models

Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning

The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

Using AutoDock 4 and AutoDock Vina with AutoDockTools: a tutorial. The Scripps Research Institute Molecular Graphics Laboratory

Protein-RNA interaction prediction with deep learning: structure matters

PUF proteins: repression, activation and mRNA localization

A deep learning framework to predict binding preference of RNA constituents on protein surface

Methods for optimizing CRISPR-Cas9 genome editing specificity

RNA-binding proteins in human genetic disease

Aptamer-guided nanomedicines for anticancer drug delivery

Protein Data Bank (PDB): the single global macromolecular structure archive

Molecular mechanisms behind anti SARS-CoV-2 action of lactoferrin

Utilizing graph machine learning within drug discovery and development

Deep learning in protein structural modeling and design

Structures of Cas9 endonucleases reveal RNA-mediated conformational activation

Crystal structure of Cas9 in complex with guide RNA and target DNA

Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9

Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity

Announcing the worldwide Protein Data Bank

DeepSurf: a surface-based deep learning approach for the prediction of ligand binding sites on proteins

Multi-scale representation learning on proteins

HMD-AMP: protein language-powered hierarchical multi-label deep forest for annotating antimicrobial peptides

fastMSA: accelerating multiple sequence alignment with dense retrieval on protein language

DeepACR: predicting anti-CRISPR with deep learning. bioRxiv

Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions

RNA secondary structure prediction by learning unrolled algorithms