key: cord-0295731-bvpjhqqk
authors: Zemla, Adam T.; Allen, Jonathan E.; Kirshner, Dan; Lightstone, Felice C.
title: PDBspheres - a method for finding 3D similarities in local regions in proteins
date: 2022-05-14
journal: bioRxiv
DOI: 10.1101/2022.01.04.474934
sha: ef388c90a135dd4bd3ec9da960b9f37e38c6d25c
doc_id: 295731
cord_uid: bvpjhqqk

We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions (“spheres”) adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches between a protein of interest and spheres in the library. PDBspheres uses the LGA structure alignment algorithm as the main engine for detecting structure similarities between the protein of interest and template spheres from the library, which currently contains more than 2 million spheres. To assess confidence in structural matches an all-atom-based similarity metric takes sidechain placement into account. Here, we describe the PDBspheres method, demonstrate its ability to detect and characterize binding sites in protein structures, show how PDBspheres - a strictly structure-based method - performs on a curated dataset of 2,528 ligand-bound and ligand-free crystal structures, and use PDBspheres to cluster pockets and assess structural similarities among protein binding sites of 4,876 structures in the “refined set” of the PDBbind 2019 dataset.

Interactions between proteins and small molecule ligands are a cornerstone of biochemical function.

Modern drug discovery often relies on structural information about the target of interest (typically a protein). However, when a new structure is obtained, it may be the case that little is known with regard to potential binding sites on that structure. Numerous studies have focused on understanding proteinligand interactions (e.g., 1, 2) to further the development of binding site prediction methods (e.g., [3] [4] [5] [6] as well as benchmark datasets (e.g., [7] [8] [9] [10] [11] used to compare and evaluate their performance. In general, the binding site prediction methods can be categorized in three broad sets: 1) templatebased methods that use known protein information, 2) physics-based methods that rely on geometry (for example, cavity detection) and/or physicochemical properties (for example, surface energy interactions with probe molecules), and 3) machine learning (ML) -rapidly developing in recent years methods capable of efficiently processing information collected in their training data sets. For ML methods data can come from both experiments and in silico data processing, as described for example in (12) (13) (14) (15) . Our method, PDBspheres is template-based; however, it relies solely on local structure conformation similarities with templates extracted from structures from PDB -i.e., PDBspheres does not use any prior information from libraries of sequences, motifs or residues forming binding sites that are collected separately to enhance the method performance. In general, template-based methods can be categorized as 1) those that rely on only protein sequence, e.g., Firestar (16) , 2) those that rely on only protein structure, e.g., SP-ALIGN (17) , or 3) those that rely both on sequence and structure, e.g., Concavity (18) . Some methods, e.g., SURFNET (19) are classified as structure-based methods, but since they use geometric characteristics, they are grouped within geometric-based methods (20) .

One of the strengths of the developed PDBspheres method is that in the binding site searches we use only those template models of cavities (PDBspheres library) that are already observed as proteinligand binding sites in other proteins (experimentally solved protein-ligand complexes deposited in PDB). This approach ensures that any structural shape which might resemble a cavity in evaluated protein models -e.g., holes within interchain interfaces, missing fragments in coordinates, or other structural imperfections in geometry -will not be reported by PDBspheres searches as cavity candidates, simply because they are not evidenced as ligand-binding cavities in reality, i.e., in confirmed-by-experiment PDB structures. To resolve the problem with this kind of false-positive pocket assignments some template-based methods may need additional information about proteins collected from calculated sequence alignments, patterns in residues forming binding sites, etc. A natural strength of structure-template-based methods such as PDBspheres lies in their ability to detect similar binding sites and help determine the biochemical function even if the sequence identity is very low, i.e., when similarity between pockets cannot be detected with confidence based on the sequence. A limitation, however, of the PDBspheres method is that its library depends on structural information derived from the PDB database, so identification of a new pocket or function relies on data availability in PDB, i.e., pockets having shapes not represented in PDB will not be found.

In the PDBspheres method, binding sites are identified exclusively based on the structure similarity between regions from the query protein and pocket spheres from the PDBspheres library. Ligand placement within a predicted pocket is calculated based on a protein-sphere superposition, i.e., an agreement in structure conformation between atoms from the query protein and protein atom coordinates from the template sphere. The main premise of structure template-based binding site prediction methods is that the number of pockets is limited (2) , therefore each one of them may serve as a binding site for a large diversity of ligands (17) . However, possible conformations of ligands bound in the pocket are also limited. Indeed, when we predict a ligand-protein conformation we focus on what parts of the ligand (e.g., its core region) need to be in a correct conformation, i.e., a conformation that can be confirmed by experiment and from which the binding affinities can be estimated. Usually such "correct" conformations are shared between ligands that bind a given pocket (at least shared by their "core" regions.) The evaluation of the correctness of the ligand placement within a pocket is not an easy task. For example, when for a given protein two pocket predictions with different ligands inserted need to be evaluated a simple calculation and comparison of the ligand centroids (the center of mass of the bound ligands) can be misleading. This is because the two predictions may differ in assigned placements of the ligands within a pocket (especially when the pocket size is large allowing the ligands to fit in different areas within the cavity), or the ligand sizes and their shapes are different (e.g., some parts of the ligands can be exposed outside the pocket with different orientations (21); see Figure 1 ). These difficulties in the assessment of the accuracy of binding site predictions and ligand placements within predicted pockets could be overcome when we focus on the evaluation of residues interacting with the bound ligands. In Figure 1 we present an example of the b1 domain from the Human Neuropilin-1 (PDB 2qqi) where two different poses of the same ligand EG01377 (PDB id: DUE) are possible (21) . Superposition of two experimentally solved structures of Neuropilin1-b1 domain in complex with EG01377 shows almost identical protein structure conformations while poses of bound ligands differ significantly outside the "core". The distance between centroids of the ligand's two orientations placed in the pocket is more than 4.6 Å when the centroid is calculated on all ligand's atoms, and only 0.7 Å when it is calculated on the buried parts of the ligands -"core". Superposition of two experimentally solved structures of Neuropilin1-b1 domain in complex with EG01377 (in green: PDB 6fmc at resolution 0.9 Å, and in red: PDB 6fmf at resolution 2.8 Å) shows almost identical protein structure conformations while poses of bound ligands differ significantly outside the core. The distance between centroids of the ligand's two orientations when placed in the pocket is more than 4.6 Å.

Knowing the correct "core" conformation of the ligand within the pocket (i.e., the pose of the conserved part) can significantly improve "in silico" drug discovery; it can be used in compound screening/docking efforts as a pre-filter for selection of most promising compounds for a more detailed and expensive computational evaluation. Currently, there are millions of compounds screened using docking or molecular mechanics-generalized Born surface area (MMGBSA) approaches. These methods are widely used to estimate ligand-binding affinities in identified binding sites of targeted proteins to find good candidates for a further (experimental) analysis of potential inhibitors.

Assessment of the performance of binding sites prediction methods is not an easy task because they may perform differently on different benchmarks. In Critical Assessment of Protein Structure Prediction (CASP) experiments (rounds 8, 9, and 10) (22) (23) (24) an attempt was made to compare different ligand binding residue prediction methods, but it was found that the assessment might be biased (not enough targets, targets not diverse enough to represent different families, folds, ligand type, and ligand binding sites) and this category was subsequently dropped from CASP. A comprehensive benchmark dataset was recently designed by Clark et al (25, 26) . It covers 304 unique protein families with 2,528 structures in 1456 "holo" and 1082 "apo" conformations. A detailed comparison analysis using this benchmark dataset was performed on seven binding site prediction methods (25) . The authors decided to exclude template-based methods from their evaluation to avoid unfair comparison with non-template-based methods. As stated in (25) template-based methods use "libraries of sequence-based template information" which "would inevitably lead to the use of holo structure knowledge to solve the binding site locations in apo structures." So, it was expected that template-based methods would easily outperform non-template-based methods on the benchmark.

The accuracy of pocket prediction methods in Clark et al. (25) analysis and during CASP was measured mainly by the Mathew's Correlation Coefficient (MCC). This metric assigns high scores not just when at least one residue in a given binding site is identified correctly, but rather rewards methods with more correct and less false predictions of contact residues implying which of the algorithm pocket predictions are close to the "correct" location on the binding surface of the protein. In this paper, we use the same metric to evaluate the performance of our method with varying datasets of structural templates used by PDBsheres to make predictions. Our reported results were merged with results from the assessment of non-template-based methods performed by Clark et al. (25) in Table 1 and also a discussion of the performance of template-based methods from CASP (22) (23) (24) is included to illustrate how PDBspheres places within these methods and what the limits of structurebased pocket prediction methods may be.

The PDBspheres system is designed to help assess the similarity between proteins based on their structure similarity in selected local regions (e.g., binding sites, protein interfaces, or any other local regions that might be structurally characteristic of a particular group of proteins). The PDBspheres system has three main components: 1) the PDBspheres library of binding site templates, 2) a structure similarity search algorithm to detect similarity between evaluated local structural regions (e.g., binding sites), and 3) numerical metrics to assess confidence in detected similar regions.

Currently, the PDBspheres library (ver. 2021/10/13) contains 2,002,354 compound binding site models (template spheres) and 67,445 short-peptide binding site models derived from protein structures deposited in the PDB database (27) .

The Local-Global Alignment (LGA) program (28) is used to perform all structure similarity searches.

While comparing two structures, the LGA algorithm generates many different local superpositions to detect regions where proteins are similar, and calculated similarity scores are not affected by some perturbations in small parts of compared proteins. In essence, LGA assesses structure similarity based on detected conserved parts between two proteins or protein fragments while removing small deviating regions from calculated scores. This capability of LGA allows accurate detection of similar pocket shapes. The main metrics to assess confidence in detected pockets are a scoring function LGA_S (a combination of Global Distance Test (GDT) and Longest Continuous Segments (LCS) measures (28)), which evaluates structure similarities on Calpha and/or Cbeta levels, and GDC (Global Distance Calculations (29) ), which allows evaluation on an all-atom level, i.e., extending structure similarity evaluation to the conformation of side-chain atoms.

When applied to predict protein binding sites in a given protein structure, PDBspheres detects ligandprotein binding regions using the pocket/sphere templates from the PDBspheres library constructed from all available structures deposited in the PDB database. After a binding pocket is detected the ligand(s) from matching template(s) is/are inserted into the identified pocket in a query protein to illustrate an approximate location of the ligand. The location is approximate because it is based on the alignment of the query protein with the template protein spheres, and no docking or any energy minimization or structure relaxation are undertaken. Thus, it should be noted that PDBspheres is not a docking system to predict de novo ligand poses within a binding site. However, further relaxation of the predicted protein-ligand complex can be considered as the next step to improve a ligand placement within a pocket (which is a subject of continuing PDBspheres system development.)

In the process of detecting pocket candidates within protein structures, similarity searches can be performed using all templates in the PDBspheres library (i.e., exhaustive search, testing all 2.0 M pockets from the library), or if time is prohibitive, searches can be performed on a preselected subset of the sphere templates. For example, the preselection process can be based on specific targeted ligands (e.g., their names or sizes), or can be limited to those template spheres from the PDBspheres library that come from homologous proteins. The computational time of the structure processing can vary depending on the size of the protein or the number of template spheres from the PDBspheres library preselected by the system for the pocket detection and similarity evaluation. An exhaustive search (against the entire PDBspheres library) can take more than one day on a single processor machine, however with above-mentioned preselection procedures the calculations can be completed in less than one hour for medium size proteins. For example, the processing of three protein structures discussed in the manuscript and illustrated on Figure 1 (Neuropilin-1, 158aa), Figure 2 (PL2pro, 318aa), and Figure 6 (Tryptase, 248aa)) took 7 min, 52 min, and 50 min respectively on a single processor with 16 cores machine.

The clustering of identified pocket-spheres matches and their corresponding ligands characterizes distinct pocket regions within the protein by the sets of residues interacting with inserted ligands. The clustering of interacting residues allows identification of overlapping parts of ligands that approximate the ligand's "core" conformation within detected pockets and for each cluster defines a representative set of residues forming a consensus pocket. An example of an identified cluster of pocket-spheres is shown in Figure 2 . Each protein may have more than one such consensus pocket.

Each protein-ligand template sphere in the PDBspheres library is a subset of the records from a corresponding PDB entry, consisting of the coordinates of all atoms belonging to a given ligand, and coordinates of all protein atoms belonging to residues near that ligand (water atoms are excluded).

The protein residue is part of the sphere if at least one of its atoms is within 12.0 Å of any atom of the ligand. Previous research indicated that distances of 7.5 Å are sufficient to capture informative functional properties for clustering purposes (30) ; however, based on our experimentation using the

LGA program for detecting local structure similarities we expanded the distance to 12.0 Å. This size of the "template sphere" is sufficient to capture local structural environment of the ligand for protein structure comparative needs. Our tests indicated that larger than ~12 Å distance spheres would affect accuracy in calculated structure conformation-based local residue-residue correspondences between template spheres and the query protein within detected pockets' regions. On the other hand, the smaller than ~12 Å distance criteria may not provide sufficient fold constraints to capture the uniqueness of searched pockets. To assist in identification of functional residues, PDBspheres also collects and reports information on protein-ligand interface residues. These residues are identified as those of which at least one atom is within 4.5 Å of any ligand atom.

In the PDBspheres library the 12.0 Å "sphere" entries have been constructed for each ligand in PDB, including peptides, metals, and ions, although the library includes only peptides containing 25 or fewer residues. The library is updated weekly in coordination with new PDB releases. As of 2021/10/13 the library consists of 2,069,796 spheres (binding site templates).

The primary use of the PDBspheres library is to identify "sphere" protein structures that are structurally similar to regions of a query protein structure. The query protein structure may be a complete, multimeric assembly, or it may be a single sub-unit of such an assembly, or even a fragment of a protein as long as it carries enough structural information (local structure conformation formed by atom coordinates) to reflect a structural shape of a putative binding site.

The fundamental identification method used by PDBspheres is structural alignment of a sphere with the query protein structure, and the assessment of the structural match. For a general use, a comprehensive search -that is, a structural alignment-based search of the entire PDBspheres library (over 2 million sphere templates) -would be computationally expensive. One means to address this issue is to conduct an initial search for matches to a query protein structure on a subset of sphere templates. Then, if needed, the search can be expanded to additional templates from the library. In its current implementation for possibly faster processing, the PDBspheres library searches can be restricted to a subset of template spheres selected based on 1) ligand specificity, e.g., ligand names, their sizes or similarities, or 2) sequence similarity between the query protein and PDB proteins from which the PDBspheres library entries were derived. In the latter approach, the sequence similarity searches are conducted using the Smith-Waterman algorithm against FASTA-format sequences of all proteins contributing to the PDBspheres library entries. In the current version of PDBspheres, the Smith-Waterman algorithm "ssearch36" (31) is used.

In the PDBspheres library each sphere template entry includes a number of characteristics such as the ligand name (PDB ligand ID), the number of heavy atoms in the ligand, and the number of residues in the protein fragment forming the protein-ligand "sphere" (that is, the size of the pocket template). Thus, the list of matching spheres can be screened using specific criteria related to the expected pocket size, exact ligand name or ligand similarity. A selection of the template spheres based on the estimated similarity between the expected ligand and the PDB ligands from the PDBspheres system can be quantified by the Tanimoto or Tversky similarity indexes (32) .

After a selection of template spheres is completed, each member of this set is evaluated by assessing The PDB ligand from the sphere template is translated and rotated according to the transformation that results from aligning the sphere template atoms to the query protein atoms. The ligand atoms are not considered in this alignment, and their conformation is not altered. It means that potential steric clashes between protein residues and atoms of "inserted" ligands can be observed. The number of possible clashes is reported.

PDBspheres reports the following set of characteristics for all pockets detected within a given protein:

(a) PDB identifier of the protein structure from which a template sphere matching the query protein pocket was derived, (b) PDB identifier of the ligand from the template sphere matching the query protein pocket, (c) list of residues in the query protein in direct contact with the inserted ligand (residues within the distance <= 4.5 Ångstroms), (d) coordinates of the centroid of the ligand inserted into the detected pocket in the query protein.

Because the pocket searches can match template spheres to different regions of the query protein all identified pocket candidates are organized into clusters. Within the PDBspheres approach the merging and clustering of individually detected pockets (i.e., pockets detected based on individual template spheres) is performed to satisfy the following two criteria: 1) within each cluster the pockets are grouped together based on sets of query protein residues interacting with inserted ligands. When more than 80% of predicted contact residues are the same, then these sets are merged to define residues forming the pocket, 2) pockets and the sets of residues identified as in contact with ligands are merged when the inserted ligands overlap (the distances between ligand centroids are not larger than 2.0

This grouping of pockets and their corresponding inserted ligands allows to define sets of residues interacting with similar ligands and also helps identify overlapping parts of different ligands. Such overlapping parts can approximate the ligand's "core" conformation, whose protein contact residues can be used to define representative (consensus) pockets within the protein.  LIGAND -protein-ligand template sphere identifier which includes PDB id of the ligand, ligand size, and PDB id of the protein-ligand complex from which the sphere was extracted.

 Ns -number of residues forming the protein-ligand template sphere.

 RMSD -root mean square deviation calculated on superimposed Calpha or Cbeta atoms from sphere template and detected protein pocket.

 Nc -number of conserved, i.e., "tightly" superimposed, residue pairs between the template sphere and detected pocket in an evaluated protein.

 SeqID -sequence identity in structure aligned conserved residues. The higher value indicates that the protein forming a template sphere and the query protein might be close homologs.

 LGA -structure similarity score based on aligned by the LGA program Calpha or Cbeta atoms.

 GDC -structure similarity score calculated by the LGA program assessing agreement in conformations of all atoms (i.e., including side chain atoms).

 N4 -the number of predicted protein-ligand contact residues (i.e., query protein residues observed within 4.5 Å of the inserted ligand).

 cl -the number of query protein residues that may have possible steric clashes with inserted ligand's atoms.

Firstly, to be considered a predicted site candidate, there must be at least ten aligned residues (Nc>=10) that are conserved between the query protein and the sphere. Secondly, the structural similarity measured by GDC (29) , which counts how many atoms (including side chain atoms) in the query protein and the template sphere are in close superposition, must be at least 55% (GDC>=55.0).

Thirdly, there must be at least one query protein binding site residue that has an atom within less than 4.5 Å of the atom of the inserted PDB ligand (N4>=1). Finally, there are no more than two steric clashes (cl<=2) (distance less than 1.0 Å) allowed between query protein binding site residue atoms and any atom of the inserted ligand. The "cl" number is counted per residue which means that multiple clashing atoms, all within a single residue, are allowed, but not more than two residues can be involved in clashes.

We may expect a higher confidence in predicted binding sites when: Nc>=25, GDC>=65, and cl<=1.

Note that these thresholds cannot be considered as absolute requirements. For example, possible clashes indicated by "cl>0" may suggest a need to correct the placement of the inserted ligand.

Ideally, a score of "cl=0" would enhance the confidence in ligand-protein binding prediction, however, in many cases when "cl>0" the ligand placement within the pocket can be fixed by additional adjustment or refinement of protein side-chain atoms through docking or molecular dynamics (MD) simulations. It is also important to keep in mind that the query protein structure may not be in its "holo" conformation to properly accommodate the ligand (i.e., without clashes), so the conformation of binding site residues of the protein may require more substantial optimization. Indeed, recent studies of ligand binding site refinements show significant success in generating reliable "holo" (ligand-bound) protein structures from their "apo" (ligand-free) conformations (33) .

If N4 -the number of protein atoms within 4.5 Å of the ligand -is very low, then this indicates that the current placement of the ligand does not show strong interactions with residues of the protein binding site. This may suggest that the pocket is too large (or some residues in the protein model are missing, or side-chain atoms are not in the right conformation). Of course, it may also indicate that the identified pocket is incorrect, or the location of the inserted ligand is wrong, but these conclusions can only be confirmed by more detailed inspection. In such cases it can be informative to check the overlap of the template sphere pocket and the query protein pocket (i.e., check the Nc number of conserved superimposed pocket-forming residues). Higher overlaps (e.g., Nc>=25 or more) indicate greater size of the region forming the pocket and thus the higher confidence in reported pocket similarities (more residues identified as conserved). However, such thresholds cannot be decided upfront because in cases of shallow cavities or interface sites on the surface of the protein the Nc number can be low. indicates that all listed ligands are predicted to bind the same pocket identified through clustering as "pocket cluster #1".

The benchmark dataset used to assess the performance of PDBspheres is the "LBSp dataset" described in Clark et al. (25, 26) . It comprises 304 unique protein families (2,528 structures in 1456 "holo" and 1082 "apo" conformations) with each family represented by several protein structures in different possible conformations. Since this LBSp dataset is curated, well designed, large and diverse, we have found it very suitable to test the performance of the PDBspheres method comprehensively.

The main criterion often used for the assessment of the accuracy of predictions of the binding site is the agreement in the set of residues predicted to be in contact with experimentally confirmed ligands (residue contacts derived from protein-ligand co-crystals from PDB (27)). To estimate the accuracy of our method we followed the same metrics and evaluation procedures as used in (25) . The authors defined reference lists called unified binding sites (UBS) as unions of all residues contacted by any bound ligands within the protein family. The scoring of binding site predictions is determined by agreement with the UBS reference using calculated numbers of any residue predicted to be part of the binding site which is confirmed in the experimental data is denoted as true positives (TP), all remaining residues (which are not predicted as part of the binding site and not confirmed in experimental data) are denoted as true negative (TN), any residue predicted to be part of the binding site which is not confirmed in experimental data is denoted false positive (FP) or over-prediction, and any remaining residues in the experimental data not accounted for in an algorithm's predicted binding site are denoted as false negative (FN) or under-predictions.

have been used as metrics to represent the predictive power of evaluated binding site prediction methods.

(1)

Of course, there are other metrics sometimes used to evaluate protein-ligand binding sites detection successes, e.g., distance center center (DCC) (34) and discretized volume overlap (DVO) (12) .

DCC is a metric to evaluate the correct location of the pocket based on the distance between the predicted and the actual center of the pocket (centroids). If DCC is below 4 Å, the pocket is considered as correctly located. DVO, on the other hand, is defined as the ratio of the volume of the intersection of the predicted and the actual segmentations to the volume of their union. It assesses the correctness of the shape of predicted pockets. However, in our study we follow metrics described in Clark et al. (25) as they focus on assessing accuracy of binding site prediction methods based on the correctness of identified pocket residues being in contact with ligands. Similar approaches were used in CASP experiments to evaluate the performance of binding site prediction methods.

In Clark et al. (25) , predictive power of different methods is assessed using two metrics: F scores and Matthew's Correlation Coefficients (MCCs). The F scores and MCCs provide a good description of the relative success of evaluated algorithms by assessing whether or not a method produced a predicted binding site that contains residues in common with the reference UBS list of ligand contact residues.

They assign high scores not just when at least one residue in a given site is identified correctly, but rather reward methods with more correct and less false predictions of contact residues implying which of the algorithm pocket predictions are close to the "correct" location on the binding surface of the protein. Note that a similar evaluation procedure using MCC scores between predicted and observed contact residues was also applied in CASP experiments. (38) , AutoSite (39) and Kalasanty (15) . Five of these methods are considered to be geometry-based, while one is energy-based, and one is machine-learning-based. To perform our analysis, we took prediction results for these seven methods from the supplemental data from the publication (25) . In Table 1 we present the prediction results for PDBspheres and recalculated -for consistency -F and MCC scores of the seven mentioned above methods. The last six data rows in Table 1 show results from the evaluation of PDBspheres when the pocket detection was performed using the complete PDBspheres library (100%), and when the PDBspheres library was restricted to templates with no more than 90%, 80%, 70%, 60% and 50% sequence identity with query proteins, respectively. The six rows for PDBspheres include also the mean values of MCC in addition to the median values since this metric was used by CASP assessors. 

Here, we discuss how close in sequence identity two proteins need to be to have structurally similar binding sites, addressing these two questions:

 How low the level of sequence identity between two proteins can be and still share similar pockets and perform similar functions?

 Do such proteins have enough similarity in their functional sites to bind ligands in a similar manner? Figure 5 illustrates results from the LBSp database pocket identification calculations, which were summarized in Table 1 . These results illustrate that in contrast to the large diversity of possible protein sequences the number of structurally distinct pockets is limited, therefore proteins that by sequence are very different may still share almost identical structural conformations in local regions (e.g., binding sites) and can perform similar functions. This observation served as a basis for the development of our PDBspheres structure template-based binding site detection method. The only limitation in its ability to identify correct pockets for a given protein might be an underrepresentation of particular pocket's conformation in currently experimentally solved protein structures deposited in the PDB database. For example, in case when a non-restricted library is used the binding sites for all 304 families ("apo" and "holo" protein conformations) were predicted. However, in the case of the restricted library (which excludes template spheres derived from proteins showing more than 90% sequence identity to the query proteins) the binding sites for 5 families of "apo" versions and 7 families of "holo" versions were not predicted (see Table 2 ). In Table 2 The same can be observed in Figure 5 . More rapid decline in the ability to successfully detected pockets starts when the restriction on the library gets below 50% of sequence identity. Let us note that the main difference between results shown in the Table 2 and Figure 5 is that the plot illustrates the correctness of the prediction (i.e., as accurate as possible identification of residues involved in protein-ligand interaction) while in Table 2 we report results from just detection of the pocket location in the protein (i.e., without any assessment of the accuracy and completeness of predicted interacting residues). Figure 5 . The plot illustrates how many pockets from the LBSp database proteins (all 2,528 pockets;

"apo" and "holo" combined) can be predicted based on binding site templates derived from proteins with lowest possible sequence identity to the query proteins. For example, at least 400 pockets (see left bottom part of the plot) can be predicted based on template binding sites derived from proteins that share as low as 20% sequence identity (SeqID) with a query protein. On the other hand, more than 2,400 pockets (~95%) can be correctly predicted based on templates sharing no more than 80%

of SeqID with query proteins in their binding site regions.

The PDBspheres method leverages this structure conservation quality efficiently. Indeed, thanks to the richness in diversity of protein structures deposited in PDB the system is able to identify location of 97% binding sites from the LBSp dataset using structural templates that share as low as 50% sequence identity with targeted proteins. And even with such significant restriction to the library of template spheres the accuracy in identified sets of residues interacting with ligands is still high ~0.75

(median) and ~0.7 (mean) as measured by F and MCC metrics (see Table 1 ).

As mentioned in Clark et al. (25) , we expected that the template-based method such as PDBspheres could outperform non-template-based methods which is evidenced in Table 1 . We should emphasize, however, that PDBspheres is an exclusively structure-based method which does not utilize any prior information from libraries of sequences/residues forming binding sites. Additionally, PDBspheres does not use any prior information about the location of searched pockets in proteins from the same family.

We treat all structures equally and independently (as pairwise comparisons) in finding structural similarities between the query protein and template spheres. Of course, we can find such cavities more easily in the "holo" conformations, but as the results show, we can also find adequate structure similarities in "apo" versions of the query protein; again, strictly based on similarities in structure without any sequence-based knowledge of residues forming the pocket (residue information that could be transferred from some databases of "holo" structures.) All results reported here are based on PDBspheres superpositions calculated on C-alpha atoms. Results based on superposition of C-beta atoms (results not shown) are virtually identical.

In Figure 6 we show an example of correct pocket detection by PDBspheres using different pocketligand template spheres derived from proteins that share low sequence identity. Two compared proteins (serine proteases) have less than 38% of sequence identity, but they are structurally similar at the level of over 86% by LGA_S (assessment of similarity on the C-alpha atoms level) and 77% by GDC (all atoms level). Of the residues that are in contact with corresponding ligands (distance below 4.5 Å) in the identified similar pockets, 11 out of 17 are identical (65%). The pocket template spheres come from two PDB complexes that bind inhibitor compounds having PDB ligand ids 2A4 and QWE, respectively. These ligands are significantly different in size. Therefore, the distance between ligand centroids calculated from the complexes of the same orientations is very different, specifically, the distance between centroids of ligands when inserted in any of these pockets is about 6.0 Å. However, the portions of the ligands inserted in pockets have a similar overlapping region and are in contact with similar amino acids from the pockets. In each of these two protein-ligand complexes the core parts of the ligands are in contact with 13 residues of which 11 are identical (85%) (see Figure 6 (C)

and (E)). We have had a similar observation in the example of Human Neuropilin-1 illustrated in Figure 1 , when we focused on possible discrepancies in distances between centroids of two experimentally confirmed orientations of the same ligand inserted into the same pocket. The distance between "core" parts of the ligand was very small, while the outside parts varied significantly. It suggests that if we are interested in identifying protein residues critical for ligand binding, we should concentrate mostly on those residues that are in contact with ligand's "core" parts. Likewise, as shown in Figure 6 , in case of different proteins and different ligands a high sequence identity of residues being in contact with "core" parts of the ligands can help identify those residues that may be critical for binding in similar pockets. These results illustrate how PDBspheres can be used to detect conservation in local structural conformations and to assess conservation of critical contact residues;

both can assist in inferring protein function. These conclusions are also supported by results from evaluation of large and diverse set of proteins collected in the PDBbind dataset (40) . Hence, in total the dataset of evaluated binding sites consists of 4,876 structures. Since we are interested in the assessment of similarities between specific pockets listed in proteins from PDBbind (the PDBbind database reports only one pocket for each protein regardless of how many different pockets a given protein may have or how many alternative locations of a given pocket in a multichain protein complex can be observed), we restricted our structure similarity searches and evaluations to only those regions in PDBbind proteins that encompass targeted pockets. Note that many protein structures in the PDBbind refined dataset are multidomain complexes with total sizes of more than 2000 residues, so they can have multiple binding sites in addition to the targeted ones. Therefore, in our approach to evaluate similarities and cluster pockets from PDBbind, each of its protein structures was reduced to the region in the close vicinity of a reported ligand (residues having any atom within 16 Å of any ligand atom), and each ligand binding site reported in PDBbind was associated with its corresponding PDBspheres template sphere (residues having an atom within 12 Å of a ligand atom).

We performed an all-against-all PDBspheres detection and pocket similarity evaluation using the 16 Å Figure 7 illustrates the clusters, and it is a snapshot taken from the HTML supplemental File6 (interactive overview of predicted clusters created using "plotly" R graphic library).

Each axis of Figure 7 File6. Figure 8 shows a "zoom-in" to cluster #277, which contains 24 protein-ligand pairs.

As an example of results from PDBspheres clustering, Figure 8 shows a "zoom-in" to cluster #277 (circled and marked by an arrow in Figure 7) , which contains 24 protein-ligand pairs. This "zoom-in"

shows details of PDBbind binding sites grouped together based on all-against-all structure similarity measured by the GDC score. Each of the proteins grouped into this cluster share the same EC subclass (3.4) and GO annotation (:0006508:0008237:0008270:), which indicate that the PDBspheres-based clustering can assist in making predictions useful in protein functional annotation. Results from the PDBspheres clustering of the PDBbind dataset suggest that the structure similarity between pockets/ligand placements is a significant characteristic that can allow prediction of similar binding affinity values. In future work, we anticipate enhancing current measurements adding information about the specific atom location of protein residues interacting with the ligand, which may improve our predictions. Complete results from PDBspheres analysis of pairwise similarities in binding sites between proteins from PDBbind are provided in supplemental File5.

While developing PDBspheres we focused on two goals: 1) binding pocket detection; and 2) identification of characteristics and scores to assess similarities between pockets to help further protein functional characterization and clustering.

In particular, with regard to binding pocket detection we find that PDBspheres' strictly structure-based approach can correctly predict binding site regions in protein structures known to be in a "holo" (i.e., ligand-binding) conformation as well as in protein structures in "apo" (without a ligand present) With regard to characterizing and evaluating binding pocket similarities among proteins we find that a high level of sequence similarity between different proteins is not essential to identify structurally similar binding sites and that proteins even significantly different by sequence may perform similar functions. In the structure-based detection of binding sites the similarity assessed based on calculated structural alignment using C-alpha atom positions is sufficient, and the use of other residues (e.g., Cbeta atoms, or other points representing residue) in the similarity assessment does not yield better results. Structurally similar binding pockets having similar ligand placements allow inference of binding affinity from one pocket-ligand pair to another pocket-ligand pair. PDBspheres-based clustering of detected pockets and calculated structure similarities among pockets from different proteins provide information that can significantly help protein function annotation efforts.

The PDBspheres library and the LGA program are made publicly available for download at http://proteinmodel.org/AS2TS/PDBspheres and http://proteinmodel.org/AS2TS/LGA

Supplementary Data are available at NAR online. 

On the diversity of physicochemical environments experienced by identical ligands in binding pockets of unrelated proteins

A comprehensive survey of small-molecule binding pockets in proteins

Graph-Based Clustering of Predicted Ligand-Binding Pockets on Protein Surfaces

ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment

A method for localizing ligand binding pockets in protein structures

Surface-based protein binding pocket similarity

A benchmark driven guide to binding site comparison: An exhaustive evaluation using tailor-made data sets. (ProSPECCTs)

Comparative assessment of strategies to identify similar ligand-binding pockets in proteins

Shape variation in protein binding pockets and their ligands

A new protein binding pocket similarity measure based on comparison of clouds of atoms in 3D: application to ligand prediction

CavBench: A benchmark for protein cavity detection methods

DeepSite: proteinbinding site predictor using 3D-convolutional neural networks

Improved Protein-Ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference

Improving detection of protein-ligand binding sites with 3D segmentation

Detection of protein-ligand binding sites with 3D segmentation

Firestar -prediction of functionally important residues using structural templates and alignment reliability

A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation

Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure

SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions

Structure-based methods for computational protein functional site prediction

Small Molecule Neuropilin-1 Antagonists Combine Antiangiogenic and Antitumor Activity with Immune Modulation through Reduction of Transforming Growth Factor Beta (TGF beta) Production in Regulatory T-Cells

Assessment of ligand binding residue predictions in CASP8

Assessment of ligand-binding residue predictions in CASP9. Proteins, 79 Suppl

Assessment of ligand binding site predictions in CASP10

Predicting binding sites from unbound versus bound protein structures

Inherent versus induced protein flexibility: Comparisons within and between apo and holo structures

The Protein Data Bank

LGA: A method for finding 3D similarities in protein structures

The other 90% of the protein: assessment beyond the Calphas for CASP8 template-based and high-accuracy models

Clustering protein environments for function prediction: finding PROSITE motifs in 3D

Finding Protein and Nucleotide Similarities with FASTA

Molecular Similarity in Medicinal Chemistry

Ligand-Binding-Site Refinement to Generate Reliable Holo Protein Structure Conformations from Apo Structures

A critical comparative assessment of predictions of protein-binding sites for biologically relevant organic compounds

Detection of multiscale pockets on protein surfaces using mathematical morphology

LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation

Fpocket: an open source platform for ligand pocket detection

DEPTH: a web server to compute depth and predict small-molecule binding cavities in proteins

AutoSite: an automated approach for pseudo-ligands prediction-from ligand-binding sites identification to predicting key ligand atoms

The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures