Kincore: a web resource for structural classification of protein kinases and their inhibitors

Vivek Modi, Roland Dunbrack Jr.
Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia PA 19111 USA

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.12.430923; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract
Protein kinases exhibit significant structural diversity, primarily in the conformation of the activation loop and other components of the active site. We previously performed a clustering of the conformation of the activation loop of all protein kinase structures in the Protein Data Bank (Modi and Dunbrack, PNAS, 116:6818-6827, 2019) into 8 classes based on the location of the Phe side chain of the DFG motif at the N-terminus of the activation loop. This is determined with a distance metric that measures the difference in the dihedral angles that determine the placement of the Phe side chain (the φ, ψ of X, D, and F of the X-DFG motif and the χ1 of the Phe side chain). The nomenclature is based on the regions of the Ramachandran map occupied by the XDF residues and the χ1 rotamer of the Phe residue. All active structures are “BLAminus”, while common inactive DFGin conformations are “BLBplus” and “ABAminus”. Type II inhibitors bind almost exclusively to the DFGout “BBAminus” conformation. In this paper, we present Kincore (http://dunbrack.fccc.edu/kincore), a web resource providing access to the conformational assignments based on our clustering along with labels for ligand types (Type I, Type II, etc.) bound to each kinase chain in the PDB.
The data are annotated with several properties including PDB ID, UniProt ID, gene, protein name, phylogenetic group, spatial and dihedral labels for the orientation of DFGmotif residues, C-helix disposition, ligand name and ligand type. The user can browse and query the database using these attributes individually or perform an advanced search using a combination of them, such as a phylogenetic group with a specific conformational label and ligand type. The user can also determine the spatial and dihedral labels for a structure with unknown conformation using the web server and standalone program. The entire database can be downloaded as text files and as structure files in PyMOL sessions and mmCIF format. We believe that Kincore will help in understanding conformational dynamics of these proteins and guide development of inhibitors targeting specific states.

Introduction
Protein kinases are catalytic molecular switches that regulate signaling pathways in cells by phosphorylating protein substrates [1]. Their catalytic activity is achieved by a remarkably flexible active site, which is observed in multiple different conformations when the enzyme is in an inactive state but adopts a unique conformation in the catalytically active state. The dysregulation of this mechanism due to a mutation or upregulation of expression can lead to a variety of diseases including cancer [2, 3]. Protein kinases are widely studied as drug targets, with molecules designed to inhibit the active state or stabilize a specific inactive state [4, 5].
Thus, understanding the conformational dynamics of protein kinases is critical for the development of better drugs and for novel biological insights. There are 484 typical protein kinase genes with 497 kinase domains in the human genome [6, 7]. This number includes several pseudokinases but excludes atypical protein kinase genes, some of which are distantly related to the typical protein kinase fold [7]. Among the 497 domains, the structures of 283 have currently been determined experimentally, either in apo form or in complex with ligands. The protein kinase fold consists of an N-terminal lobe, which is formed by a five-stranded beta sheet and one alpha helix called the C-helix, and a C-terminal lobe, which consists of five or six alpha helices. The two lobes form a deep cleft in the middle region of the protein, creating the ATP-binding active site. This site is surrounded by several structural elements critical for catalysis, which occupy a unique conformation in the active state and exhibit flexibility across different inactive states of the enzyme. One of the most critical elements is the activation loop, which adopts a unique extended orientation in the active state of the kinase and multiple types of folded conformations in inactive states. It begins with a conserved motif called the DFGmotif (Asp-Phe-Gly), whose orientation is tightly coupled with the active/inactive status of the protein. In addition, the C-helix adopts an inward disposition in the active state while exhibiting a range of positions and orientations in other states. The DFGmotif conformations were previously described using a simple convention of DFGin and DFGout. The DFGin group consists of all the conformations in which DFG-Asp points into the ATP pocket and DFG-Phe is adjacent to the C-helix. The structures solved in the active state conformation of the enzyme form a subset of this category.
In DFGout conformations, the DFG-Asp and DFG-Phe residues swap their positions so that DFG-Asp is removed from the ATP binding site and replaced with DFG-Phe. All the Type II inhibitors bind to DFGout conformations [8]. The DFGin and DFGout groups, however, provide only a broad description of a more complex conformational landscape [9, 10]. In our previous work, we developed a scheme for clustering and labeling different conformations of protein kinase structures [11]. Our clustering scheme is based on the spatial location and the backbone and side-chain dihedrals of the conserved DFGmotif in the activation loop. We clustered all the conformations into three spatial groups (DFGin, DFGinter, DFGout) based on the proximity of the DFG-Phe side chain to two different residues in the N-terminal domain. Within these groups, we further clustered the structures by the dihedral angles that determine the location of the DFG-Phe side chain: the backbone dihedrals of the X, D and F residues (where X is the residue before the DFGmotif) and the χ1 dihedral angle of the Phe side chain. The kinase states are therefore named after the region of the Ramachandran map occupied by the X, D, and F residues (A for alpha, B for beta, L for left-handed) and the Phe χ1 rotamer (plus, minus, or trans for the +60°, -60°, or 180° conformations). As a result, among the DFGin structures, we distinguished between the catalytically active kinase conformation (labeled BLAminus) and five inactive conformations (BLBplus, BLBminus, BLBtrans, ABAminus, BLAplus).
Among DFGout structures, we identified one dominant conformation labeled BBAminus, which is strongly correlated with Type II kinase inhibitors, such as imatinib. Finally, among the small set of DFGinter structures, where the Phe side chain is intermediate between the DFGin and DFGout positions, we distinguished one cluster based on clustering the dihedral angles (BABtrans). Our nomenclature strongly correlates with other structural features associated with active and inactive kinases, such as the positions of the C-helix and the activation loop and the presence or absence of the N-terminal domain salt bridge. Since our clustering and nomenclature are based on backbone dihedrals, they are intuitive to structural biologists and easy to apply in a wide variety of experimental and computational studies, as demonstrated recently in identifying the conformation in the crystal structure of IRAK3 [12], molecular dynamics simulations of Abl kinase [13] and structural analyses of pseudokinases [14]. Developing small molecule inhibitors is one of the most common therapeutic strategies against protein kinases. These inhibitors occupy the ATP binding pocket and allosteric sites on the surface of the protein. Two approaches have been used to classify inhibitors: a) based on the region of the protein to which the inhibitor binds; b) based on the conformation of the protein to which it binds. The first approach was used by Dar and Shokat [15], who defined three types of inhibitors: Type I, inhibitors which bind to the adenosine pocket but do not require a specific conformation of structural elements including the C-helix and DFGmotif; Type II, inhibitors that occupy the adenosine pocket and induce DFGout conformations because they extend into the pocket adjacent to the C-helix occupied by DFG-Phe in DFGin structures; Type III, inhibitors that block kinase activity but without displacing ATP.
This classification was extended by Zuccotto and coworkers, who introduced Type I½ inhibitors as molecules which bind to the ATP region like Type I compounds but extend into the back cavity, making additional contacts with the residues involved in Type II binding [16]. Rauh et al. defined Type IV as the allosteric inhibitors which bind to a site distant from the ATP binding region, inducing an inactive conformation in the active site [17, 18]. van Linden et al. defined the ligand types by identifying three regions in the active site (a front cleft, the gate area, and the back cleft), which are further divided into subpockets [19], without the use of labels like Type I, II etc. Roskoski used the second approach and redefined all the inhibitors based on the conformation of the protein [20]. According to this scheme, Type I inhibitors bind only to the active conformation, Type I½ inhibitors bind to DFGin inactive conformations, and Type II inhibitors bind to the DFGout conformation. Each of these categories was divided into two subtypes, A and B. However, this scheme is inadequate because, as we have shown, some inhibitors such as Bosutinib and Sunitinib can bind to different conformations across proteins [11]. For example, according to Roskoski’s classification Sunitinib will be labeled Type I in 6NFZ_A (DFGin-BLAminus) and Type IIB in 3G0F_A (DFGout-BBAminus), even though it binds to the kinase domain in an identical manner in both. In this paper, we present the Kinase Conformation Resource, Kincore, a web resource which automatically collects and curates all protein kinase structures from the Protein Data Bank (PDB) and assigns conformational and inhibitor type labels. The website is designed so that the information for all
the structures can be accessed at once using one database table, and instances of it through individual pages for kinase phylogenetic groups, genes, conformational labels, PDB IDs, ligands and ligand types. The database can be searched using unique identifiers such as PDB ID or gene, and queried using a combination of attributes such as phylogenetic group, conformational label and ligand type. We also provide several options to download data: database tables as tab-separated files, and the kinase structures as PyMOL sessions and coordinate files in mmCIF format. The structures have been renumbered by UniProt and by our common numbering scheme, which is derived from our structure-based alignment of all 497 human protein kinase domains [7]. We have also developed a webserver and standalone program which can be used to determine the spatial and dihedral labels for a structure with unknown conformation. We automatically label ligand types based on the pockets to which an inhibitor binds, defined by specific residues in the kinase domain. Thus, we use five labels for different ligand types: Type I, bind to the ATP binding region only (both active and inactive DFGin states); Type I½, bind to the ATP binding region and extend into the back pocket (both active and inactive DFGin states); Type II, bind to the ATP binding region and extend to back pocket regions exposed only in DFGout structures; Type III, bind to the back pocket only without displacing ATP; and Allosteric, bind outside the active site cleft.

Results
Kincore provides conformational assignments and ligand type labels for protein kinase structures from the PDB. The current update contains structures from 283 kinase genes from humans (7129 chains) and from 55 genes (707 chains) from seven model organisms.
The protein kinase (PK) structures were identified from the PDB [21] with PSI-BLAST [22], using a kinase PSSM as the query (Methods). The PDB files are split by chain, renumbered by UniProt numbering [23, 24] and our common residue numbering scheme, and annotated with conformational and ligand type labels as described below. The conformational labels are assigned using the structural features and clusters described in our previous work [11]. The scheme assigns two types of labels to each chain:
1) A spatial label (DFGin, DFGinter, DFGout), assigned by applying distance cutoff criteria (Methods) to the distances of the DFG-Phe-CZ atom from the C atoms of two conserved residues: the strand 3-Lys involved in the N-terminal domain salt bridge formed in active kinase structures (and some inactive structures), and the residue four amino acids past the C-helix-Glu involved in the same salt bridge.
2) A dihedral label: the dihedral angles (φ, ψ of X-DFG, Asp, Phe and χ1 of Phe) of each chain in a spatial group are used to calculate the distance of the structure from the precomputed cluster centroids, and the chain is assigned the label of a cluster if its distance satisfies defined cutoff criteria (Methods).
All the kinase conformations are represented by a set of eight labels: DFGin-BLAminus, DFGin-BLAplus, DFGin-ABAminus, DFGin-BLBminus, DFGin-BLBplus, DFGin-BLBtrans; DFGout-BBAminus; DFGinter-BABtrans. The chains that do not satisfy the dihedral distance cutoff criteria for any cluster, or are missing some of the relevant coordinates, are labeled as ‘Unassigned’. Additionally, we have labeled the C-helix disposition by computing the distance between the C-helix-Glu-C atom and the B3-Lys-C atom (as a proxy for the conserved salt bridge interaction) and labeling it as C-helix-in or C-helix-out (Methods).
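The two-step labeling described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Kincore implementation: the distance thresholds, the angular distance formula, the dihedral cutoff, and the nominal Ramachandran region centers used as stand-in centroids are all placeholders, since the actual criteria and cluster centroids are given in Methods and in [11]. The function names are hypothetical.

```python
import math

# Nominal (phi, psi) centers for the A, B, L regions: illustrative assumptions,
# NOT the published cluster centroids.
REGION_CENTER = {"A": (-65.0, -40.0), "B": (-120.0, 140.0), "L": (60.0, 45.0)}
# Rotamer angles as named in the text (+60, -60, 180 degrees).
ROTAMER = {"plus": 60.0, "minus": -60.0, "trans": 180.0}

# The eight dihedral labels, grouped by spatial label (from the text).
LABELS = {
    "DFGin": ["BLAminus", "BLAplus", "ABAminus", "BLBminus", "BLBplus", "BLBtrans"],
    "DFGout": ["BBAminus"],
    "DFGinter": ["BABtrans"],
}


def spatial_label(d_helix, d_lys, near=11.0, far=14.0):
    """Spatial label from two Phe-CZ distances (in angstroms).

    d_helix: distance to the residue four past the C-helix Glu.
    d_lys:   distance to the strand 3 Lys.
    The thresholds `near` and `far` are placeholder values.
    """
    if d_helix <= near and d_lys >= near:
        return "DFGin"
    if d_helix > near and d_lys <= far:
        return "DFGout"
    if d_helix <= near and d_lys <= near:
        return "DFGinter"
    return "Unassigned"


def nominal_centroid(label):
    """Build a 7-angle stand-in centroid (phi/psi of X, Asp, Phe, then chi1)."""
    regions, rotamer = label[:3], label[3:]
    angles = []
    for region in regions:
        angles.extend(REGION_CENTER[region])
    angles.append(ROTAMER[rotamer])
    return angles


def dihedral_distance(a, b):
    """Mean of 2(1 - cos(delta)) over the angles; an assumed periodic metric."""
    terms = [2.0 * (1.0 - math.cos(math.radians(x - y))) for x, y in zip(a, b)]
    return sum(terms) / len(terms)


def dihedral_label(spatial, dihedrals, cutoff=0.45):
    """Nearest centroid within `cutoff` (placeholder), else 'Unassigned'."""
    if spatial not in LABELS or any(v is None for v in dihedrals):
        return "Unassigned"  # missing coordinates or no spatial group
    best = min(LABELS[spatial],
               key=lambda lab: dihedral_distance(dihedrals, nominal_centroid(lab)))
    if dihedral_distance(dihedrals, nominal_centroid(best)) <= cutoff:
        return best
    return "Unassigned"
```

In this sketch the spatial label narrows the candidate clusters first, so a chain is only ever compared against centroids from its own spatial group, mirroring the order of the two steps in the text.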
Figure 1: Representative protein kinase structure (3ETA_A) displaying the residues used to define inhibitor binding regions.

To assign labels to ligands, we have used specific residue positions to identify regions of the binding pocket: the ATP binding pocket (including the hinge residues), the back pocket and the Type II-only region (Figure 1). The structures are first renumbered by our common numbering scheme so that all the aligned residues have the same residue number across all the kinases. A ligand is then assigned a label based on its contacts with different binding regions. We have used the following five ligand type labels to annotate all the ligand-bound structures of protein kinases (Figure 1):
1. Type I: bind to the ATP binding region only
2. Type I½: bind to the ATP binding region and extend into the back pocket (subdivided as Type I½-front and Type I½-back depending on contact with N-terminal or C-terminal residues of the C-helix, respectively)
3. Type II: bind to the ATP binding region and extend into the back pocket and Type II-only region
4. Type III: bind only in the back pocket without displacing ATP
5. Allosteric: any pocket outside the ATP-binding region
The distribution of different ligand types across kinase conformations is provided in Table 1. It shows that Type I and Type I½ are the most commonly observed inhibitors. However, except Type II, all the inhibitor types are observed in complex with multiple conformational states.
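The five ligand type rules listed above amount to a lookup on which binding regions a ligand contacts. The following Python sketch illustrates that logic only. The residue numbers in `EXAMPLE_POCKETS` are hypothetical placeholders (the real pocket residues are defined in the paper's common numbering scheme and shown in Figure 1), the function names are invented for illustration, the front/back subdivision of Type I½ is omitted, and the `"Unclassified"` fallback for combinations the list does not cover is an assumption.

```python
# Hypothetical residue numbers per region, for illustration only.
EXAMPLE_POCKETS = {
    "atp": {10, 11, 12},
    "back": {20, 21},
    "type2_only": {30, 31},
}

# Region combinations taken from the five definitions in the text.
RULES = {
    frozenset({"atp"}): "Type I",
    frozenset({"atp", "back"}): "Type I½",
    frozenset({"atp", "back", "type2_only"}): "Type II",
    frozenset({"back"}): "Type III",
    frozenset(): "Allosteric",  # no contact with the ATP-site regions
}


def contacted_regions(contact_residues, pockets):
    """Return the names of regions sharing at least one residue with the contacts."""
    contacts = set(contact_residues)
    return {name for name, residues in pockets.items() if residues & contacts}


def ligand_type(regions):
    """Map a set of contacted regions to a ligand type label."""
    # Combinations not covered by the five definitions are left unclassified
    # (an assumption of this sketch).
    return RULES.get(frozenset(regions), "Unclassified")
```

Because the structures are renumbered so that aligned residues share a number across kinases, a single pocket definition like `EXAMPLE_POCKETS` could in principle be reused for every chain; for example, `ligand_type(contacted_regions([10, 20], EXAMPLE_POCKETS))` evaluates to `"Type I½"`.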
Table 1: Distribution of ligand types across protein kinase conformations (number of chains).

Spatial label  Dihedral label     Type I       Type I½ (front+back)  Type II    Type III   Allosteric  Total (%)
DFGin          BLAminus (active)  2926         196                   -          12         199         3333 (55.0)
DFGin          BLBplus            443          76                    -          59         15          593 (9.8)
DFGin          ABAminus           479          36                    -          1          19          535 (8.8)
DFGin          BLBminus           162          11                    -          5          10          188 (3.1)
DFGin          BLBtrans           175          6                     -          -          5           186 (3.1)
DFGin          BLAplus            91           86                    -          -          1           178 (2.9)
DFGin          Noise              282          38                    -          1          18          339 (5.6)
DFGout         BBAminus           20           9                     288        69         24          410 (6.8)
DFGout         Noise              43           17                    79         26         12          177 (2.9)
DFGinter       BABtrans           14           1                     -          -          -           15 (0.2)
DFGinter       Noise              89           16                    -          3          3           111 (1.8)
Total (%)                         4724 (77.9)  492 (8.1)             367 (6.1)  176 (2.9)  306 (5.0)

Many inhibitors are observed in multiple crystal structures bound to one or more different kinases. We counted the number of unique inhibitors that occur bound to kinase chains in two (or more) states across entries in the PDB. Table 2 provides the number of unique inhibitors that occur in each pair of states (excluding the unclassified spatial or dihedral labels). The numbers along the diagonal are the counts of unique inhibitors observed in at least one structure of the given state. A total of 259 inhibitors occur in two or more kinase states.

Table 2: Counts of inhibitors that are bound to chains in two or more states.

State              DFGin-BLAminus  DFGin-ABAminus  DFGin-BLBplus   DFGin-BLBminus  DFGin-BLBtrans  DFGin-BLAplus   DFGout-BBAminus  DFGinter-BABtrans
DFGin-BLAminus     1686
DFGin-ABAminus     48              334
DFGin-BLBplus      39              11              344
DFGin-BLBminus     26              9               11              210
DFGin-BLBtrans     29              4               11              4               134
DFGin-BLAplus      15              6               13              8               2               107
DFGout-BBAminus    7               3               2               2               2               3               254
DFGinter-BABtrans  6               2               4               4               1               3               1                8

Numbers along the diagonal provide the number of unique inhibitors in each state. The off-diagonal values are the number of unique inhibitors bound to chains in the two states shown in the row and column headers.
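The pairwise counting behind Table 2 can be expressed compactly. The sketch below assumes the input is a sequence of `(ligand_id, state)` pairs, one per ligand-bound chain, with unclassified states already filtered out by the caller; the function names and the toy ligand identifiers in the usage note are invented for illustration and are not database entries.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def state_pair_counts(records):
    """Count unique ligands per state (diagonal) and per pair of states.

    records: iterable of (ligand_id, state) tuples, one per bound chain.
    Returns a dict keyed by (state_a, state_b) with state_a <= state_b.
    """
    states_by_ligand = defaultdict(set)
    for ligand, state in records:
        states_by_ligand[ligand].add(state)  # duplicates collapse in the set

    counts = defaultdict(int)
    for states in states_by_ligand.values():
        # (a, a) pairs give the diagonal; (a, b) pairs give the off-diagonal.
        for a, b in combinations_with_replacement(sorted(states), 2):
            counts[(a, b)] += 1
    return dict(counts)


def multi_state_ligands(records):
    """Number of unique ligands observed in two or more states."""
    states_by_ligand = defaultdict(set)
    for ligand, state in records:
        states_by_ligand[ligand].add(state)
    return sum(1 for states in states_by_ligand.values() if len(states) >= 2)
```

With toy records such as `[("LIG1", "DFGin-BLAminus"), ("LIG1", "DFGout-BBAminus"), ("LIG2", "DFGin-BLAminus")]`, the ligand `LIG1` contributes to both diagonal cells and to the shared off-diagonal cell, while `LIG2` contributes only to one diagonal cell.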
Website
The web pages on Kincore are designed in a common format across the website to organize the information in a consistent and uniform way. Each page retrieved from the database is organized in two parts. The top part provides a summary of the number of structures in the queried groups or conformations, with representative structures from each category listed and displayed. This is followed by a table from the database with each unique PDB chain as a row, providing information including conformational labels, C-helix position, kinase family, gene name, UniProt ID, ligand PDB ID, and ligand type. The kinase group, gene name, PDB code, conformational labels, ligand name and ligand type are hyperlinked to their specific pages. Each page also contains three tabs on the top to list ‘Human’, ‘Non-human’ and ‘All’ structures. Buttons on each page download the database table as a tab-separated file, and all of the kinase structures on the page as PyMOL sessions and renumbered coordinate files.

Figure 2: Snapshot of database table displaying entries for PDB chains on Browse page.

The information from the database can be accessed using two main pages:
1. Browse page: This page provides statistics and labels for all the kinase structures in the database (Figure 2). The ‘Summary’ table on top of the page displays the distribution of protein kinase chains in the PDB across conformational states and phylogenetic groups.
This is followed by the ‘Database’ table, which contains annotations for all individual PDB chains retrieved from the database. The entire table, with additional information such as resolution, R-factor, and activation loop residues, can be downloaded as a tab-separated file.
2. Search page: This page offers two options to query the database:
• Unique identifier: The database can be queried by PDB entry code (e.g., 2GS6), UniProt identifier (e.g., EGFR_HUMAN), gene name (e.g., EGFR), and ligand identifier (e.g., STI). The result will take the user to the page dedicated to the specific query item. For reference, the list of all genes in the database is provided through a ‘Help’ button above the search box.
• Advanced query: The database can be queried by selecting a kinase phylogenetic group, conformational label, and ligand type from drop-down menus. If the ‘All’ option is selected for all three categories, then the entire database table can be accessed at once. A subset of chains in the database can be retrieved by selecting a specific group name, conformational label, and ligand type; for example, selecting TYR group + DFGout-BBAminus + Type II ligand type will retrieve all the structures which have these three annotations. If all the structures in complex with a Type I½ ligand are desired, then the user can select ‘All group’ + ‘All conformations’ + ‘Type I½ ligand’.
The website contains several webpages which are dynamically generated and retrieve queried instances of the database.
These pages can be accessed as a result of individual queries or by clicking on the hyperlinks on the Browse page table. They are:
1. Phylogenetic group page: Typical protein kinases are divided into nine phylogenetic groups: AGC, CAMK, CMGC, CK1, NEK, RGC, STE, TKL and TYR [ref]. Each group is assigned a page on Kincore displaying information about the structures in that group. On each page, the Summary table provides the number of kinase chains in the group across different conformations with their representative structures (best resolution and least missing residues). These representative structures are also displayed on the page in 3D using NGL viewer.
2. Gene page: A page for each kinase gene in the PDB can be accessed through the hyperlinks on the Browse page or by the unique identifier Search feature, and contains information for all the structures of a specific gene. The summary table on the page gives the number of structures available and their distribution across different conformations, with a representative example for each. It also provides hyperlinks to the phylogenetic group page (described above) for the gene and to the corresponding protein entry on the UniProt website. In addition to the data provided on the Browse page, the Database table on this page also contains, for each chain, information on mutations and phosphorylation, the total length of the structure, and the number of residues resolved in the activation loop.
3. PDB page: The PDB page provides information on individual PDB entries and can be accessed by the hyperlinks on the Browse page or by the unique identifier Search feature (Figure 3). Each PDB entry is annotated with information on gene, protein name, phylogenetic group, UniProt ID, organism, domain boundary, resolution, conformation, and ligand type labels for every chain. Additionally, the page contains a sequence feature displaying the UniProt sequence of the protein in the structure.
The residues which are unresolved in the structure are displayed in lower case letters to distinguish them from residues with coordinates in the entry. Further, mutated and phosphorylated residues are shown in red and green, respectively.
4. Ligand page: The ligand page provides access to all chains in complex with a specific ligand. For example, all the structures in complex with ATP can be retrieved by querying for ‘ATP’ on the Search page or clicking on the hyperlinks on the Browse page. The Summary table provides the number of chains in complex with the ligand across different conformations. Like other pages, the Database table provides the list of all the PDB chains with conformational labels and ligand annotations. This page facilitates the comparison of conformations and ligand binding modes across structures from one or multiple kinases in complex with the same ligand. For example, Bosutinib (PDB identifier DB8), which is an FDA-approved drug, is found in complex with structures from 10 kinases in 5 different conformations (Figure 4).

Figure 3: Snapshot of PDB page with the sequence feature.

Alignment Page
In our previous work, we developed a structure-based multiple sequence alignment (MSA) for 497 human protein kinase domains [7]. This alignment contains 17 blocks of aligned regions conserved across human kinases, with intermittent regions of low sequence similarity in lower case letters. The alignment is annotated with gene name, UniProt ID, and protein residue numbers.
On Kincore, we provide access to this MSA through the Alignment page, which contains basic information about the alignment with a table of conserved regions across human kinases. The alignment can be visualized inside the browser window through the ‘Open in browser’ button, created using Jalview’s BioJS feature. This feature provides multiple options for quick analysis, including buttons to filter, color, or sort the sequences within the browser window. The alignment is also available to download as a Jalview session as well as Clustal- and FASTA-formatted files.

Phylogeny Page
Using our multiple sequence alignment, we also updated the protein kinase phylogenetic tree [7]. This tree was used to assign a set of ten kinases previously categorized as “OTHER” to the CAMK group, consisting of Aurora kinases, Polo-like kinases, and calcium/calmodulin-dependent kinase kinases. On our resource the tree can be accessed through the Phylogeny page. It provides basic information about the tree, the number of kinase genes and domains in different phylogenetic groups, and links to visualize and download the tree.

Figure 4: Snapshot of ligand page displaying Bosutinib (PDB ligand identifier DB8) in complex with structures from 10 kinase genes and in 5 different conformations.

Download Options
We provide multiple data download options on Kincore to assist the user in different kinds of analysis. These download options are created for all the pages or any instance of the database retrieved by a query, e.g. structures of a specific gene, ligand etc.
or structures from an advanced query like TYR kinases with DFGout state and Type II ligands. These options are:
1. Coordinate Files: We provide structure files in mmCIF and PDB format with three different numbering systems: the original author residue numbering; renumbered by UniProt protein sequence; and a common residue numbering scheme derived from our multiple sequence alignment of kinases [7].
2. PyMOL Sessions: We provide PyMOL [25] sessions for the structures retrieved from any query from the database. Two PyMOL sessions are provided for each query: All chains and Representative chains (best resolution, least missing residues). Across all the PyMOL sessions, the chains are labeled in a consistent format as PhyloGroup_Gene_SpatialLabel_DihedralLabel_PDBidChainid (e.g., TYR_EGFR_DFGin_BLAminus_2GS6A). Additionally, we provide PyMOL scripts (.pml format) which the user can download and run on a local machine to create the sessions.
3. Database Files: We provide the information retrieved from the database on every page as tab-separated files, which can be downloaded using the ‘Database table as tsv’ button. When clicked on the ‘Browse’ page, this button will download the information in the entire database in one file. On the other pages specific for a gene or conformation, this file will contain only the subset of the information from the database which is queried.
The tsv file has the following header: “Organism Group Gene UniprotID PDB Method Resolution Rfac FreeRfac SpatialLabel DihedralLabel C-helix Ligand LigandType DFG_Phe Edia_X_O Edia_Asp_O Edia_Phe_O Edia_Gly_O ProteinName”
4. Bulk download: The ‘Download’ page provides different options to download structure files and PyMOL sessions in bulk. The page is divided into two sections: coordinate files and PyMOL sessions. The user can download coordinate files for all the structures in one zip folder or in subsets by specific phylogenetic group, gene, and conformational label. The tab on the top of the page gives the option to download files with original author residue numbering, or renumbered by UniProt protein sequence and by common residue numbering from our alignment. The second part of the ‘Download’ page provides PyMOL sessions for phylogenetic groups, genes and ligands.

We have developed a webserver to which the user can upload a kinase structure file in PDB or mmCIF format to determine its conformation. The program extracts the sequence from the structure file and identifies residue positions by aligning it with precomputed HMM profiles of kinase groups. It then determines the conformation of the protein by assigning Spatial and Dihedral labels (Methods). On the output page, the server prints the kinase phylogenetic group which is the closest match to the sequence of the input structure, the dihedrals of the X-DFG, DFG-Asp and DFG-Phe residues, the spatial group, the dihedral label and the C-helix disposition.
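As an illustration of working with the tab-separated database table described under Database Files, the Python sketch below loads such a file and reproduces an advanced-query style filter offline. The column names `Group`, `SpatialLabel`, `DihedralLabel` and `LigandType` come from the header quoted above; the demo rows, the exact cell values (for instance how ligand types are spelled in the file), and the function names are assumptions made for this example.

```python
import csv
import io


def load_table(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def query(rows, criteria):
    """Keep rows matching every column/value pair in `criteria`.

    A value of None or 'All' leaves that column unconstrained, mirroring
    the 'All' option of the drop-down menus.
    """
    wanted = {col: val for col, val in criteria.items() if val not in (None, "All")}
    return [row for row in rows
            if all(row.get(col) == val for col, val in wanted.items())]


# Invented demo rows using a subset of the documented columns.
DEMO_TSV = (
    "Group\tGene\tPDB\tSpatialLabel\tDihedralLabel\tLigandType\n"
    "TYR\tGENE1\t0AAAA\tDFGout\tBBAminus\tType II\n"
    "TYR\tGENE1\t0BBBA\tDFGin\tBLAminus\tType I\n"
    "AGC\tGENE2\t0CCCA\tDFGin\tBLBplus\tType I½\n"
)

rows = load_table(io.StringIO(DEMO_TSV))
hits = query(rows, {"Group": "TYR", "SpatialLabel": "DFGout",
                    "DihedralLabel": "BBAminus", "LigandType": "Type II"})
```

Passing the criteria as a dictionary rather than keyword arguments keeps columns such as `C-helix`, whose names are not valid Python identifiers, usable as filters.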
We have written a standalone program in Python3 that the user can download to assign conformational labels to an unannotated structure. The program can be run in two ways: a) with the flag align=True, the sequence is aligned with precomputed HMM profiles to identify the residue numbers for B3-Lys, C-helix-Glu, and DFG-Phe. The program then computes inter-residue distances and dihedral angles to label the conformation of the structure (Methods); b) with the flag align=False, no alignment with an HMM profile is done, and the residue numbers are provided by the user. This option is faster and is more useful for identifying conformations in a large number of structures generated from molecular dynamics simulations.

Discussion

Experimentally determined protein kinase structures, in apo form or in complex with a ligand, display an extremely flexible active site. However, examining the conformational dynamics of kinases and their role in ligand binding requires combining two pieces of information: the conformational state of the protein and the type of ligand in the complex. Currently, there are two main resources, Kinametrix and KLIFS, that address protein kinase conformations and inhibitors. However, they provide either conformational assignments or ligand type information, but not both.

Kinametrix (http://kinametrix.com/) offers a simple scheme of DFGin and DFGout coupled with C-helix conformation [26]. The resource does not provide information on ligands and lacks any download options for structures. It has not been updated with structures since May 2017.

KLIFS (https://klifs.vu-compmedchem.nl/index.php) also offers a simple DFGin and DFGout classification [19, 27] and does not distinguish active and inactive DFGin structures. This resource is more focused on providing information about ligand binding to kinases.
It is regularly updated and allows bulk downloads of the results of each search.

Kincore fills a gap by providing a sophisticated scheme for kinase conformations together with ligand type labels. The information can be accessed with individual queries, for example, getting a list of all chains in complex with a Type II ligand, or with a combination of queries such as AGC group kinases + DFGin-BLBplus conformation + Type I½ ligand.

A feature that distinguishes Kincore from many structural bioinformatics resources is the ability to download coordinate files for the result of any query in one click. For example, a search for AURKA produces a list of 191 protein chains from 154 PDB entries. These can be downloaded in mmCIF format with one click, with the original PDB residue numbering, renumbered according to the UniProt sequence, or in our common residue numbering scheme from the kinase multiple sequence alignment. Each coordinate file is labeled by spatial label and dihedral angle cluster, e.g. CAMK_AURKA_DFGin_BLAminus_1OL6A.cif. A user can also download a PyMOL session file with all of the structures for a given query.

In addition, an important part of our resource is the web server and standalone program, which can label the unknown conformation of a new structure. The standalone program can run on structure files with multiple chains and models. We believe it will be extremely useful for batch processing the structures generated by a molecular modeling protocol or molecular dynamics simulation.
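The chain and file labels described in the text follow a fixed field order (PhyloGroup_Gene_SpatialLabel_DihedralLabel_PDBidChainid), so they can be split back into their parts when post-processing a bulk download. The Python sketch below is one way to do this. It assumes that no field contains an underscore and that the PDB identifier is the first four characters of the last field. The names ChainLabel and parse_chain_label are illustrative and are not taken from the Kincore code.

```python
from typing import NamedTuple


class ChainLabel(NamedTuple):
    group: str
    gene: str
    spatial: str
    dihedral: str
    pdb_id: str
    chain_id: str


def parse_chain_label(name: str) -> ChainLabel:
    """Split 'Group_Gene_Spatial_Dihedral_PDBidChainid' (optionally with a
    .cif or .pdb extension) into its fields."""
    stem = name
    for ext in (".cif", ".pdb"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields in {name!r}")
    group, gene, spatial, dihedral, entry = parts
    if len(entry) < 5:
        raise ValueError(f"expected a 4-character PDB id plus chain id in {entry!r}")
    # Assumption: 4-character PDB id, remaining characters are the chain id.
    return ChainLabel(group, gene, spatial, dihedral, entry[:4], entry[4:])


# Both example names are taken from the text.
print(parse_chain_label("TYR_EGFR_DFGin_BLAminus_2GS6A"))
print(parse_chain_label("CAMK_AURKA_DFGin_BLAminus_1OL6A.cif"))
```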
Several experimental and computational studies have applied the nomenclature from our previous work in structural analyses of kinases [11]. Lange and colleagues solved the crystal structure of the pseudokinase IRAK3 (PDB ID 6RUU) and identified its conformation as BLAminus, similar to the active state of a typical protein kinase [12]. Paul et al. studied the dynamics of ABL kinase by various simulation techniques with Markov state models and analyzed the transitions between different metastable states using our nomenclature [13]. Kirubakaran et al. identified the catalytically primed structures (BLAminus) from the PDB to create a comparative modeling pipeline for the ligand-bound structures of CDK kinases [28]. Paul and Srinivasan performed structural analyses of pseudokinases in Arabidopsis thaliana and compared them with typical protein kinases by applying our conformational labels [14]. We therefore believe that the Kincore database and webserver will benefit a larger research community by making the labeled kinase structures more accessible and by facilitating the identification of kinase conformations in a wide range of studies.

Methods

Identifying and renumbering protein kinase structures

The database contains protein kinase domains from Homo sapiens and seven model organisms: Bos taurus, Danio rerio, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Sus scrofa, and Xenopus laevis. To identify structures from these organisms, the sequence of human Aurora A kinase (residues 125-391) was used to construct a PSSM from three iterations of NCBI PSI-BLAST on the PDB with default cutoff values [22].
This PSSM was used as the query to run command-line PSI-BLAST on the pdbaa file from the PISCES server (http://dunbrack.fccc.edu/pisces) [29]. pdbaa contains the sequence of every chain in every asymmetric unit of the PDB in FASTA format with resolution, R-factors, and SwissProt identifiers (e.g. AURKA_HUMAN). A total of 4908 PDB entries with 7277 kinase chains were identified. Some poorly aligned kinases and distantly related non-kinase proteins that are homologous to kinases were removed.

The structure files were split into individual kinase chains in the asymmetric unit and renumbered by the UniProt protein numbering scheme. The mapping between PDB author numbering and UniProt numbering was obtained from the Structure Integration with Function, Taxonomy and Sequence (SIFTS) database [24]. The SIFTS files were also used to extract mutation, phosphorylation, and missing residue annotations.

The structure files were also renumbered by a common residue numbering scheme using our protein kinase multiple sequence alignment. Each residue in a kinase domain was renumbered by its column number in the alignment, so that aligned residues across different kinase sequences get the same residue number. For example, in these renumbered structure files the residue numbers of the DFG motif are 1338-1340 in all kinases. The conserved motifs for all the structures were identified from the same alignment.
Assigning conformational labels

Each kinase chain is assigned a spatial group and a dihedral label using our previous clustering scheme as a reference [11]. Our clustering scheme has three spatial groups: DFGin, DFGinter, and DFGout. These are subdivided into dihedral clusters: DFGin (BLAminus, BLAplus, ABAminus, BLBminus, BLBplus, BLBtrans); DFGinter (BABtrans); and DFGout (BBAminus). To determine the spatial group for each chain, the location of DFG-Phe in the active site was identified using the following criteria:

1. D1 ≤ 11 Å and D2 ≥ 11 Å: DFGin
2. D1 > 11 Å and D2 ≤ 14 Å: DFGout
3. D1 ≤ 11 Å and D2 ≤ 11 Å: DFGinter

where D1 is the distance from αC-Glu(+4)-Cα to DFG-Phe-Cζ and D2 is the distance from β3-Lys-Cα to DFG-Phe-Cζ. Any structure not satisfying the above criteria is considered an outlier and assigned the spatial label “None.”

To identify the dihedral label, the DFG-Phe rotamer type in each chain was first identified (minus, plus, trans). The chains for each rotamer type were then represented with a set of 6 backbone (Φ, Ψ) dihedrals from the X-DFG, DFG-Asp, and DFG-Phe residues. Using these dihedrals, the distance of each kinase chain was calculated from the precomputed centroid of each cluster with the same rotamer type in the given spatial group. For example, the dihedral distance for all DFGin structures with a Phe-minus rotamer was computed against BLAminus, ABAminus, and BLBminus. The dihedral angle distance is computed using the following formula:

$$D(i,j) = \frac{1}{6}\left( D(\phi_i^X, \phi_j^X) + D(\psi_i^X, \psi_j^X) + D(\phi_i^D, \phi_j^D) + D(\psi_i^D, \psi_j^D) + D(\phi_i^F, \phi_j^F) + D(\psi_i^F, \psi_j^F) \right)$$

where

$$D(\theta_1, \theta_2) = 2\left(1 - \cos(\theta_1 - \theta_2)\right)$$

A chain is assigned to a dihedral label if its distance from that cluster centroid is less than 0.45.
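The spatial criteria and the dihedral distance above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the Kincore program. Angles are assumed to be in degrees. The spatial criteria are checked in the order listed, so a structure with D2 exactly 11 Å and D1 ≤ 11 Å is labeled DFGin. When more than one centroid is passed, the nearest one below the cutoff is chosen, which is also an assumption. The function names are hypothetical, and the caller is expected to pass only the centroids for the matching spatial group and rotamer type.

```python
import math

DIHEDRAL_CUTOFF = 0.45  # threshold stated in the text


def spatial_label(d1: float, d2: float) -> str:
    """Spatial group from D1 and D2 (in Angstrom), checked in the listed order."""
    if d1 <= 11.0 and d2 >= 11.0:
        return "DFGin"
    if d1 > 11.0 and d2 <= 14.0:
        return "DFGout"
    if d1 <= 11.0 and d2 <= 11.0:
        return "DFGinter"
    return "None"


def angle_distance(theta1_deg: float, theta2_deg: float) -> float:
    """D(theta1, theta2) = 2 * (1 - cos(theta1 - theta2)), angles in degrees."""
    return 2.0 * (1.0 - math.cos(math.radians(theta1_deg - theta2_deg)))


def dihedral_distance(a, b) -> float:
    """Mean angle distance over the six (phi, psi) values of X, D and F."""
    if len(a) != 6 or len(b) != 6:
        raise ValueError("expected six dihedral angles per structure")
    return sum(angle_distance(x, y) for x, y in zip(a, b)) / 6.0


def dihedral_label(angles, centroids) -> str:
    """Label of the nearest centroid closer than the cutoff, else 'None'.

    `centroids` maps a cluster name to six angles; the caller supplies only
    the clusters that share the chain's spatial group and rotamer type.
    """
    best_label, best = "None", DIHEDRAL_CUTOFF
    for label, centroid in centroids.items():
        d = dihedral_distance(angles, centroid)
        if d < best:
            best_label, best = label, d
    return best_label


# Placeholder centroid with arbitrary angles, for illustration only;
# these are NOT the published cluster centroids.
placeholder = {"clusterA": [-120.0, 150.0, -60.0, -40.0, -90.0, 20.0]}
print(spatial_label(8.0, 15.0))
print(dihedral_label([-118.0, 148.0, -62.0, -38.0, -92.0, 22.0], placeholder))
```

Because the distance uses the cosine of the angle difference, it is periodic, so angles of -180° and 180° are treated as identical.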
Chains that have any motif residue missing or are distant from all the cluster centroids are assigned the dihedral label “None.”

The C-helix disposition is determined using the distance between the Cβ atoms of B3-Lys and C-helix-Glu(+4). A distance of <10 Å indicates that the salt bridge between the two residues is present, suggesting a C-helix-in conformation. A value of >10 Å suggests a C-helix-out conformation.

Ligand classification

The different regions of the ATP binding pocket are identified by specific residues using our common numbering scheme (Supplementary figure 1):
• ATP binding region: hinge residues, residues 426-428
• Back pocket: C-helix and partial regions of the B4 and B5 strands, DFG motif backbone; residues 106-147, 150-152, 184, 187-195, 420-422 and 1337-1339
• Type II-only pocket: exposed only in the DFGout conformation; residues 153, 149, 959 and 1011

A contact between ligand atoms and protein residues is defined if the distance between any two atoms is ≤ 4.5 Å (hydrogens not included). Based on these contacts, we have labeled the ligand types as follows:

1. Allosteric: any small molecule in the asymmetric unit whose minimum distance from the hinge region and the C-helix-Glu(+4) residue is greater than 6.5 Å.
2. Type I½: subdivided into Type I½-front (at least three contacts in the back pocket and at least one contact with the N-terminal region of the C-helix) and Type I½-back (at least three contacts in the back pocket but no contact with the N-terminal region of the C-helix).
3.
Type II: at least three contacts in the back pocket and at least one contact in the Type II-only pocket.
4. Type III: minimum distance from the hinge greater than 6 Å and at least three contacts in the back pocket.
5. Type I: all ligands that do not satisfy the above criteria.

Identify conformation using webserver

The program uses the structure file uploaded by the user to extract the sequence of the protein. It aligns the sequence with precomputed HMM profiles of kinase phylogenetic groups (e.g. AGC.hmm, CAMK.hmm). The alignment with the best score is identified and used to determine the positions of the DFG motif, B3-Lys, and C-helix-Glu(+4) residues. The program then computes the distances between specific atoms and the dihedrals to identify spatial and dihedral labels using the assignment method described above.

Standalone program

The standalone program is written in Python 3.7. It is available for download from https://github.com/vivekmodi/Kincore-standalone and can be run in a terminal window on a macOS or Linux machine. The user can provide an individual .pdb or .cif file (optionally compressed as .gz) or a list of files as input. It identifies the unknown conformation from a structure file in the same way as described for the webserver.

Software and libraries used

All the scripting and analysis was done using Python3 and depends on the Pandas (https://pandas.pydata.org) and Biopython [30] libraries.

Website and Database

Kincore is developed using the Flask web framework (https://flask.palletsprojects.com/en/1.1.x/). The webpages are written in HTML5 with style elements created using Bootstrap v4.5.0
(https://getbootstrap.com/). The 3D visualization is done using NGL Viewer (http://nglviewer.org/ngl/api/). PyMOL (v2.3) is used for creating download sessions [25]. The entire application is deployed on the internet using the Apache2 webserver.

Acknowledgements

The authors thank Maxim Shapovalov for his help in deploying the server. This work was funded by NIH grant R35 GM122517 to R.L.D.

References

1. Adams, J.A., Kinetic and catalytic mechanisms of protein kinases. Chem Rev, 2001. 101(8): p. 2271-90.
2. Blume-Jensen, P. and T. Hunter, Oncogenic kinase signalling. Nature, 2001. 411(6835): p. 355-365.
3. Lahiry, P., et al., Kinase mutations in human disease: interpreting genotype-phenotype relationships. Nat Rev Genet, 2010. 11(1): p. 60-74.
4. Zhang, J., P.L. Yang, and N.S. Gray, Targeting cancer with small molecule kinase inhibitors. Nat Rev Cancer, 2009. 9(1): p. 28-39.
5. Ferguson, F.M. and N.S. Gray, Kinase inhibitors: the road ahead. Nature Reviews Drug Discovery, 2018. 17(5): p. 353-377.
6. Manning, G., et al., The protein kinase complement of the human genome. Science, 2002. 298(5600): p. 1912-34.
7. Modi, V. and R.L. Dunbrack, Jr., A Structurally-Validated Multiple Sequence Alignment of 497 Human Protein Kinase Domains. Sci Rep, 2019. 9(1): p. 19790.
8. Vijayan, R., et al., Conformational analysis of the DFG-out kinase motif and biochemical profiling of structurally validated type II inhibitors. Journal of medicinal chemistry, 2015. 58(1): p. 466-479.
9. Möbitz, H., The ABC of protein kinase conformations. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, 2015. 1854(10): p. 1555-1566.
10. Ung, P.M.-U., R. Rahman, and A.
Schlessinger, Redefining the protein kinase conformational space with machine learning. Cell chemical biology, 2018. 25(7): p. 916-924.e2.
11. Modi, V. and R.L. Dunbrack, Defining a new nomenclature for the structures of active and inactive kinases. Proceedings of the National Academy of Sciences, 2019. 116(14): p. 6818-6827.
12. Lange, S.M., et al., Dimeric Structure of the Pseudokinase IRAK3 Suggests an Allosteric Mechanism for Negative Regulation. Structure, 2020.
13. Paul, F., Y. Meng, and B. Roux, Identification of Druggable Kinase Target Conformations Using Markov Model Metastable States Analysis of apo-Abl. J Chem Theory Comput, 2020. 16(3): p. 1896-1912.
14. Paul, A. and N. Srinivasan, Genome-wide and structural analyses of pseudokinases encoded in the genome of Arabidopsis thaliana provide functional insights. Proteins, 2020. 88(12): p. 1620-1638.
15. Dar, A.C. and K.M. Shokat, The evolution of protein kinase inhibitors from antagonists to agonists of cellular signaling. Annu Rev Biochem, 2011. 80: p. 769-95.
16. Zuccotto, F., et al., Through the "gatekeeper door": exploiting the active kinase conformation. J Med Chem, 2010. 53(7): p. 2681-94.
17. Gavrin, L.K. and E. Saiah, Approaches to discover non-ATP site kinase inhibitors. MedChemComm, 2013. 4(1): p. 41-51.
18. Fang, Z., C. Grutter, and D. Rauh, Strategies for the selective regulation of kinases with allosteric modulators: exploiting exclusive structural features. ACS Chem Biol, 2013. 8(1): p. 58-70.
19.
van Linden, O.P., et al., KLIFS: A knowledge-based structural database to navigate kinase-ligand interaction space. J Med Chem, 2013.
20. Roskoski, R., Jr., Classification of small molecule protein kinase inhibitors based upon the structures of their drug-enzyme complexes. Pharmacol Res, 2016. 103: p. 26-48.
21. wwPDB consortium, Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 2018. 47(D1): p. D520-D528.
22. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of database programs. Nucleic Acids Research, 1997. 25: p. 3389-3402.
23. UniProt Consortium, UniProt: a hub for protein information. Nucleic Acids Res, 2015. 43(Database issue): p. D204-12.
24. Velankar, S., et al., SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Research, 2013. 41(D1): p. D483-D489.
25. DeLano, W.L., The PyMOL molecular graphics system. 2002, Schrödinger, Inc.: San Carlos, CA.
26. Rahman, R., P.M.-U. Ung, and A. Schlessinger, KinaMetrix: a web resource to investigate kinase conformations and inhibitor space. Nucleic acids research, 2018. 47(D1): p. D361-D366.
27. Kanev, G.K., et al., KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Research, 2021. 49(D1): p. D562-D569.
28. Kirubakaran, P., et al., Comparative Modeling of CDK9 Inhibitors to Explore Selectivity and Structure-Activity Relationships. bioRxiv, 2020: p. 2020.06.08.138602.
29. Wang, G. and R.L. Dunbrack, Jr., PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W94-8.
30. Cock, P.J., et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 2009. 25(11): p. 1422-3.