key: cord-0283393-b2q40hnp
authors: Sharma, Nitesh Kumar; Gupta, Sagar; Kumar, Prakash; Kumar, Ashwani; Pradhan, Upendra Kumar; Shankar, Ravi
title: RBPSpot: Learning on Appropriate Contextual Information for RBP Binding Sites Discovery
date: 2021-06-07
journal: bioRxiv
DOI: 10.1101/2021.06.07.447370
sha: d8f61f4801f1b720d12550ea28f41b8d239693ef
doc_id: 283393
cord_uid: b2q40hnp

Identifying RBP binding sites and mechanistic factors determining the interactions remain a big challenge. Besides the sparse binding motifs across the RNAs, it also requires a suitable sequence context for binding. The present work describes an approach to detect RBP binding sites while using an ultra-fast BWT/FM-indexing coupled inexact k-mer spectrum search for statistically significant seeds. The seed works as an anchor to evaluate the context and binding potential using flanking region information while leveraging from Deep Feed-forward Neural Network (DNN). Contextual features based on pentamers/dinucloetides which also capture shape and structure properties appeared critical. Contextual CG distribution pattern appeared important. The developed models also got support from MD-simulation studies and the implemented software, RBPSpot, scored consistently high for the considered performance metrics including average accuracy of ∼90% across a large number of validated datasets while maintaining consistency. It clearly outperformed some recently developed tools, including some with much complex deep-learning models, during a highly comprehensive bench-marking process involving three different data-sets and more than 50 RBPs. RBPSpot, has been made freely available, covering most of the human RBPs for which sufficient CLIP-seq data is available (131 RBPs). Besides identifying RBP binding spots across RNAs in human system, it can also be used to build new models by user provided data for any species and any RBP, making it a valuable resource in the area of regulatory system studies.

It has been reported that at any given time, compared to just 2-3% transcription factors expression share, ~10 times higher volume of RNA binding proteins are expressed (1) . Advances with highthroughput techniques like CLIP-seq and Interactome Capture have drastically revised our understanding about RBPs which suggest that human systems are expected to have at least 1,500-2,000 genes coding for RBPs (1, 2) . Unfortunately, we are still far behind in terms of information for these regulators where hardly ~150 RBPs have been studied so far for their interactions with RNAs.

Despite of their critical functional roles in cell systems, very few RBPs have been explored with precise identification of their mechanism of action (1).

There are certain limitations with these high-throughput experiments. These experiments are costly.

They too don't give the entire RBP-RNA interactome spectrum and at a time work for one RBP only in condition specific manner. The CLIP-seq reads provide narrowed down regions to look for interactions but don't provide the mechanistic details and explanations for the interactions. Using general motif discovery tools to identify the interaction spots have got limited success in case of RBPs as they either report too short motifs which have high chances of occurrences across the random data or they don't cover large spectrum of instances. Unlike transcription factors, RBPs binding sites display sparse motif positional conservation. They are usually difficult to detect through such routine motif finding approaches. Besides the binding motifs, contenxtual sequence With Graphprot a new generation of such tools started which applied machine learning as well as leveraged from new data-sets developed from CLIP-seq experiments (12) . It also applied the concept of differential RNA secondary structure information in contextual manner to build the interaction models. A recently develop tool, beRBP, carries forward the approach similar to RBPmap while implementing a machine-learning method of Random-Forest (13) . It clusters the potential motif sites where it ranks them and uses the highest scoring regions for the matches in the given region while scanning for the user provided motif/PWM. In the followup, they have also applied an approach similar to RNAcontext where RNA structural information is provided for the motif region using ab-initio structure prediction tool, RNAfold. Further to this, it added the phylogenetic conservation information similar to RBPmap and mCarts.

With recent developments in the area of deep learning, many deep-learning based RBP-RNA interaction detection approaches have been implemented recently. DeepBind deserves special mention among them as it pioneered this category where a robust general system was created to model nucleic acids and protein interactions using convolution neural network (CNN) (14) .

DeepBind has become a sort of prototype for almost all of the recent Deep-learning based tools to identify the RBP-RNA interactions. DeepBind applies 7-mer motif weight matrices are transformation into an image pixel matrix and is scanned for entire sequence while evaluating for 4stages to derive the binding score: convolution stage, rectification stage which zooms the scanner to most promising regions for the motif, followed by pooling of all such regions and expansion and clustering of motifs, which is finally subjected to a non-linear classifier. However, the authors accepted that compared to transcription factors and their data, running DeepBind with RNAcompete data did not achieve that level of accuracy. They pointed out the importance of accurate RNA secondary structure information and RNA shape readouts in RNA-RBP interactions which most of the approaches have missed so far. Taking the work further on Deep-learning based RBP-RNA interaction detection, another prominent tool system is iDEEP which has come like a series of softwares like iDeep, iDeepS, and iDeepE (15) (16) (17) . These tools differ from each other for the way they applied various combinations of CNN and RNN layers. iDeepS applied CNN with Long-Short term memory (LSTM) while taking input from sequence and RNAshape data. iDeepE applies combinations of CNNs which capture local and global sequence properties. A recently developed tool, DeepRiPe, has evolved a CNN and GRU based deep-learning approach while also introducing transcript's regions specific information like splice junctions etc (18) . DeepCLIP is another recently developed tool which detects RBP-RNA interaction spots while applying CNN in combination with bidirectional-LSTM and claims to detect sequence position specific importance which could determine the contribution of various nucleotides in RBP binding (19) . These very recently developed deep-learning approaches have become much more complex than DeepBind and claim to achieve much higher accuracy. Their complexity comes from adding complex layers above the regular dense hidden layer. These complex layers actually do the job of automatic feature extraction unlike the other machine-learning approaches where expert knowledge is applied to identify the important properties to look into for feature extraction.

While reviewing these developments and tools, it looked imminent that there is an enormous scope of improvement in the approaches to find and locate RBP-RNA interaction spots. Some of the major points to consider would be: 1) Choice of datasets: A notable issue with all these algorithms is the choice of data-sets, especially the negative data-sets, which have mostly been too relaxed and unrealistic, due to which these tools are prone to over-fitting and imbalance. They are either randomly shuffled sequences or regions randomly selected from those RNAs which did not bind the given RBP. 2) Motif searching approach: most of existing tools, with exception of recent deeplearning based approaches, begin with predefined/user defined motif or PWM derived from traditional motif finding tools with user defined length, which is not a natural approach and one of the prime mistakes. RBP binding sites display sparse conservation which regular motif discovery tools may fail to capture sufficiently. Third, high dependence on ab-initio RNA structure prediction tools to derive the structural and accessibility information may be misleading, as already pointed out above, such tools don't provide correct information on actual complete RNA length. A better approach has been consideration of dinucleotide densities for such purpose (20, 21) . Consideration of RNA-shape appears very much important as pointed out by DeepBind as well as some other recent works (14, 22, 23) . It has been reported that pentamers capture the essence of nucleic acid's shape accurately (24) , making them a suitable candidate to be evaluated along with dinucleotide densities to derive RNA structure and shape information. Fourth, though the recent deep-learning approach claim good success through automation of the process of feature extraction at the cost of added complexity, the effectiveness of such automated feature detection needs to be evaluated.

Simpler models, if trained with carefully selected properties, are capable to outperform complex models. This is why some of the shallow learning methods have outperformed deep-learning methods on structured data (25, https://towardsdatascience.com/the-unreasonable-ineffectivenessof-deep-learning-on-tabular-data-fd784ea29c33) .

Considering these all factors, here we present a reliable Deep Neural Net (DNN) based approach to build the mechanistic models of RBP-RNA interactions using high-throughput cross-linking data while considering data from 99 experiments and for 137 RBPs for human system. An ultrafast kmer spectrum search approach was used to identify the most important seed regions in the sequence for contextual information derivation. Contextual information for 75 bases flanking regions around the identified seed derived motif was extracted in the form of variable windowed position specific dinucleotide, pentamers, and heptamers density based propensities. The combined contextual information was provided to a two hidden layers based dense feed-forward networks to accurately 7 153 154 155 156 identify the RBP binding spots in RNAs. The developed models were used to identify the interaction spots and scored very high accuracy with remarkable balance between sensitivity and specificity as well as performance consistency when tested across a large number and different types of experimental datasets. Molecular dynamics studies also supported these models. The developed approach has been implemented as a freely available webserver and standalone software, RBPSpot. It was comprehensively bench-marked across three totally unbiased standardized datasets for performance along with five recently published tools, including more complex deeplearning based tools, where it outperformed all of them consistently across all these datasets for most of the studied RBPs. Unlike most of the existing software which don't provide the option to build new models from data, RBPSpot approach can be applied to detect human system RBP-RNA interactions with its inbuilt models as well as it can be used to develop new models for other species and new RBP data also.

The study has considered human RBP models while using high-throughput sequencing data from cross-linking experiments using various CLIP-seq techniques like CLASH, dCLIP, eCLIP, FLASH-CLIP-seq, HITS-CLIP, iCLIP, PAR-CLIP, sCLIP-seq, uvCLAP-CLIP-seq. This data also includes the two cell lines eCLIP data from ENCODE. Most of them are processed peak data collected for 137 RBPs with starBase 2.0 as their primary source (26) . A total of 872Mb peak data from 99 experiments were covered in this study for RBP-RNA interaction information from CLIP-seq experiments (Supplementary Data 1 Sheet 1). The peak data of RBPs were downloaded in the form of co-ordinates along with their associated RNA information on which they were binding. Peak data were converted into BED file format along with their strand specificity information. Genome sequences of human hg19 builds were obtained from UCSC browser. Peak data were also refined based on the length distribution and peaks laying in extreme range (length >300 bases and <5 bases) were omitted from the study (Supplementary Data 1 Sheet 2).

To search binding sites motifs/seeds for any particular RBP, all the peak regions were transformed into overlapping lists of k-mers of size six to start with. Iteratively and in parallel these generated kmer spectrum for each such sequence was searched across all the reported cross-linked associated regions in the targets to obtain the enrichment status of the k-mers (seeds) on which motif would be built. These searches were allowed with maximum 30% mismatches. Since normal search would be heavily time consuming step, we implemented an enhanced Burrow-Wheeler transformation with FM-Indexing to search with any number of mismatch which made the search ultra-fast for even inexact searches. The detailed algorithmic implementation pseudo-code of the implemented algorithm is given in the supplementary methods.

All the k-mer seeds and their relatives displaying at least 70% similarity were evaluated for existence across at least 70% of peak data. Such motif seed candidates were further evaluated for their statistical significance. Those RBPs where no k-mer and their relatives crossed 70% representation were looked for the highest representation available. The remaining data which did not show the representative k-mer were checked further and recursively with minimum cut-off of 20% data representation. Motifs coming from such data were considered as mutually exclusive one.

Null model distribution probabilities of occurrence of each k-mer along with its relatives was calculated from the random data set to find their random probabilities. Random data set was generated from unassociated RNAs while randomly carving out the lengths similar to the peak data.

Significantly over-represented k-mers were screened using binomial test with p-value cut-off of 0.01. These significantly enriched k-mers were used as initial seeds to develop the final motif.

These seeds of significantly enriched k-mers were expanded in both the directions by expanding by one nucleotide both sides, followed by search across the peak data with at least 70% occurrence in the peak data while repeating the above mentioned search operation recursively. Expansion of seed region in both directions was allowed till at least 70% match existed. Final motifs were selected on the basis of satisfying both the criteria i.e. the motif displayed least 70% abundance across the CLIP-seq instances at 1% significance level and the maximum k-mer expansion maintained at least 70% identity with the associated sequences and relatives. Mutually exclusive motifs were other predominant motifs which existed in the remaining data which were scanned in similar recursive manner as described above. 

Once we had prime motifs anchored for each RBP from the given data, their associated peak data sequences were converted into positive datasets. To generate positive datasets for each RBP, start and end co-ordinates from the main motif's both terminals were expanded by +75 and -75 bases into both the directions. In case of multiple motif locations originating from a single peak for the main motif, all the locations were expanded. Different length dataset sequences formed for different RBPs which depended mainly upon the length of the core motif region. However, for any single RBP all the sequences of the dataset were of same length.

To generate the negative datasets for each RBP, similar condition corresponding RNA-seq data were downloaded from GEO. With minimum three replicates of RNA-seq data expression of each RNA was calculated. Only those transcript sequences were considered which had expression condition available for the same condition but did not bind the RNA or which was not found present in the corresponding condition's CLIP-seq binding data. Associated main motifs for the RBP were searched across these RNA sets also just in the similar manner as was done to the positive dataset instances. Locations of the main motif were reported in the form of start and end co-ordinates from where further expansion of +75 and -75 bases was done on both the sides. This way very strong negative data-sets were built which ensured that learning was in no way influenced by the motif alone as the motif may also occur randomly to some extent and surrounding context is also considered along in a right manner. This approach was carried out for 74 RBPs for which similar condition RNA-seq data were available. Datasets derived this way were called Set A data-sets.

For 57 RBP similar RNA-seq data were not available for the corresponding conditions. In such scenario the main motifs for negative datasets were searched in those regions which did not appear in the CLIP-seq data but belonged to the same target RNA sequences whose some part appeared in the CLIP-seq, suggesting that though the RNA expressed and even bound to the RBP, these regions despite of having the motif for the RBP did not bind to the RBP and may work as a suitable negative dataset. +75 and -75 flanking bases from both the terminals of the motifs were considered along with the motif region to build the negative datasets. These data-sets were called Set B datasets.

Five different types of properties were considered for input into machine learning: 1) The main motif itself, 2) Di-nucleotide density in the associated region while considering 75 bases flanking regions from both the sides of the motif, 3) Dot bracket representation of the RNA structural triplet for the data-set sequences, covering twenty seven combinations of structure triplets arising from the dot-bracket structural representation from RNAfold predicted RNA structures

., ] , 4) Pentamers density profile for each position which captures the shape information, and, 5) Heptamers densities for the complete region. Dinucleotide densities were evaluated for their discriminatory power for multiple sliding windows starting from 17 to 131. Similarly, the dot brackets structural triplets representation of the data-set sequences were generated using RNAfold (27) . They too were evaluated for optimum windows size while testing for window sizes ranging from 29 to full sequence. 1,024 pentamers and 16,384 heptamers densities were evaluated in the similar manner across the data-set sequences.

To calculate heptamers based feature, all positive datasets were split into k-mers of seven bases.

Probability of each k-mer were calculated with maximum of two mismatches for each position and accordingly populated in the tensor. Thus, we had 16,384 X ((sequence length) -7) tensor of probabilities. 16,384 rows represent the heptamers and 150 columns represent individual positions.

In the similar manner pentamer features were calculated. For that we had 1024 X ((sequence length)-5) tensor of probabilities. These both tensors were used to convert the sequence data into vectors of probabilities. All together, based on optimum windows, the combined features sets representation of all the data-set sequences was done. The optimum windows and total features varied for each RBP. Finally, each data-set was broken into training and testing data-sets ensuring that no instances from training ever appeared in the testing data-sets. The breakup for each RBP for their training and testing data-sets is given in Supplementary Data 1 Sheet 5 and 6. After generating all the features from positive and negative data-sets these features were individually checked for their performance using tree based approaches which are expected to perform better on high dimension instances. Random forest and XGBoost were applied. Each property and their associated feature sets were evaluated for the varying window sizes for their discrimination power between the positive and negative sets. Sliding windows of variable sizes were used for dinculeotide and structure based features. These variable sizes windows were evaluated for the performance. Out of these different sized windows the size producing the best performance was kept for final model generation. It was found that the best performing window size varied across the RBPs, resulting into different optimum windows for the RBPs.

Pentamers and heptamers appeared most informative on the full length window. Equal number of positive and negative instances were chosen for all RBPs considered in the study. From the total chosen instances, 60% were used to create the training set, while remaining 40% instances were used to create the testing set. Python scikit-learn library was used for the same purpose. For feature importance evaluation F-score was used for every considered feature. F-score locates the features which display major difference between their values between negative and positive training sets while comparing the averages for the feature values for positive, negative, and whole set of instances (28) . The F-score is represented by the following equation: Also, for every i-th feature, t-test was conducted between n + and nto evaluate the significance of ith feature for its discrimination capability between positive and negative instances.

With the optimized windows in the above mentioned section, feature vectors for all the RBPs were used to build models to recognize RBP binding sites using two major machine-learning approaches:

XGBoost and Two Hidden Layers based Deep Feed Forward Neural Networks (DNNs). Both were implemented using python scikit-learn, Keras, and Tensorflow libraries. In both the cases 70% and 30% of data were retained for train and test sets, respectively.

The DNNs were built where the input layers had number of nodes equal to the number of features for the RBP considered. Thus, the size of input layer varied from 1,200 to 2,500. The performance of DNN was also evaluated for various numbers of hidden layers where finally total two hidden layers were found performing the best. The connections between the nodes were made dense. For every RBP model the number of nodes across the two hidden layers varied between 700 to 1,300.

Different types of activation functions combinations were applied for the layers from a pool of a number of available activation functions. Activation functions define the layers and transform the activation values obtained from previous layer to a non-linear form, creating several hyperplanes to obtain best possible discrimination of instances. In most of the models here, the first hidden layer had RELU and the second hidden layer had ELU (for some cases they interchanged also), while the final output layer had sigmoid function.

Every learning step provides estimation of error made, measuring the error and accordingly corrections in the learning rate and weights on connections are done. This error estimation is achieved by loss/cost functions. Multiple types of loss functions were tried to optimize the accuracy.

The best performance was obtained for Binary Cross Entropy. Since its a feed forward network where the cost function assess the missed targets and accordingly network connection weights are updated though some optimizer. The optimizer parameter which worked the best was 'Adam' optimizer, otherwise SGD with momentum. Usually Adam optimizer works better because of its capability to provide different learning rates per parameter, deals better with sparse gradients, and adapts based on recent learning rates while keeping them in memory. Momentum was applied in the learning which helps to ward-off entrapment under local minima during the minimization steps. The learning rate varied from 0.001 to 0.01 and momentum varied from 0.05 to 0.9. L1 and L2 weight decay regularizors were applied to avoid over-fitting. DNN models were trained using 1000 epochs and batch sizes varying from 50 to 200 instances. All the model from DNN and XGBoost were saved in protobuf format. Since the entire system is implemented here using TensorFlow, the In XGBoost, grid search was applied for parameter optimization. Following parameters were finalized after the grid search: params = {"eta/learning rate": 0.2, "max_depth": 4, "objective":

"binary:logistic", "silent": 1, "base_score": np.mean(yt), 'n_estimators': 1000, "eval_metric":

"logloss"}. Gradient boosted decision trees learn very quickly and may overfit. To overcome this shrinkage was used which slows down the the learning rate of gradient boosting models. Size of the decision tree were run on max-depth=9. At the value of 4 stability was gained as the logloss got stabilized and did not change thereafter.

To evaluate the consistency of performance models developed with the given features, 10-fold cross validation was also performed for each RBP. Everytime, the training dataset was split into 70:30 ratio with first used to train and second part used to test, respectively. Each time data was shuffled and random data was selected for building new model from scratch. This process was repeated 10 times for each RBP. Accuracy and other perfomance measure were calculated for each model.

The performance on test sets was also evaluated. Confusion matrices containing correctly and incorrectly identified test set instances were built for each RBPs. Frequently used measures for classifier performance evaluation and accuracy of RBPs models were evaluated. Sensitivity Bank (PDB). X-Ray crystallographic structure for 13 different RBPs were downloaded. Prior to docking, protein structures were prepared by removing water molecules and other hetero-atoms, while adding polar hydrogen atoms. RNA motifs identified through RBPSpot algorithm for above mentioned five RBPs were taken as flexible molecules. All docking studies were performed through NPDock (Nucleic Acid-Protein Docking) and PATCHDOCK incorporating more realistic DARS-RNP statistical potential based on reverse Boltzmann statistics to score protein-RNA complexes (30) . RNA motifs three dimensional structures were built using RNACOMPOSER web server based on RNA FRABASE database relating the RNA secondary and tertiary structure elements. In order to search for all possible RNA-binding sites and optimize the structural effects of RNA on the construction of complex, short RNA motifs were taken into account. Protein-RNA interface residues were predicted using DR_Bind1 (31) based on evolutionary conservation. Top three representative docking potential-ranked protein-RNA complexes were built for each of the above mentioned RBPs and the best one was considered for further analysis.

All molecular dynamics simulations of the RBP alone and the RBP-RNA complex were conducted using GROMACS 5.1 package (32), modeling each system with the AMBER03 force-field of protein and nucleic acids (33) with periodic boundary conditions. The topology files for the selected target RNA motifs were built using pdb2gmx in the framework of AMBER03 force-field. Models were solvated with the TIP3P water model (34) . The distance between the biomolecule and the edge of the simulation box was set as minimum 1.0 Å so that they could not directly interact with their own periodic boundary condition and fully immerse with water while rotating freely. Boxes were solvated with TIP3P water. The number of solvated molecules added to each system varied. After the establishment of initial configuration, the systems were minimized. 50,000 steps ( steepest A two steps statistical approach was employed to identify the co-occurring motif pairs. In this approach, the positive set of RBP was scanned for other most frequent occurring k-mers. Top cooccurring motifs were checked for their statistical significance. KS-test was used to find the significance of distance for two motifs. All the distance between two motifs were calculated from positive and negative data-sets. Distribution plot of random data and positive data were further checked using KS-test. Level of significance were considered p<0.05. They were further checked for frequency ratio (FR). At 5% level of significance, if the hypergeometric test p-value was less than 0.05, motif pair of enriched and co-occurring motifs was considered significant. Additionally, frequency ratio (FR) as a measure of co-occurrence of motif pairs was also computed to estimate the tendency of motif pairs to co-occur with each other as proposed previously (37): 

A total of 29 RBPs from RNAcompete study were found overlapping with our set of 131 RBPs.

Their IUPAC motifs were downloaded from RNAcompete web portal. For these 29 RBPs a total of was done with the search for motifs identified by RBPSpot approach in order to maintain an unbiased motif search approach. Random data sets to evaluate the random chance observations were generated from the transcriptome data using the length exactly similar to the ones from the crosslinking peak data. The similar above mentioned allowed mismatches based motif searching criteria was used here also to scan the random datasets for motif occurrence in them. Binomial test was applied to find the significance of these motifs in the cross-linking data. Other than RNAcompete motif, experimentally validated motifs were also considered from CISBP-RNA Database. A total of 31 RBPs from this dataset were found overlapping with RBPSpot data. Out of these 31 RBPs, 24

RBPs were reported from RNACompete study only, two RBPs were reported through SELEX and yeast three-hybrid screening whereas five RBPs were reported from RNAcompete and SELEX/RIP-Chip. These motifs were also searched in the similar manner.

To identify the binding sites of RBPs across SARS-Cov2 genome, we downloaded its genome from NCBI (accession number NC_045512).

Reads data collection, filtering, and pre-processing 

As discussed in the introduction section, most of the available tools for identifying the RBP bindings sites across the RNAs require either prior information available traditional motif finding approaches like MEME, TOMTOM or HOMER. The application of traditional motif discovery tools may not be much information in case of RBPs which have been reported to be sparse, short, and poorly conserved. Further to this, such motif discovery approaches expect user defined motif length instead of naturally capturing the motif. In general, if such motifs are not considered with proper context they may lead towards false discoveries. Here, we have used the initial deepsequencing data to find the most frequently occurring k-mers (seeds) to make it an initial step for motif finding. To find a naturally occurring most frequent k-mers, search was started with k=6 with two mismatches. Reason behind this was that some of the previously reported motifs for RBPs were either very sparse or as small as 4 bases long only. Therefore, a k-mer with six bases and with two mismatches would fetch all possible 6-mer spectrum which would agree with each other with two mismatches (relatives to the main k-mer) while also meeting the lowest bound of such motifs. This defined the 6-mer groups within two mismatches. Since a large number of 6-mers spectrum is created whose search with two mismatches across the sequences becomes a computationally intensive and time consuming step, a FM-Indexing and Burrows Wheeler Transformation (BWT) based inexact search step was applied. Since, parallelism through multiprocessing was also implemented, the search becomes more faster with available cores of CPUs.

4,096 possible combinations of 6-mers were individually searched in the peak data for every RBP.

To select the most abundant 6-mers, the first criteria was its occurrence across at least 70% of the CLIP-seq peak region data. All the 6-mers which were occurring in at least 70% data were evaluated for their significance occurrence at p-value<=0.01 using binomial test. Many most frequently occurring 6-mers were found whose numbers varied for RBPs (from RBM39 (3) to ELAVL1 (17)). The found significant spots for 6-mers for any given type worked as the seed which were subjected to bi-directional expansion. This expansion step every time evalauted the similarity between the expanded region and checked for minimum similarity cut-off of 70% across the considered seed regions which were expanding. The 6-mer seeds were expanded at every found position until they were satisfying both the criteria. The final step resulted into the most frequently occurring elongated k-mers with maximum possible elongation with both criteria met. After elongation, the best scoring expanded k-mer family for each RBP was considered as the primary motif in the RNA sequences interacting with the given RBP. It was found that at least one such primary motif existed for all the RBPs considered in this study, barring four RBPs. These primary There were four different RBPs viz. FXR1, SND1, ILF3, and U2AF1 which did not have any single seed k-mer occurring in at least 70% of the data. This suggested the possibility for multiple motifs working in mutually exclusive manner. It was found that two different motif groups for these four RBPs were working almost in mutually exclusiveness manner with small fractions of overlaps in their instances. The overlap levels between these two motif groups' instances were: FXR1 (7.8%), SND1 (6.8%), ILF3(8.25%) and U2AF1 (9.5%) instances (Supplementary Data 1 Sheet 4).

However, for these cases the found 6-mers could not be expanded further as at least 10% of the data was lost due to this.

This way, the most significant motifs present in the cross-linking data of all these RBPs were discovered which could act as anchor in contextual form. It was interesting to observe that the When these motifs were mapped back to the genome in order to derive the contextual information, several of them hinted for coexistence of secondary supporting motifs for any given RBP. Such cases were studied further for co-occurrence of motifs where the most dominant motif would be supported by some other predominant secondary motif. All those sequences where the dominant motif existed were also searched for the supporting secondary motifs. Obtained co-occurring motif pairs were further evaluated to measure the similarity between them using Jaccard similarity index based approach. The method utilizes the position weight matrices of co-occurring motifs for alignment considering relative shifts to recognize similarity between two motifs (40) . All cooccurring motif pairs possessed similarity score < 0.2 ensuring different motif partners being evaluated instead of same motif repeating itself. Sequence regions where the motifs co-occurred displayed high statistical significance of co-occurrence rate for the motif pairs for any given distance(p-value<<0.05; KS-test ). For all RBP models, big difference was observed for the distribution of co-occurring motif pairs when compared to the random sequence regions, strongly supporting the existence of co-occurrence of motifs in RBP binding models of RNAs. Figure 2 illustrates some of these cases. In this way, 178 statistically significant co-occurring motifs pairs out of 297 motif pairs for 127 RBPs were obtained, strongly suggesting again that context holds importance in RBP-RNA interactions. Co-occuring motif details for the RBPs is given in the Supplementary data 1 Sheet 8. Further these motifs were also checked for frequency ratio >1 as discussed in method section. All 178 statistically significant co-occurring motif pairs were found to have frequency ratio (FR) > 1. These co-occuring motifs were analyzed for the region flanking 75 bases from both sides of the prime motif. The reasons for considering this region becomes more clear in the following next section.

The motifs reported in the present study were compared with the experimentally reported motifs. was also observed that several of these experimentally reported motifs were not the prime motif reported here but matched to other lower ranked motifs which either co-occured with the prime motifs or were exclusively present, covering comparatively lesser amount of CLIP-seq data than the prime motifs reported in the present study. Their occurrence in the cross-linking data varied from 34.07% to 81.32% while the prime motifs reported in this study mostly covered at least 70% of CLIP-seq data. Figure 3 provides a snapshot of the comparison between experimentally reported motifs with motifs identified in the present study.

The discovered motifs above work as a point to zero upon to consider the potential significant interaction spots in the RNA. However, such motifs alone can't hold much higher stake than that as they may appear in the non-binding regions also, though found statistically significant for the instances was greater than total number of peak data for most of the RBPs due to multiple occurrence of motifs on a single sequence.

Identifying suitable negative dataset candidates becomes a more crucial task. And it is where most of the previously developed tools have gone too soft and mostly ended up selecting random sequences, which actually does not help to divulge more information. As transpires from above discussions and results, there are many spots across the transcriptomes which posses sequences similar to the interaction motifs but they yet not interact. In usual, chances of finding shorter motif themselves is higher in the random data. In such scenario, considering random sequences really does not add significantly to the purpose of discrimination and does not answer the question raised above. In order to build a better negative dataset, it is better to pick those candidates as negative instances where the region similar to the main motif is present and creates a strong confusion matrices to build a more natural and robust model. Therefore, to create the negative set for RBPs two different kind of strategies were used. In the first strategy we used RNA-seq data for the same condition for which we had the cross-linking data available for the given RBP. Those RNAs were selected which were expressing themselves in the same condition but did not bind to the considered RBP and did not reflect in the CLIP-seq data. They were searched for the prime motifs of the RBP similar to the positive data cases and in similar manner 75 bases flanks were considered along with capturing the contextual information with more discrimination power. For the RBPs for which the negative datasets were created using this strategy are called Set A RBP datasets throughout this study. This way, the negative datasets for 74 RBPs were created (Supplementary Data 1 Sheet 5).

In the second strategy, the negative datasets were created for those RBPs which did not have similar condition RNA-seq data available for the considered CLIP-seq conditions. In such scenario, therefore, here those RNAs were considered which exhibited binding to their respective RBPs but they also had the motifs on other positions which did reflect in the CLIP-seq data and were also far away from such cross-linking regions. The logic behind is that such RNA sequences whose some regions exhibited binding to RBPs in CLIP-seq data make clear positive instances out of these regions as well as hold a simultaneous evidence that these RNAs were expressed in the given condition. Regions which display the interaction motif in these expressed RNAs but don't bind to the RBPs become an apt case for negative instance consideration with high potential for contextual information unlike the usual random sequences. This particular set of negative dataset instances were called Set B. In this way, the Set B negative dataset were created for the remaining 57 RBPs (Supplementary Data 1 Sheet 6)). Rest of the analysis were same on both the sets of RBPs. This all also reinforces the view that any successful RBP-RNA interaction discovery approach can not be founded solely upon the motifs consideration but needs correctly designed context information extraction approach also which can be provided only after a better a negative instances consideration.

Motif discovery and anchoring helped in selecting the more appropriate positive and negative instances from which contextual information and features might be derived. The contextual information came in the form of other co-occurring motifs, sequence specific information, position specific information, and structural/shape information which could exhibit sharp discrimination between negative and positive instances. Contextual information were derived from the features for picking up any further sequence specific signals in the flanking region, where similar approach of inexact search was applied with at least 70% similarity match as was done for the prime motifs' 6-mer seeds. Pentamers application was motivated from the recent findings which reported that pentamers capture the DNA shape very accurately (24) . The nucleic acids shape has been found critical in the interactions with regulatory proteins which scan these shapes for their stationing. So far, this approach has been applied on DNA but hardly on RNAs. The DeepBind work had observed about the importance of using such kind of information which could be beneficial in future developments for the tools reporting RBP-RNA interactions (14) . The dinucleotide densities have been found to be highly useful in indirectly evaluating the RNA structure and accessibility (20, 21) . In fact, it has been found more promising than ab-initio RNA structure prediction. Abinitio methods' accuracy drastically falls with the length of RNA, and they are suitable for only short RNA sequences (9, 20) . Pentamer and di-nucleotide frequencies capture better structural and Figure 4 presents an example of one such group, RBPs belonging to AGO4 cluster (Cluster 1). As can be noted in this figure also, CG is remarkably enriched for the binding site regions. Therefore, despite of having binding sites motifs they differ in their binding which is influenced by context. Also, the universal prominence of CG in the RBP binding regions reinforces the theory which suggests their regulatory roles in stationing the binding factors and supporting the binding motifs (42) . Also, they may be studied further for RNA modification which are considered critical for RBP binding dynamics. suggest that substantial amount of information is being held by the flanking regions around the binding motif, which may be one of the determinant for contextual interactions between RBP and RNA. A series of t-tests between the positive and negative instances for various features also supported this. Biologically, heptamers and pentamers were expected to reflect any supporting cooccurring motifs near the prime anchored motif. Pentamers, specifically, were considered to capture the shape properties, which too have been called important in protein and nucleic acids interactions, more so in cases where sequence motifs are not clear or prime (14, 24) . A closer look with these pentamers and heptamers revealed that for many RBPs binding sites, they were prominent in the flanking regions where the co-occuring secondary motifs existed (Figure 2 ).

Though heptamers were found more reflective to this phenomenon. As could be expected now, these information properties from the flanking regions looked highly promising for identification of a true binding site. The impact of each of these properties on discrimination capacity between true binding sites and negative sites was also clear when evaluated directly on the machine learning models for performance, as transpires in the following section.

Before combining the features to build the collective models for RBP-RNA interactions, one more assessment of contributions by the above mentioned properties in discrimination was done.

Classification assessment was made for each given properties separately before joining them Figure 5 (A) presents the violin plots for the accuracy distributions observed for the classifications done by each of these properties for all the RBPs.

With this all, it was pretty evident that the selected properties and their features had strong discriminatory strength, barring the RNA structural information derived through ab-initio structure prediction method. All the features originating from these qualifying properties were combined together to build the final models of RBP-RNA interaction targets.

After getting optimum window size for di-nucleotide densities, we combined these three features (pentamers probabilities, heptamers probabilities and di-nucleotide densities) together to build the final models. The final models were built using Xgboosting as well as DNN. The reason for considering these two different approaches are that: 1) they reflect two different learning approaches: Shallow and Deep, 2) They complement each other as Xgboost works good for the cases with comparatively lower training data while DNN performance is good where learning data is higher, 3) Both the approaches work very good for conditions where the dimensions are high, as was with this study.

Combining of the features based on above mentioned properties was done in a gradual manner in order to see the additive effect of them on the classification performance. As it is apparent from In general, it was apparent that DNN approach was sensitive towards the volume of training instances as it was found performing better where number of instances were higher. But the biggest impact on performance was observed was for the granularity of dataset creation. Performance of DNN was specially more marked here, as can be seen from its performance plot on Set A datasets.

On Set A, the DNN models performance hardly touched below 90% accuracy. Even XGBoost's performance was better with Set A when compared to Set B. It needs to be recalled that Set B was made for those RBPs for which the RNA-seq data was not available for the considered CLIP-seq conditions. In such scenario, those RNA were considered to generate the negative instances whose some regions were present in the CLIP-seq data suggesting their expression. From the same RNA, those regions were selected which were having the prime motifs but yet not binding to the RBP and not reflected in the CLIP-seq data and were distant from such binding regions. While Set A negative instances were clearly those regions which were expressed during the CLIP-seq experimental condition and possessed the prime motif but no region of the RNA itself bound to the RBP. Thus, though the over all performance with Set B was still good and better than the datasets used by the compared tools as transpires in the next section, it same time reflects that how important it is to have a refined data-set like Set A. This is possible that some instances covered as negative instances in Set B could be contributing to the RBP-RNA interactions or could not be captured in the CLIPseq experiments. Yet, as transpires from the various performance metrics plots across various RBPs given in Figure 6 and AUC/ROC plots given in Figure 7 , the developed approach in this study, named as RBPSpot, showcases a consistently high and reliable performance for a large number of RBPs. It also provides the largest number of models for RBPs binding developed from CLIP-seq data to this date. (representing some very recent and more complex deep-learning based tools). Besides this, the benchmarking has also considered three different datasets as this work also presents a new dataset while underlining the importance of better datasets in creating better models as well as to carry out a totally unbiased assessment of performance of these tools on different datasets.

Thus, the first dataset considered in the benchmarking study was derived from the RBPSpot dataset.

Only those RBPs were considered for comparison for which at least one tool had model built,

besides RBPSpot itself. This way comparison was done for 52 RBPs. The second dataset considered was the one evolved during development of Graphprot software which has been used largely by various other datasets for model building and performance benchmarking purposes. The third dataset used in this benchmarking study was the one used by beRBP software which too has been used by many other tools for the same purpose. Details about these datasets have already been discussed above and in the methods sections.

All these six software were tested across all these three datasets and RBPSpot outperformed all of them across all the datasets, and for all the performance metrics considered (Figure 8) . Figure 8 gives After RBPSpot, the best performance was observed for the complex deep-learning based software iDeepE and DeepCLIP. On RBPSpot dataset, iDeepE performed better than DeepCLIP, but for other two datasets they attained almost similar metrics scores for performance. Undeniably, they emerged far superior than their deep-learning predecessor, DeepBind, and other compared tools. They even displayed much smaller dispersion of their scores than other compared tools. However, RBPSpot's performance points out that more appropriate features may be learned through training on biologically relevant properties to derive better discrimination power using machine learning approach, which can be amalgamated with Deep Neural Nets with much lesser complexity and superior performance than applying complex deep-learning layers to automate feature extraction.

The observations made in the introduction part of this work appeared true in this study that such complex deep-learning approaches score good on unstructured data where clear features identification and extraction is difficult to be done by expert and automation is required for feature extraction. The problems where features are identifiable and can be structured, simpler machine learning models may outperform the complex deep-learning approaches.

The above mentioned benchmarking was done for all the tools while keeping their original training dataset and models for RBPs. Most of the existing tools don't provide the option to build user specified models of RBPs using their algorithms but come with their own pre-built models. This limits the scope to test the algorithms with different combinations of datasets. Fortunately, the two best performing tools after RBPSpot, iDeepE and DeepCLIP, provided this scope where the users may build their new models with their own datasets. Also, since these two tools performance were next to RBPSpot, they stood as a natural choice to study the performance impact with datasets and their triplicate pairs. Details can be found in Table 1 .

In the nutshell, the structural molecular dynamics study supported the identified binding spots for the RBP where it was clearly evident that the identified binding motif provided structural stability to the considered RBP-RNA complexes.

Most of the deadly viruses are RNA viruses which exploit the host proteins to replicate, spread and survive. The best living example is nSARS-CoV-2. The emergence of the novel human corona-virus SARS-CoV-2 in Wuhan, China has caused a pandemic of respiratory disease (Covid19). The big scientific concern is that to this date very scarce and uncertain molecular information is available about the Covid19 patient's molecular system as not much high-throughput studies have been carried out so far. There is almost absolutely no information on the host RBPs response during (2), TARBP2 (4) and KHSRP (4)) ( Figure 11 ). Among these, AIFM1 interaction with viral polymerases in influenza virus infected cells is well studied (45) . These all binding sites were found on anti-sense strand of the genome whose importance is for viral replication. where users can put their own data and raise their own models for any species and any RBP. The software is freely available as a webserver as well as as an standalone program.

From here, we visualize that incorporation of spatio-temporal and other interactome network information for RBPs as the another dimension to explore to further improve our understanding on 

All the secondary data used in this study were publicly available and their due references and sources have been provided. All data and information generated/used, methodology related details etc have also been been made available in the supplementary data files provided along with and also From the plots it is clearly visible that for all these datasets and for almost all of the RBPs, RBPspot consistently outperformed the compared tools for all the metrics. More all the radar plots it scored the highest and nearest to the isosceles triangle suggesting consistent and better average performance also. The box plot suggests that RBPSpot not only performed best in overall but also the dispersion of its various metric scores were much lesser than other compared tools. Some of these tools exhibited enormous variation in the distribution of their metric score suggesting unstable performance by them. RBPSpot dataset presented here emerged as a better dataset for such studies. 

A census of human RNA-binding proteins

Metabolic Enzymes Enjoying New Partnerships as RNA-Binding Proteins

A compendium of RNA-binding motifs for decoding gene regulation

RBPDB: a database of RNAbinding specificities

CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA-binding proteins

POSTAR: a platform for exploring post-transcriptional regulation coordinated by RNA-binding proteins

CLIPdb: a CLIP-seq database for protein-RNA interactions

RNAcontext: A New Method for Learning the Sequence and Structure Binding Preferences of RNA-Binding Proteins

A comprehensive comparison of comparative RNA structure prediction approaches

RBPmap: a web server for mapping binding sites of RNA-binding proteins

mCarts: Genome-Wide Prediction of Clustered Sequence Motifs as Binding Sites for RNA-Binding Proteins

GraphProt: modeling binding preferences of RNA-binding proteins

beRBP: binding estimation for human RNA-binding proteins

Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning

RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach

Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks

Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks

Deep neural networks for interpreting RNA-binding protein target preferences

DeepCLIP: 48 effect of mutations on protein-RNA binding with deep learning

Flanking region sequence information to refine microRNA target predictions

A unified dinucleotide alphabet describing both RNA and DNA structures

Specificity and nonspecificity in RNA-protein interactions

The role of RNA sequence and structure in RNA--protein interactions

DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale

Deep Learning with Structured Data

starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data

ViennaRNA Package 2.0

Combining SVMs with Various Feature Selection Strategies

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking

STAR RNA-binding protein Quaking suppresses cancer via stabilization of specific miRNA

GROMACS: A message-passing parallel molecular dynamics implementation

A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations

Comparison of simple potential functions for simulating liquid water

LINCS: A linear constraint solver for molecular simulations

A Leap-frog Algorithm for Stochastic Dynamics

A novel unbiased measure for motif co-occurrence predicts combinatorial regulation of transcription

AURA 2: Empowering discovery of post-transcriptional networks

BindSpace decodes transcription factor binding signals by large-scale sequence embedding

Jaccard index based similarity measure to compare transcription factor binding site models

Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine

CG dinucleotides enhance promoter activity independent of DNA methylation

The RCSB Protein Data Bank: redesigned web site and web services

Impact of Cl− and Na+ ions on simulated structure and dynamics of βARK1 PH domain. Proteins: Structure, Function, and Bioinformatics

Comprehensive Proteomic Analysis of Influenza Virus Polymerase Complex Reveals a Novel Association with Mitochondrial Proteins and RNA Polymerase Accessory Factors

We are thankful to the Director, CSIR-IHBT, for his kind support. We are thankful to Dr. Indu Gangwar for her inputs for the study. We are thankful to DBT for the funding support they gave for this project. NKS is thankful to CSIR for financial support as project associateship. UKP and PK are thankful to ICAR, New Delhi for providing support in Ph.D. UKP, NKS, and PK are thankful to Academy of Scientific and Innovative Research (AcSIR).

Not applicable.

The authors declare that they have no competing interests.