key: cord-0257402-uqzw4wzv
authors: Halloran, John T.; Urban, Gregor; Rocke, David; Baldi, Pierre
title: Deep Semi-Supervised Learning Improves Universal Peptide Identification of Shotgun Proteomics Data
date: 2020-11-14
journal: bioRxiv
DOI: 10.1101/2020.11.12.380881
sha: a88796cedd65d68fc56298567e9fdc477371a6b4
doc_id: 257402
cord_uid: uqzw4wzv

Abstract: In proteomic analysis pipelines, machine learning post-processors play a critical role in improving the accuracy of shotgun proteomics analysis. Most often performed in a semi-supervised manner, such post-processors accept the peptide-spectrum matches (PSMs) and corresponding feature vectors resulting from a database search, train a machine learning classifier, and recalibrate PSM scores based on the resulting trained parameters, often leading to significantly more identified peptides across q-value thresholds. However, current state-of-the-art post-processors rely on shallow machine learning methods, such as SVMs, gradient boosted decision trees, and linear discriminant analysis. In contrast, the powerful learning capabilities of deep models have displayed superior performance to shallow models in an ever-growing number of other fields. In this work, we show that deep neural networks (DNNs) significantly improve the recalibration of shotgun proteomics data compared to the most accurate and widely used post-processors, such as Percolator and PeptideProphet. Furthermore, we show that DNNs are able to adaptively analyze complex datasets and features for more accurate universal post-processing, leading to both improved Prosit analysis and markedly better recalibration of recently developed p-value scoring functions.

The field of proteomics has undergone explosive growth in the past two decades, largely fueled by technological breakthroughs in mass spectrometry.
Most often accomplished through liquid chromatography tandem mass spectrometry (LC-MS/MS) followed by a peptide-database search, proteomics experiments have concurrently seen rapid increases in the size and complexity of generated datasets, resulting in the proteome-scale analysis of whole biological systems. 1, 27, 46 In practice, however, the peptide-spectrum matches (PSMs) resulting from a database search are often uncalibrated, diminishing overall identification accuracy. To overcome this, machine learning post-processors are widely used to recalibrate PSMs, greatly improving the yield of identifications in proteomic analysis pipelines. The most accurate of these post-processors are semi-supervised, 6, 23 using decoy PSMs as negative training examples and high-confidence target PSMs as positive training examples. For instance, one of the most popular post-processors, PeptideProphet, 6, 25 uses linear discriminant analysis (LDA) with fixed feature sets and pre-computed weights to recalibrate input PSMs. Another extremely popular post-processor, Percolator, 23 adaptively learns feature weights by iteratively training a support vector machine (SVM) and recalibrating input PSMs using the final learned SVM parameters. Subsequent works have used shallow neural networks, 42 Naive Bayes classifiers, 38 and gradient boosted decision trees. 22 However, these approaches limit analysis to specific search engines and sets of MS/MS features, similar to PeptideProphet. In contrast, universal post-processing, in which arbitrary sets of MS/MS features may be analyzed accurately, has been made possible by Percolator's adaptive algorithm. By analyzing arbitrary feature sets and adaptively learning all parameters, universal post-processing has enabled two important developments for MS/MS analysis.
The first is the recent development of machine learning methods that extract large numbers of sophisticated features from MS/MS datasets 11, 15, 16, 44 and rely on universal post-processing to use these complex feature sets for better PSM recalibration. The most notable of these feature-extraction methods is Prosit, 11 which uses deep neural networks to extract 60 informative features (per PSM) for Andromeda 7 searches and feeds these features into Percolator to improve search results. The second is the quick and easy adoption of PSM recalibration by newly developed search algorithms. For instance, while Percolator was initially adopted by many established database-search scoring algorithms after its introduction (e.g., Mascot, 5 XCorr, 10, 23, 34 and X!Tandem 47 ), Percolator analysis has since been rapidly adopted by more recent search algorithms near their initial development (e.g., MS-GF+, 12 XCorr p-values, 19 DRIP, 14 and combined res-ev p-values 31 ). However, while Percolator has demonstrated impressive performance for general feature sets, especially compared to other popular post-processors, 45 the use of shallow machine learning models (such as SVMs and gradient boosted decision trees) potentially leaves identifiable peptides on the table. In particular, deep learning 3 has led to many recent groundbreaking advances in other fields, such as computer vision, 29, 33 speech recognition, 17, 18 genomics, 2, 48 particle physics, 4 climate analysis, 43 and medical diagnosis. 32, 35, 41 We show that deep neural networks (DNNs) improve MS/MS universal post-processing accuracy across a large number of diverse datasets, identifying more PSMs than Percolator for both Prosit analysis and the post-processing of a recently developed scoring algorithm designed for new MS/MS machines and datasets. 31
Most notably, DNNs offer the highest performance gains for the most sophisticated feature sets, demonstrating that the intrinsic feature-learning capabilities of deep models are more effective at exploiting rich PSM information than those of shallow models. Furthermore, compared to Percolator and the non-adaptive post-processors Scavager, 22 Q-ranker, 42 and PeptideProphet, 25 DNNs markedly improve the recalibration of PSM scores collected from the widely supported Comet search algorithm, 10 identifying fewer PSMs than one other method (at less stringent q-values, q ≥ 0.05) on only a single evaluated dataset out of twelve. The deep semi-supervised learning algorithm presented herein is available in the new universal post-processing package, ProteoTorch. ProteoTorch uses an iterative, semi-supervised training procedure to recalibrate input target and decoy PSMs (illustrated in Figure 1). By construction, all decoy PSMs are incorrect identifications and are thus assigned negative labels, whereas positive training examples are estimated as the set of target PSMs with scores achieving a stringent, user-specified q-value. These scores are then re-evaluated in each training iteration and the positive label assignments are updated, using the predictions of a classifier trained to distinguish between positively and negatively labeled PSMs. This overall process repeats either for a user-specified number of iterations or until convergence. Furthermore, to prevent overfitting and to improve generalizability, three-fold cross-validation (CV) is carried out in the overall procedure, where the dataset is partitioned into three separate test and train splits (as described in 13 ). Thus, PSMs from a training set are always disjoint from the corresponding test set (i.e., the set of PSMs to be re-scored). When the classifier is a linear SVM, 24 the described training scheme is equivalent to Percolator. 23
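The iterative procedure described above can be sketched in a few lines of pure Python. This is an illustrative simplification, not ProteoTorch's actual implementation: `train_fn` stands in for any classifier-training routine, and the FDR estimate used for labeling is deliberately crude.

```python
def assign_positive_labels(psms, scores, q_threshold=0.01):
    """Label high-confidence target PSMs as positives; decoys stay negative.

    `psms` is a list of (is_target, features) pairs. Targets whose crude
    FDR estimate (decoys/targets among higher-scoring PSMs) falls within
    the stringent q-value threshold become positive training examples.
    """
    order = sorted(range(len(psms)), key=lambda i: -scores[i])
    labels = [-1] * len(psms)
    targets_seen = decoys_seen = 0
    for i in order:
        if psms[i][0]:
            targets_seen += 1
        else:
            decoys_seen += 1
        if psms[i][0] and decoys_seen / max(targets_seen, 1) <= q_threshold:
            labels[i] = 1
    return labels

def recalibrate(psms, init_scores, train_fn, n_iters=10):
    """Percolator-style semi-supervised loop with three-fold cross-validation.

    Each iteration relabels PSMs from the current scores, trains a classifier
    per CV fold, and re-scores only the held-out fold, so training and test
    PSMs are always disjoint.
    """
    folds = [list(range(f, len(psms), 3)) for f in range(3)]
    scores = list(init_scores)
    for _ in range(n_iters):
        labels = assign_positive_labels(psms, scores)
        for f, test_idx in enumerate(folds):
            train_idx = [i for g in range(3) if g != f for i in folds[g]]
            clf = train_fn([psms[i][1] for i in train_idx],
                           [labels[i] for i in train_idx])
            for i in test_idx:
                scores[i] = clf(psms[i][1])
    return scores
```

Plugging a linear-SVM trainer into `train_fn` recovers a Percolator-like scheme, while ProteoTorch substitutes a deep neural network.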
The DNN classifier used in ProteoTorch is a multilayer perceptron (MLP), the most suitable deep learning architecture for this problem because the input PSM features have no regular (or temporal) structure that could be exploited by convolutional or recurrent neural networks. To avoid confusion in the DNN training details that follow, we use the term iteration to refer to a single step in the overall semi-supervised training procedure, and the term epoch to refer to a single pass through the data when training the DNN; thus, each iteration consists of several epochs of training within a CV fold, and each epoch, in turn, consists of many gradient-update steps over small batches of training data. By default, the DNN consists of three hidden layers of 200 ReLU (rectified linear unit) neurons each and is trained for 50 epochs using the Adam 28 optimization algorithm. Within an iteration, DNN training starts with a learning rate of 0.001, which is periodically reduced (ten times) by a factor of up to 50 over the course of five training epochs and then reset. At the low point of each of these ten reduction cycles, a snapshot model 20 is taken. The resulting ten snapshot models are combined into one final ensemble model, taking into account the validation accuracy of the individual models, and this ensemble serves as the trained classifier for the iteration. This snapshot-based process has the distinct advantage of producing a reasonably diverse ensemble classifier without requiring additional training time. The loss function used during training is a modified cross-entropy loss, which increases the loss incurred by false-positive predictions by a factor of four (leading to higher penalties for incorrectly predicted training decoys). Additionally, label smoothing is used, which effectively shifts the target probability for each class slightly away from 100%.
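The modified loss just described might look as follows in a pure-Python sketch (illustrative only; the exact weighting and smoothing constants in ProteoTorch's implementation may differ):

```python
import math

def modified_cross_entropy(p_pred, y_true, fp_weight=4.0, smoothing=0.1):
    """Binary cross entropy with two modifications: label smoothing pulls the
    target probability slightly away from 0/1, and decoy examples (y_true=0)
    that are predicted as targets incur a heavier (4x by default) penalty."""
    eps = 1e-12
    y = y_true * (1.0 - smoothing) + 0.5 * smoothing  # smoothed target label
    loss = -(y * math.log(p_pred + eps) + (1.0 - y) * math.log(1.0 - p_pred + eps))
    return fp_weight * loss if y_true == 0 else loss
```

Confidently mispredicting a decoy (a high `p_pred` with `y_true = 0`) thus costs several times more than the symmetric mistake on a target, reflecting that decoy labels are certain while target labels are not.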
These loss function modifications were implemented to account for the asymmetry in the reliability of the labels in PSM datasets. Prior to the first iteration, a large ensemble of snapshot models (30 by default) is used to estimate initial PSM scores, which decreases the number of iterations necessary for overall convergence. Furthermore, to regularize training with coarse initial scores, the dropout rate in the first iteration is set to 0.5. As the estimated scores improve in subsequent iterations, the benefit of dropout diminishes significantly and, thus, the dropout rate is set to zero for all iterations beyond the first. All discussed DNN hyperparameters are the default values used in Section 3. Database search results were collected using Crux 34 and Comet 10 for four different high-resolution MS1/MS2 datasets: a SARS-CoV-2 dataset collected from COVID-19 patients, 21 downloaded from the PRoteomics IDEntifications (PRIDE) database (project PXD018682); Saccharomyces cerevisiae (yeast) 39 and Plasmodium falciparum 40 datasets, both downloaded from MassIVE MSV000084000; and a draft of the human proteome, 27 downloaded from PRIDE project PXD000561. The COVID-19, yeast, and Plasmodium datasets were searched with Crux using the mzML spectra files supplied in the respective data repositories; these files were converted to ms2 format using msconvert 26 for Comet searches. RAW human files were converted to ms2 format using msconvert with peak-picking and deisotoping filters. Yeast proteins were downloaded from UniProt on April 26, 2020, resulting in 6,729 sequences, and Plasmodium falciparum proteins were accessed the same day from PlasmoDB, resulting in 14,722 sequences. For the COVID-19 dataset, both human and SARS-CoV-2 proteins were searched by combining the UniProt human reference proteome and SARS2 organism proteins (https://covid-19.uniprot.org), both accessed April 30, 2020 and together comprising 20,363 sequences.
The database used to search the human dataset was the same UniProt human reference proteome used for the COVID-19 data. All searches were fully tryptic, allowed two missed cleavages, specified a fixed modification of carbamidomethylation of cysteine, and produced concatenated target-decoy results. The COVID-19 and human datasets were searched using a precursor tolerance of 10 ppm and a fragment mass tolerance of 0.05 Da, while the yeast and Plasmodium datasets were searched using a precursor tolerance of 50 ppm and a fragment mass tolerance of 0.02 Da. For searches of the COVID-19 dataset, methionine oxidation and aspartic acid deamidation were specified as variable modifications. For searches of the yeast and Plasmodium datasets, a variable peptide N-terminal pyro-glu modification was specified and, for Comet searches, a variable protein N-terminal acetylation was further specified (this variable modification was not yet supported in the utilized version of Crux). For the large-scale human dataset (consisting of 426 files and 24,931,642 total spectra), a variable methionine oxidation modification and a variable N-terminal glutamine cyclization were specified, as well as a variable protein N-terminal acetylation for Comet searches. Crux tide-search was run with --exact-p-value true --score-function both, thus ranking PSMs by NegLog10CombinePValue (i.e., the calibrated combination of XCorr 19 and res-ev 31 p-values) during database search. Comet's num_output_lines parameter was set to five (matching Crux tide-search's default). Both Comet and Crux searches were set to output pepXML and PIN (Percolator INput) files. All Crux and Comet settings not discussed were left at their default values. For each dataset, the resulting pepXML files were combined using InteractParser from the Trans-Proteomic Pipeline 8 (the resulting PIN files were similarly concatenated using Linux command-line utilities).
Per dataset, the combined pepXML files were subsequently converted to tsv format (for Q-ranker processing) using Crux psm-convert. Prosit PIN files were downloaded from PRIDE project PXD010871. As described in, 11 these four PIN files are the result of Prosit post-processing of Andromeda searches over the same human gut MS data 37 with four increasingly complex protein databases: (1) Human (20,260 proteins), (2) Human+Bacteria (276,628 proteins), (3) All organisms (469,313 proteins), and (4) IGC+All, which includes the human gut microbial integrated gene catalog 30 (IGC, 10,330,558 proteins). The described benchmark files are summarized in Table 1. All database search and post-processing results are available at http://jthalloran.ucdavis.edu/proteoTorchData.html. ProteoTorch is implemented in Python, uses PyTorch 36 as its deep learning backend, and is available for download at https://github.com/proteoTorch/proteoTorch. Crux version 3.2-2372dae was used for combined res-ev p-value searches (tide-search), Q-ranker post-processing (q-ranker), and conversion of pepXML files to tsv format (psm-convert). Additionally, a patch was necessary to prevent Q-ranker from crashing while writing recalibrated scores, available at https://github.com/johnhalloran321/crux-toolkit. Comet version 2019.01 rev. 5 was used for all Comet searches. Percolator post-processing was performed using version 3.05. PeptideProphet post-processing was run using the Trans-Proteomic Pipeline 8 version v5.2.0 Flammagenitus. Scavager post-processing was performed using version 0.2.1a3. Unless specified otherwise, all post-processing parameters were left at their defaults. Percolator and ProteoTorch were run using target-decoy competition 9 with flags -Y and --tdc true, respectively. Additionally, Percolator was run with flags -m $targetfile -M $decoyfile -U, thus reporting both target and decoy results and skipping the selection of the maximum-scoring PSM per peptide.
PeptideProphet was run in semi-supervised mode with reporting of all recalibrated target and decoy PSM scores, i.e., with flags ZERO DECOY=DECOY DECOYPROBS. To ensure the methods utilized the same set of input features, and thus to make the comparison of recalibration performance as fair as possible, retention time prediction was not used in the post-processors supporting this option (Percolator and Scavager). All described database searches and post-processing runs were performed on the same machine with an Intel Xeon E5-2620, 64 GB RAM, and an NVIDIA Tesla K40 GPU. Post-processing performance for all datasets in Table 1 is plotted in Figure 2, and the number of significant PSMs identified at a q-value threshold of 1% for each method per dataset is listed in Table 2. Prosit and Crux NegLog10CombinePValue (i.e., the calibrated combination of res-ev and XCorr p-values) datasets were analyzed using the universal post-processors ProteoTorch and Percolator. Comet searches are supported by, and were recalibrated using, all discussed post-processors (i.e., ProteoTorch, Percolator, PeptideProphet, Q-ranker, and Scavager). For the Comet and Crux datasets, search function accuracy is plotted for Comet XCorr and Crux NegLog10CombinePValue, respectively. For all evaluated methods, q-values were calculated using target-decoy competition. 9 At a q-value threshold of 1%, deep semi-supervised learning identifies more significant PSMs than all considered methods. Across all other evaluated q-values, deep semi-supervised learning achieves more significant identifications than Percolator for Prosit analysis, recalibration of NegLog10CombinePValue search results, and recalibration of Comet search results. For the Prosit dataset generated using the largest protein database (IGC+All, produced by searching 10,330,558 protein sequences), the use of DNNs in ProteoTorch increases the number of significant PSMs over Percolator by 5.8% at a q-value of 1%.
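The target-decoy competition q-value computation used throughout can be sketched as follows. This is a simplified illustration: FDR is estimated as decoys over targets at each score threshold, and q-values are the monotonized minima; the evaluated tools may apply small corrections (e.g., a +1 in the decoy count).

```python
def tdc_qvalues(scores, is_target):
    """Estimate q-values from a concatenated target-decoy search.

    PSMs are ranked by score; at each threshold, FDR is approximated by the
    number of decoys over the number of targets at or above that threshold.
    The q-value of a PSM is the minimum FDR over all thresholds accepting it.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fdrs, targets, decoys = [], 0, 0
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        fdrs.append(decoys / max(targets, 1))
    # Monotonize from the most permissive threshold backwards.
    qvals_ranked, running_min = [0.0] * len(fdrs), float("inf")
    for j in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[j])
        qvals_ranked[j] = running_min
    qvals = [0.0] * len(scores)
    for rank, i in enumerate(order):
        qvals[i] = qvals_ranked[rank]
    return qvals
```

Counting the target PSMs with q-value at or below 0.01 yields "significant PSMs at a 1% q-value threshold" figures of the kind reported here.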
For the complex search function NegLog10CombinePValue (developed to accurately compute p-values for high-resolution MS/MS data), DNNs increase the number of significant identifications by 8.2% on average at a strict q-value threshold of 1%, whereas Percolator yields an average increase of only 4.1%. For the non-adaptive methods, which support post-processing of Comet search results, DNNs outperform all methods on all datasets except in one case (Scavager on the Comet Yeast dataset) at q-value thresholds greater than 5%. This is the first study to explore the impact of recent deep learning advances on semi-supervised post-processing of shotgun proteomics results. Across all datasets, the use of DNNs allows ProteoTorch to outperform the other universal post-processor, Percolator. This increase in performance is markedly pronounced for complex datasets; for two of the Crux datasets generated using the already calibrated (and highly accurate) NegLog10CombinePValue search function, the shallow models used by Percolator have difficulty improving discrimination between calibrated target and decoy PSM scores (even failing to improve performance for the Crux Human dataset). In contrast, DNNs are capable of using these calibrated features to better distinguish target and decoy PSMs, resulting in double the recalibration improvement compared to the SVMs in Percolator at a strict q-value threshold of 1%. Furthermore, for the complex Prosit dataset generated by searching a massive protein database of over ten million sequences (i.e., Prosit IGC+All), deep models better differentiate targets from decoys, identifying over five percent more PSMs than Percolator at a q-value threshold of 1%. For Comet searches, post-processing using DNNs largely outperforms all other methods; over all twelve benchmark datasets, the DNNs in ProteoTorch identify fewer PSMs than one other method on only a single dataset, and only at the higher considered q-values.
For the non-adaptive method PeptideProphet, it is also worth noting the effect of using precomputed LDA weights across different datasets. While PeptideProphet improves XCorr accuracy across all q-values for the Comet Plasmodium, Comet Yeast, and Comet Human datasets, no post-processed matches are deemed significant at stringent (near-zero) q-values for Comet COVID-19, leading to worse performance than raw XCorr (i.e., without post-processing) in this low q-value range. This demonstrates that the use of precomputed weights prevents effective generalization across the different datasets that were evaluated, and even degrades Comet search accuracy on one of the datasets at very strict q-values. In this work, the impact of using deep learning models to further improve the recalibration of MS/MS database-search results was explored and compared to the shallow models used in the most accurate and widely used MS/MS post-processors. It was shown that deep semi-supervised learning is adaptable to diverse MS/MS data and feature sets, thus enabling effective universal post-processing and outperforming the state-of-the-art universal post-processor Percolator across a variety of downstream tasks: Prosit analysis, recalibration of a high-resolution p-value search function, and recalibration of Comet search results. Furthermore, DNNs significantly improve the recalibration of Comet search results compared to the other evaluated, non-adaptive post-processors PeptideProphet, Q-ranker, and Scavager. Thus, it was demonstrated that deep learning may be used to significantly improve the universal post-processing of shotgun proteomics data beyond the capabilities of existing approaches that rely on shallow machine learning models.

Figure 2 (caption): Post-processing performance for all datasets in Table 1. The x-axis corresponds to q-values and the y-axis displays the number of PSMs deemed significant at each q-value. Target-decoy competition 9 was used to compute q-values for all methods.
Prosit and Crux NegLog10CombinePValue post-processing are currently only supported by the universal post-processors ProteoTorch and Percolator. Underlying search function performance for Comet's XCorr and Crux's NegLog10CombinePValue is additionally plotted using dashes.

References:
- Mass-spectrometric exploration of proteome structure and function
- Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
- Deep Learning in Science: Theory, Algorithms, and Applications
- Searching for exotic particles in high-energy physics with deep learning
- Accurate and sensitive peptide identification with Mascot Percolator
- Semi-supervised model-based validation of peptide identifications in mass spectrometry-based proteomics
- Andromeda: a peptide search engine integrated into the MaxQuant environment
- A guided tour of the Trans-Proteomic Pipeline
- Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry
- A deeper look into Comet: implementation and features
- Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning
- Fast and accurate database searches with MS-GF+Percolator
- A cross-validation scheme for machine learning algorithms in shotgun proteomics
- Dynamic Bayesian network for accurate detection of peptides from tandem mass spectra
- Deep Speech: scaling up end-to-end speech recognition
- Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups
- Computing exact p-values for a cross-correlation shotgun proteomics score function
- Snapshot ensembles: train 1, get M for free
- Mass spectrometric identification of SARS-CoV-2 proteins from gargle solution samples of COVID-19 patients
- Scavager: a versatile postsearch validation algorithm for shotgun proteomics based on gradient boosting
- A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets
- A modified finite Newton method for fast solution of large scale linear SVMs
- Empirical statistical model to estimate the accuracy of peptide identification made by MS/MS and database search
- ProteoWizard: open source software for rapid proteomics tools development
- A draft map of the human proteome
- Adam: a method for stochastic optimization
- ImageNet classification with deep convolutional neural networks
- An integrated catalog of reference genes in the human gut microbiome
- Combining high-resolution and exact calibration to boost statistical power: a well-calibrated score function for high-resolution MS2 data
- A survey on deep learning in medical image analysis
- Fully convolutional networks for semantic segmentation
- Crux: rapid open source protein tandem mass spectrometry analysis
- Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records
- PyTorch: an imperative style, high-performance deep learning library
- Challenges in clinical metaproteomics highlighted by the analysis of acute leukemia patients with gut colonization by multidrug-resistant Enterobacteriaceae
- Improving peptide and protein identification rates using a novel semi-supervised approach in Scaffold (abstract 3141)
- Chromatogram libraries improve peptide detection and quantification by data-independent acquisition mass spectrometry
- Generating high quality libraries for DIA MS with empirically corrected peptide predictions
- Deep learning in medical image analysis
- Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets
- Adversarial super-resolution of climatological wind and solar data
- Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics
- Optimization of search engines and postprocessing approaches to maximize peptide and protein identification for high-resolution mass data
- Mass-spectrometry-based draft of the human proteome
- Combining Percolator with X!Tandem for accurate and sensitive peptide identification
- Predicting effects of noncoding variants with deep learning-based sequence model

The work of JH and DR is in part supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through grant UL1 TR001860 and a GPU donation from the NVIDIA Corporation. The work of GU and PB is in part supported by grants NSF NRT 1633631 and NIH GM123558 to PB. The authors declare no conflicts of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.