key: cord-0336140-ykyr2gvl
authors: Ivanov, Mark V.; Bubis, Julia A.; Gorshkov, Vladimir; Abdrakhimov, Daniil A.; Kjeldsen, Frank; Gorshkov, Mikhail V.
title: Boosting the MS1-only proteomics with machine learning allows 2000 protein identifications in 5-minute proteome analysis
date: 2020-10-29
journal: bioRxiv
DOI: 10.1101/2020.10.29.359075
sha: 69ba68c54ddfcfe0d3991485eae425d4eb1b9ff8
doc_id: 336140
cord_uid: ykyr2gvl

Proteome-wide analyses most often rely on tandem mass spectrometry, which consumes considerable instrument time and is one of the main obstacles to a broader acceptance of proteomics in biomedical and clinical research. Recently, we presented a fast proteomic method termed DirectMS1 based on MS1-only mass spectra acquisition and data processing. The method shortened proteome-wide analysis to a few minutes while reaching a quantitative proteome coverage depth of 1000 proteins at 1% FDR. In this work, to further increase the capabilities of the DirectMS1 method, we explored the opportunities presented by recent progress in machine learning and applied the LightGBM tree-based learning algorithm to the scoring of peptide-feature matches when processing MS1 spectra. Further, we integrated the peptide feature identification algorithm of DirectMS1 with the recently introduced peptide retention time prediction utility, DeepLC. Additional approaches to improve the performance of the DirectMS1 method are discussed and demonstrated, such as FAIMS coupled to the Orbitrap mass analyzer. As a result of all improvements to DirectMS1, we succeeded in identifying more than 2000 proteins at 1% FDR from the HeLa cell line in a 5-minute LC-MS1 analysis.
High-throughput and sensitive analytical approaches enabling proteome-wide measurements of large cohorts of samples will open the way for application of mass spectrometry-based proteomics in clinical trials, population proteomics, as well as in emerging areas of drug-to-proteome interactions and metaproteome characterization. 1, 2 Moreover, the required throughput and depth of proteome coverage in these studies have to be accompanied by protein quantitation that is consistent across the sample cohorts, which is necessary for these approaches to be useful in personalized medicine studies. 3, 4 A recent study of a large cohort of COVID-19 patients clearly demonstrated the need for ultra-high-throughput proteomics to generate hypotheses about therapeutic targets and aid classification or diagnostic decision making in clinical environments. 5 Typical MS/MS-based proteomic methods consume a great deal of instrument time, which is partially overcome by expensive multiplexing using isotopic labeling strategies, such as tandem mass tag (TMT). 6 Nevertheless, a number of recent studies employing bottom-up proteomic approaches have increasingly focused on raising the throughput of proteome-wide analysis. These studies cover all steps of the proteome characterization workflow, such as increasing the speed of sample preparation prior to MS-based analysis [7] [8] [9] [10] [11] , squeezing data acquisition time to a range of a few minutes [12] [13] [14] and increasing the throughput of data processing [15] [16] [17] [18] [19] [20] . One of the most time-consuming parts of the analysis is MS/MS data acquisition, which requires long HPLC gradient times to leave enough room for sequential isolation of as many precursor ions as possible from each MS1 spectrum for subsequent fragmentation.
A number of methods and approaches to MS/MS-free proteome analysis have been proposed and widely explored, starting from the early Accurate Mass and Time (AMT) tag method [21] [22] [23] [24] [25] to recent truly MS/MS-free realizations (i.e., without employing tandem mass spectrometry at any step in the workflow) 26, 27 . These methods rely heavily on the accuracies of both peptide m/z measurements and retention time (RT) predictions. The latter is especially important for the MS/MS-free strategy, as only retention times contain sequence-specific information in the (m/z, RT) space. 28, 29 With advances in the development of machine learning algorithms, a variety of highly accurate RT prediction models have become increasingly available. [30] [31] [32] [33] The last of these works is particularly interesting, as it shows that the performance of the DeepLC model is comparable with the other deep learning-based alternatives, yet it generalizes better between different chromatography setups. This feature is particularly useful for MS/MS-free proteomic approaches such as DirectMS1, where the number of peptides available for training the RT prediction model is typically small. Therefore, the use of a pretrained DeepLC model for rapid adaptation to a dataset obtained under particular separation conditions becomes advantageous.

Previously, we described the DirectMS1 method, in which analysis of a proteolytic peptide mixture is performed in an MS/MS-free mode of acquisition using high resolution mass spectrometry. 27 Because the method does not employ isolation and fragmentation steps, the time for peptide separation can be reduced significantly, to a few minutes, and the number of MS1 spectra available for processing is limited only by the acquisition rate of the mass analyzer operating at high mass resolution. For the first time, DirectMS1 demonstrated the capability to identify up to 1000 proteins from a HeLa cell line using only a 5-minute HPLC gradient.
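The reliance of MS/MS-free identification on both m/z accuracy and RT prediction, noted above, can be illustrated with a minimal Python sketch. It is not the DirectMS1 implementation: the `Feature` class, the function names, and the 0.5 min RT tolerance are hypothetical choices made for illustration, and the 8 ppm default simply mirrors the initial mass accuracy quoted in the search parameters of this paper.

```python
from dataclasses import dataclass

PROTON_MASS = 1.007276466812  # mass of a proton, Da


@dataclass
class Feature:
    """A detected peptide isotopic cluster (hypothetical container)."""
    mz: float      # measured monoisotopic m/z
    charge: int    # charge state assigned to the cluster
    rt: float      # apex retention time, minutes


def theoretical_mz(neutral_mass: float, charge: int) -> float:
    """m/z of a protonated peptide with the given neutral monoisotopic mass."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def ppm_error(observed_mz: float, expected_mz: float) -> float:
    """Relative mass error in parts per million."""
    return (observed_mz - expected_mz) / expected_mz * 1e6


def match_features(neutral_mass, predicted_rt, features,
                   ppm_tol=8.0, rt_tol=0.5):
    """Return features lying within both the mass and the RT tolerance.

    rt_tol (minutes) is an assumed illustrative value, not a setting
    taken from the paper.
    """
    matches = []
    for f in features:
        expected = theoretical_mz(neutral_mass, f.charge)
        mass_ok = abs(ppm_error(f.mz, expected)) <= ppm_tol
        rt_ok = abs(f.rt - predicted_rt) <= rt_tol
        if mass_ok and rt_ok:
            matches.append(f)
    return matches
```

In this sketch a candidate peptide is accepted only when a feature agrees with it in both dimensions, which is why a more accurate RT predictor directly narrows the number of random matches.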
Moreover, the average sequence coverage for each protein identified by DirectMS1 exceeded that of a standard MS/MS-based approach (even when a long gradient is used) by almost an order of magnitude, thus significantly improving the quantitation. On a proteome-wide scale, this kind of analysis efficiency was not considered feasible before because of the complexity of whole proteome samples, the lack of sequence-specific information in the measured m/z values of peptide ions, and the low accuracy of existing phenomenological retention time prediction models. However, advances in high resolution mass spectrometry technology and machine learning-

In this work, we integrated two novel machine learning algorithms into the data processing workflow of DirectMS1 to further improve its efficiency. These algorithms are DeepLC, used as the retention time prediction model in the method's peptide identification, and the gradient boosted machine LightGBM 34 , used for scoring peptide-feature matches. Also, the method was upgraded and tested for processing MS1-only data obtained using high resolution mass spectrometry combined with high-field asymmetric waveform ion mobility spectrometry, FAIMS 35 , which has found increasing application in proteomic research 36, 37 as it provides an additional separation dimension for peptides at the front end of a mass spectrometer.

Parameters for the search were as follows: minimum of 3 scans for a detected peptide isotopic cluster; minimum of one visible C13 isotope; charges from 1+ to 6+; no missed cleavage sites; and 8 ppm initial mass accuracy. All searches were performed against a concatenated Swiss-Prot human database containing 20247 protein sequences and their decoys unless otherwise stated. Results were filtered to 1% protein-level false discovery rate (FDR) using the target-decoy approach 41 with its "picked" modification 42 and "+1" correction 43 .

Data Availability.
The datasets generated and analyzed during the current study have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD022094.

The workflow. Figure 1 shows details of the DirectMS1 method. The method starts with acquiring high resolution peptide ion mass spectra. A mass spectrometer operates in MS1-only mode and simply collects spectra for eluting peptides at the speed determined by the AGC and mass resolution settings. The total number of MS1 spectra acquired during a 5-min LC gradient at the mass resolution of 120,000 at m/z 200 and AGC of 3 × 10^6 ranges from 1,000 to 1,500 depending on the mass spectrometer model.

False discovery rate analysis at peptide level is shown in Figure 3c . from which we found that 0.18% of E. coli proteins at 1% FDR in the identification results as expected (E. coli and human databases contain shows that DirectMS1 provides an accurate threshold for filtering protein identifications.

A Novel LC System Embeds Analytes in Pre-Formed Gradients for Rapid Mass Spectrometry for Translational Proteomics: Progress and Clinical Implications Mass Spectrometry Applied to Bottom-Up Proteomics: Entering the High-Throughput Era for Hypothesis Testing Data, Reagents, Assays and Merits of Proteomics for SARS-CoV-2 Research and Testing Ultra-High-Throughput Clinical Proteomics Reveals Classifiers of COVID-19 Infection Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS Sample Clean-up Strategies for ESI Mass Spectrometry Applications in Bottom-up Proteomics: Trends from Modified Filter-Aided Sample Preparation (FASP) Method Increases Peptide and Protein Identifications for Shotgun Proteomics an Ultrafast Sample-Preparation Approach for Shotgun Proteomics Comparison of In-Solution, FASP, and S-Trap Based Digestion Methods for Bottom-Up Proteomic Studies Sample Preparation by Easy Extraction and Digestion (SPEED) -A Universal, Rapid, and 
Detergent-Free Protocol for Proteomics Based on Acid Extraction Online Parallel Accumulation-Serial Fragmentation (PASEF) with a Novel Trapped Ion Mobility Mass Spectrometer Evosep One Enables Robust Deep Proteome Coverage Using Tandem Mass Tags While Significantly Reducing Instrument Time A Compact Quadrupole-Orbitrap Mass Spectrometer with FAIMS Interface Improves Proteome Coverage in Short LC Gradients Faster SEQUEST Searching for Peptide Identification from Tandem Mass Spectra A Full Open Modification Search Method Performing All-to-All Spectra Comparisons within Minutes Ultrafast and Comprehensive Peptide Identification in Mass Spectrometry-Based Proteomics Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing SW-Tandem: A Highly Efficient Tool for Large-Scale Peptide Identification with Parallel Spectrum Dot Product on Sunway TaihuLight Fast Quantitative Analysis of TimsTOF PASEF Data with MSFragger and IonQuant Proteomic Analyses Using an Accurate Mass and Time Tag Strategy Accurate Mass Measurements in Proteomics Influence of Mass Resolution on Species Matching in Accurate Mass and Retention Time (AMT) Tag Proteomics Experiments Advances in Proteomics Data Analysis and Display Using an Accurate Mass and Time Tag Approach Identification of Phosphorylated Human Peptides by Accurate Mass Measurement Alone Protein Identification in Complex Mixtures Using Multiple Enzymes with Complementary Specificity DirectMS1: MS/MS-Free Identification of 1000 Proteins of Cellular Proteomes in 5 Minutes Predicting Peptide Retention Times for Proteomics Predictive Chromatography of Peptides and Proteins as a Complementary Tool for Training, Selection, and Robust Calibration of Retention Time Models for Targeted Proteomics Improved Peptide Retention Time Prediction in Liquid Chromatography through Deep Learning Prosit: Proteome-Wide Prediction of Peptide Tandem Mass Spectra by Deep Learning DeepLC can predict retention times for 
peptides that carry as-yet unseen modifications LightGBM: A Highly Efficient Gradient Boosting Decision Tree A New Method of Separation of Multi-Atomic Ions by Mobility at Atmospheric Pressure Using a High-Frequency Amplitude-Asymmetric Strong Electric Field High-Field Asymmetric Waveform Ion Mobility Spectrometry for Mass Spectrometry-Based Proteomics Enhancement of Mass Spectrometry Performance for Proteomic Analyses Using High-Field Asymmetric Waveform Ion Mobility Spectrometry (FAIMS) The PRIDE Database and Related Tools and Resources in 2019: Improving Support for Quantification Data Open Source Software for Rapid Proteomics Tools Development Dinosaur: A Refined Open-Source Peptide MS Feature Detector Target-Decoy Search Strategy for Increased Confidence in Large-Scale Protein Identifications by Mass Spectrometry A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on the Target-Decoy Approach Prediction of Peptide Retention Times in High-Pressure Liquid Chromatography on the Basis of Amino Acid Composition LC-MS Alignment in Theory and Practice: A Comprehensive Algorithmic Review
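
The protein-level FDR filtering described in the search settings (target-decoy approach with the "+1" correction) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function name and the input layout (a list of `(score, is_decoy)` pairs, higher score being better) are hypothetical, and the "picked" pairing of target and decoy proteins is deliberately omitted.

```python
def filter_by_fdr(scored, fdr=0.01):
    """Return target entries passing a target-decoy FDR threshold.

    scored: iterable of (score, is_decoy) pairs; a higher score is better.
    FDR at each rank is estimated as (decoys + 1) / targets, i.e. with
    the "+1" correction. The "picked" target-decoy pairing is NOT
    implemented in this sketch.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    targets = 0
    decoys = 0
    best_cut = 0  # largest rank at which the estimated FDR is acceptable
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and (decoys + 1) / targets <= fdr:
            best_cut = rank
    # Keep only target entries above the chosen cut-off.
    return [item for item in ranked[:best_cut] if not item[1]]
```

One consequence of the "+1" term visible in this sketch is that at a 1% threshold no result set with fewer than 100 targets can pass, which makes the estimate conservative for small identification lists.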