65426741 Metabolite discovery through global annotation of untargeted metabolomics data Li Chen1,2, Wenyun Lu2,3, Lin Wang2,3, Xi Xing2,3, Xin Teng2, Xianfeng Zeng2,3, Antonio D.Muscarella2, Yihui Shen2, Alexis Cowan2,4, Melanie R. McReynolds2,3, Brandon Kennedy5, Ashley M. Lato6, Shawn R. Campagna6, Mona Singh2,7, Joshua Rabinowitz2,3,4,# 1Institute of Metabolism and Integrative Biology, Fudan University, Shanghai, 200438, China. 2Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, 08544, USA. 3Department of Chemistry, Princeton University, Princeton, NJ, 08544, USA. 4Department of Molecular Biology, Princeton University, Princeton, NJ, 08544, USA. 5Lotus Separation LLC, Department of Chemistry, Princeton University, Princeton, NJ, 08544, USA 6Department of Chemistry, The University of Tennessee at Knoxville, Knoxville, TN, 37996, USA 7Department of Computer Science, Princeton University, Princeton, NJ, 08544, USA. # Corresponding author, e-mail: joshr@princeton.edu Abstract A primary goal of metabolomics is to identify all biologically important metabolites. One powerful approach is liquid chromatography-high resolution mass spectrometry (LC-MS), yet most LC-MS peaks remain unidentified. Here, we present a global network optimization approach, NetID, to annotate untargeted LC-MS metabolomics data. We consider all experimentally observed ion peaks together, and assign annotations to all of them simultaneously so as to maximize a score that considers properties of peaks (known masses, retention times, MS/MS fragmentation patterns) as well network constraints that arise based on mass difference between peaks. Global optimization results in accurate peak assignment and trackable peak-peak relationships. Applying this approach to yeast and mouse data, we identify a half-dozen novel metabolites, including thiamine and taurine derivatives. Isotope tracer studies indicate active flux through these metabolites. Thus, NetID applies existing metabolomic knowledge and global optimization to annotate untargeted metabolomics data, revealing novel metabolites. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Introduction Metabolomics provides a snapshot of small-molecule concentrations in a biological system. In so doing, it reflects the integrated impact of genetics and the environment on metabolism. One important role of metabolomics is annotating previously unknown or underappreciated metabolites. For example, metabolomics facilitated identification of 2-hydroxyglutarate as an oncometabolite, eventually leading to the development of inhibitors of 2-hydroxyglutarate synthesis as anticancer agents1,2. Metabolomics also contributed to identification of a diversity of natural products3,4 and disease biomarkers5. A common experimental strategy in metabolomics is liquid chromatography-high resolution mass spectrometry (LC-MS). LC-MS metabolomics measures thousands of ion peaks, of which hundreds are associated with known metabolites. A much greater number of peaks, however, still remain unannotated. The standard approach to peak annotation is to compare exact mass and either retention time or MS/MS fragmentation pattern to authenticated standards. To facilitate such comparisons, extensive chemical databases have been developed (e.g. METLIN6, HMDB7, MoNA8, KEGG9, Pubchem10, ChEBI11 and NIST12), with software tools available for automated peak picking and database comparison. Modern software also includes features for annotating peaks arising from isotopes and adducts of known metabolites, based on co-elution and characteristic mass differences (e.g. XCMS13,14, GNPS15, MS-DIAL16, MZmine17, and CAMERA18). Such peaks seem to account for at least half of non-background LC-MS features19,20. Despite this progress, a great number of unknown peaks remain, and figuring out their identities is a primary challenge in the field. One promising approach is network analysis, capitalizing on peak-peak relationships to increase annotation scope and accuracy. Connections can be drawn based on similar responses across experiments and/or MS2 similarity. Such connections can arise either through biochemical activities or mass spectrometry phenomena, such as isotopes, adducts, or in-source fragments. While distinct metabolites typically separate chromatographically, ions connected through mass spectrometry phenomena co-elute. Workflows employing the concept of molecular connectivity have been used to build networks (e.g., GNPS21,22, CliqueMS23, MetDNA24, BioCAn25, and IPA26), and are showing increasing utility for annotating metabolomics data in diverse contexts. For example, GNPS has been used broadly in identifying natural products. Existing algorithms generally focus on metabolite peaks with MS2 spectra available, using MS2 spectral data as the main annotation driver. This is an effective strategy for annotating high abundance peaks with informative MS2 spectra, such as major secondary metabolites. It is less effective, however, for many low abundance metabolomics peaks, due to poor quality or less informative MS2 spectra. We accordingly set out to develop a network algorithm for annotating the breadth of metabolomics peaks, capitalizing on available MS2 spectra but including also low .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ abundance peaks lacking MS2 spectra. Effective incorporation of peaks without MS2 spectra required making yet better use of peak-peak relationships to enhance annotation accuracy, which we achieved through the computational approach of global optimization: not dealing with peak annotation one- by-one, but instead all at once to take full advantage of the entire available information. This global optimization strategy had not previously been applied in the context of molecular networking analysis. To this end, we present the algorithm “NetID”. Similar to existing network analysis approaches, nodes are experimentally observed non-background ion peaks and connections are mass differences between peaks. We explicitly distinguish connections due to biotransformations (“biochemical connections” linking two metabolites) from those due to mass spectrometry phenomenon (“abiotic connections” linking isotopes, adducts, and fragments to the metabolites from which they are derived). Peak annotation occurs in a single global optimization step, based on linear programming, that enforces a single formula assignment for each experimentally observed ion peak. Using this approach, we can annotate roughly 80% of untargeted metabolomics peaks, with a majority being isotopes and adducts of known metabolites. Through these efforts, we provide likely formulae for several hundred novel metabolites, and confirm the identities of half-dozen species not currently in metabolomics databases. Results NetID algorithm NetID involves three computational steps: initial annotation, scoring, and optimization (Figure 1). The workflow starts with a peak table that contains a list of peak m/z, RT, intensity, and (when available) associated MS2 spectra, with background peaks removed by comparing to a process blank sample. Each peak defines a node in the network. In the initial annotation phase, we match every experimentally measured node m/z to formulae in the HMDB database. Peaks matching to HMDB formula within 10 ppm are annotated as seed nodes, from which we extend edges to build the network. Edges connect two nodes via gain or loss of specific chemical moieties (atoms). The atom differences can occur either due to metabolism (biochemical connection) or due to mass spectrometry phenomena (abiotic connections). For example, a difference of H2 suggests an oxidation/reduction relationship and defines a biochemical edge. A difference of Na-H suggests sodium adducting and is a type of abiotic edge (adduct edge). Other atom differences define other types of abiotic connections (isotope or fragment edges). Most atom differences are specific to biochemical, adduct, isotope, or fragment edges, but a few occur in multiple categories. For example, H2O loss can be either biochemical (enzymatic dehydration) or abiotic (in-source water loss). By integrating literature and in-house data, we assembled a list of 25 biochemical atom differences and 59 abiotic atom .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ differences which together define all connections in the network (Supplementary Table 1, 2). Using these lists, starting from the seed nodes, we draw all feasible edges such that (i) Δm/z between the connected nodes matches the atom mass difference and (ii) only co-eluting peaks are connected by abiotic edges. Through the edge extension process, possible formulae are assigned to nodes outside the initial seeds. A few rounds of edge extension suffice to give thorough coverage. Due to finite mass measurement precision, a single node may be assigned multiple contradictory formulae, which are resolved at the optimization step (see Methods). NetID then scores every node and edge annotation. Node annotations are scored based on precision of m/z match to the molecular formula, precision of retention time match to known metabolite retention time and (when the relevant information is available) quality of MS2 spectra match to database structure. In addition, there is a bonus for matching to formula in HMDB and a penalty for breaking basic chemical rules (Seven Golden Rules for filtering molecular formulae27). Biochemical edges receive a positive score for MS2 spectra similarity match between the connected nodes, and are otherwise unscored. Abiotic edges are scored based on precision of co-elution with the parent metabolite, connection type (adduct, isotope, etc.), and features specific to the connection type, such as expected natural abundance for isotope peaks (see Methods). The overall impact is to assign high scores to annotations that effectively align the experimentally observed ion peaks with prior metabolomics knowledge. With a score assigned for each potential node and edge annotation, we formulate the global network optimization problem as that of maximizing the network score with linear constraints that each node and edge has a single unique annotation and that these are consistent (e.g. peaks connected by H2 edge must have formula differing by 2H). Such optimization is readily performed by linear programing with a typical runtime of hours in R on a personal computer, and results in an optimal and consistent network annotation. Global network optimization As an example of the utility of global network optimization, where all peaks and connections are simultaneously considered to enhance annotation accuracy, we present an example network containing five peaks (Figure 2A). We first match experimental measurements to the database, annotating node a and node b as seed nodes adenosine monophosphate (AMP, C10H14N5O7P) and adenosine (C10H13N5O4), respectively. We also identify five possible connections between the five nodes. Two alternative networks are generated by extending annotations. In the left network, node c is annotated as adenosine HCl adduct (C10ClH14N5O4), whereas in the right network, node c is annotated as a putative metabolite (C9H14N5O5P) resulting from CO2 loss from AMP. Node d is 13C isotope of node c in both networks. Node e is annotated as 37Cl isotope of node c in the left network, and is unannotated in the right network because there is no Cl atom in the parent molecule. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ The left network has higher total node and edge annotation scores than the right network, and thus is selected by NetID. This selection makes sense to an experienced mass spectroscopist: the 37Cl isotope signature in node e indicates that node c should contain Cl. The power of NetID is that it automatically captures such logic, and uses the power of global computational optimization to extend such inferences across the network in an automated manner. To test the NetID workflow, we applied it to both yeast and liver datasets, in both positive and negative ionization mode (Figure 2B, 2C). Considering the example of negative mode yeast data with a total of 5,588 non-background peaks, in the initial annotation step, roughly 1,600 potential formulae were assigned to 1,400 peaks, with about 200 peaks receiving multiple formula annotations. These nodes were connected by just over 50,000 potential edges. Edge extension expanded coverage to over 5,000 nodes with an average of twelve potential formulae each, highlighting the importance of scoring and network optimization to assign proper formulae. After scoring node and edge annotations, global network optimization settled on about 4,800 unique node annotations. About 20% of the annotated peaks were metabolites, 14% were putative novel metabolites, and the rest were mass spectrometry phenomena, such as adducts, fragments, isotopes. Nodes were connected by about 10,000 edges, roughly evenly split between biochemical and abiotic connections (Figure 2C, Supplementary Fig. 1A). More than 90% of annotated nodes fell into a single dominant connected network (Supplementary Fig. 1B), reflecting most peaks being connected to core metabolism. About 15% of peaks, however, remained unannotated. These unannotated peaks likely reflect deficiencies in our lists of allowed atom differences, including additional forms of mass spectrometry phenomena. For example, manual examination of the unconnected peaks revealed a dozen nickel adducts of known compounds (Supplementary Table. 3). Importantly, the annotated peaks included several hundred novel metabolite formulae (Supplementary Fig. 2, Supplementary Data 1). Collectively, these provide a wealth of opportunities for metabolite discovery. Thiamine-derived metabolites NetID optimization provided not only a list of putative metabolites, but also connections linking these putative metabolites to known metabolites. In the yeast metabolomics dataset, we found three putative metabolites that have total ion current > 105, connected in a subnetwork around thiamine. Their formulae are C12H16N4O2S (thiamine+O), C14H20N4O2S (thiamine+C2H2O) and C14H18N4O2S, (thiamine+C2H4O) (Figure 3A). While not found in HMDB, thiamine+O is documented in METLIN as a thiamine oxidation product, so we focused on the other two potential thiamine derivatives. MS/MS spectra of the putative thiamine+C2H2O and thiamine+C2H4O contained characteristic thiamine fragments. Both contained a classical pyrimidine fragment, with thiamine+C2H2O also containing an acetylated pyrimidine fragment, leading to a probable structure (Figure 3A,B). The structural assignment is further supported by the presence of an unmodified thiazole fragment. In .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ contrast, thiamine+C2H4O lacked a classical unmodified thiazole fragment, instead showing a thiazole+C2H4O fragment (and a fragment with further water loss) (Figure 3A,B). Isotope tracing experiments further confirm these two peaks contain thiamine. When fed [U- 13C]glucose as sole carbon source, yeast synthesize thiamine de novo, resulting fully labeled thiamine species, with carbon counts matching the NetID formula assignments (Figure 3C). Adding unlabeled thiamine to the [U-13C]glucose culture media, yeast uptake the unlabeled thiamine, resulting in unlabeled thiamine and M+2 labeled thiamine+C2H2O and thiamine+C2H4O species. Although discovered in yeast, these are conserved metabolites, found also in mammalian samples (Figure 3D). Acetylation is one of the 25 biochemical atom transformations allowed in NetID. The addition of C2H4O is much less common biochemically, and was captured in NetID as two steps, acetylation followed by reduction. Accordingly, we looked into thiamine metabolism to explore how thiamine+C2H4O might be produced. Thiamine pyrophosphate is an important cofactor in pyruvate dehydrogenase (PDH, the entry step to TCA cycle) (Figure 3E). The de-pyrophosphorylation product of thiamine intermediate in PDH reaction yields thiamine+C2H4O matches the proposed thiamine+C2H4O structure (Figure 3F). Based on this biochemical route, we realized that analogous products could be formed by α- ketoglutarate dehydrogenase (thiamine+C4H6O3) and branched-chain keto acid dehydrogenase (thiamine+C4H8O) (Figure 3F). Peaks at both of these exact masses were also experimentally observed, with isotope labeling results supporting their being thiamine-derived metabolites (Supplementary Fig. 3). Thus, NetID enabled the discovery of four novel thiamine-derived metabolites. N-glucosyl-taurine We similarly carried out NetID annotation of a mouse liver dataset. We observed multiple putative metabolite peaks linked to taurine, by apparent glucosylation (+C6H10O5), palmitylation (+C16H30O) and transamination (+O-NH3) (Figure 4A). The latter two, while missing in HMDB, were found in METLIN: N-palmitoyl taurine (C18H37NO4S) and sulfoacetaldehyde (C2H4O4S). To elucidate the structure of the putative taurine glucosylation product (C8H17NO8S), we chemically synthesized N- glucosyl-taurine. Synthetic N-glucosyl-taurine matched the retention time and MS/MS fragmentation pattern of the observed C8H17NO8S peak (Figure 4B,C). In liver samples of mice infused with [U- 13C]glucose, C8H17NO8S appeared in M+6 form, suggesting active synthesis of the N-glucosyl-taurine from circulating glucose (Figure 4D). N-glucosyl-taurine was not observed in yeast extract but was detected in multiple mouse tissues. Quantitation using the synthetic standard shows that liver has the highest level of glucosyl-taurine at ~170 μM (Figure 4E, Supplementary Fig. 4). This ranks among the few dozen most abundant liver metabolites. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Discussion The advent of LC-MS metabolomics revealed tens of thousands of metabolite peaks not matching known formulae, raising the possibility that the majority of metabolites remained to be discovered. While the biosphere likely contains many novel metabolites, it has been increasingly recognized that most peaks in typical untargeted metabolomics studies do not arise from novel metabolites, but rather mass spectrometry phenomena. The goal of comprehensively annotating untargeted metabolomics peaks with molecular formulae has, however, remained elusive. One promising strategy for peak annotation involves building molecular networks where nodes are LC-MS peaks (with associated molecular formulae) and edges are atom transformations linking the peaks. Here we advance this strategy by combining metabolomics knowledge with computational global optimization. We explicitly differentiate biochemical connections reflecting metabolic activity and abiotic connections arising from mass spectrometry phenomena. By formulating the peak annotation challenge as a linear program, we identify an optimal network in light of all observed peaks. Rather than weeding out peaks from mass spectrometry phenomena like adducts and natural isotopes, this approach takes advantage of the information embedded in them. It further provides traceable peak-peak relationships, which illuminate the basis for assigned formulae and suggest candidate structures. Applying this approach to untargeted LC-MS data from yeast and liver samples, we assign formulae to roughly three-quarters of all non-background peaks. In each of positive and negative mode, the annotated peaks cover about 1000 known metabolites, with on average more than four mass peaks for every metabolite (e.g. M+H plus three adduct or isotope peaks). This leaves a couple thousand unannotated peaks from each LC-MS run. Based on the observed ratio between peaks and metabolites, this likely correspond to hundreds (but not thousands) of unidentified metabolites. This number may actually be less, due to novel adducts (e.g. nickel adducts, which we discovered via careful examination of the unannotated peaks) or other mass spectrometry phenomena. Importantly, this approach has already generated likely formulae for many hundreds of putative novel metabolites (Supplementary Fig. 2, Supplementary Data 1), including a half-dozen for which we assign structures (Figure 3, 4). A key benefit of molecular network-based annotation is the ability to assimilate steadily new information21,22. Each newly identified metabolite provides an additional anchor point for optimizing the network. Other data types can be seamlessly added. For example, compound class categorization based on MS/MS data28 or retention time prediction29 can be added to score nodes. Labeling similarity upon feeding different isotope-labeled nutrients could potentially be added to score edges. Global optimization, integrating all new information comprehensively with prior knowledge to arrive at optimal annotations, is novel and potentially transformative for the field more broadly. The cycle .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ of careful experimentation and focused computational method developments holds the potential to identify most unknown metabolites over the coming decade, providing a robust blueprint of the metabolome (Figure 5). .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Methods Yeast metabolomics sample preparation and isotope labeling S. cerevisiae strain FY4 was grown for at least 10 generations in minimal essential media containing 0.4% [U-12C] or [U-13C] glucose and 10 mM ammonium sulfate with or without 0.4 mg/L thiamine hydrochloride30. Then, in mid-exponential phase, 5 mL culture broth (OD600 = 0.80) was filtered and metabolites were extracted using 1 mL extraction buffer (40:40:20:0.5 acetonitrile:methanol:water:formic acid), followed by adding 88 μL neutralization buffer (15% NH4HCO3). The extracts were kept at -20℃ for at least15 min to precipitate protein before centrifuging at 16,000 g for 10 min. The supernatant was used for LC–MS analysis. Murine metabolomics sample preparation and intravenous infusion experiment Animal studies followed protocols approved by the Princeton University Institutional Animal Care and Use Committee. Twelve-month-old female wild-type C57BL/6 mice (The Jackson Laboratory, Bar Harbor, ME) on normal diet were sacrificed by cervical dislocation and tissues quickly dissected and snap frozen in liquid nitrogen with precooled Wollenberger clamp. Frozen samples from liquid nitrogen were then transferred to −80°C freezer for storage. To extract metabolites, frozen liver tissue samples were first weighed (~ 20 mg each) and transferred to 2 mL round-bottom Eppendorf Safe-Lock tubes on dry ice. Samples were then ground into powder with a cryomill machine (Retsch, Newtown, PA) for 30 seconds at 25 Hz, and maintained at cold temperature using liquid nitrogen. For every 25 mg tissues, 922 uL extraction buffer (as above) was added to the tube, vortexed for 10 seconds, and allowed to sit on ice for 10 minutes. Then 78 L neutralization buffer was added and the samples vortexed. The samples were allowed to sit on ice for 20 minutes and then centrifuged at 16,000 g for 25 min at 4°C. The supernatants were transferred to another Eppendorf tube and centrifuged at 16,000 g for another 25 min at 4°C. The supernatants were transferred to glass vials for LC-MS analysis. A procedure blank was generated identically without tissue, which was used later to remove the background ions. Detailed methods for intravenous infusion of mice have been described previously31. Briefly, in vivo infusions were performed on 12–14-week-old C57BL/6 mice pre-catheterized in the right jugular vein (Charles River Laboratories). Mice were kept fasted for 6 h and then infused for 2.5 h with [U- 13C]glucose (200 mM, 0.1 L/min/g). The mouse infusion setup (Instech Laboratories) included a tether and swivel system so that the animal had free movement in the cage. Venous samples were taken from tail bleeds. At the end of the infusion, the mouse was euthanized by cervical dislocation and tissues were collected and extracted as above. Serum metabolites were extracted by adding 100 l methanol to 5 L of serum and centrifuging for 20 min. The supernatant was used for LC–MS analysis. LC-MS and LC-MS/MS .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ LC separation was achieved using a Vanquish UHPLC system (Thermo Fisher Scientific) with an Xbridge BEH Amide column (150×2mm, 2.5 µm particle size; Waters). Solvent A is 95:5 water: acetonitrile with 20 mM ammonium acetate and 20 mM ammonium hydroxide at pH 9.4, and solvent B is acetonitrile. The gradient is 0 min, 90% B; 2 min, 90% B; 3 min, 75%; 7 min, 75% B; 8 min, 70%, 9 min, 70% B; 10 min, 50% B; 12 min, 50% B; 13 min, 25% B; 14 min, 25% B; 16 min, 0% B, 20.5 min, 0% B; 21 min, 90% B; 25 min, 90% B. Total running time is 25 min at a flow rate of 150 µl/min. LC-MS data were collected on a Q-Exactive Plus mass spectrometer (Thermo Fisher) operating in full scan mode with a MS1 scan range of m/z 70-1000, and resolving power of 160,000 at m/z 200. Other MS parameters are as follows: sheath gas flow rate, 28 (arbitrary units); aux gas flow rate, 10 (arbitrary units); sweep gas flow rate, 1 (arbitrary units); spray voltage, 3.3 kV; capillary temperature, 320°C; S- lens RF level, 65; AGC target, 3E6 and maximum injection time, 500 ms. To demonstrate the utility of inclusion of MS2 data for NetID analysis, 1479 and 803 MS2 spectra were obtained for selected peaks with intensity > 105 in positive and negative ionization mode respectively from a previous liver dataset32. Targeted MS2 spectra were collected using the PRM function at 25 eV HCD energy with other instrument setting being, resolution 17500, AGC target 106, Maximum IT 250 ms, isolation window 1.5 m/z. Glucosyl-taurine synthesis Glucosyl-taurine synthesis was carried out following previous literature reports with slight modifications33. In brief, dry methanol was obtained by distillation of HPLC-grade methanol (Fisher; HPLC grade 0.2 micron filtered) over CaH2 (Acros Organics; ca. 93% extra pure, 0-2 mm grain size). A flame-dried round-bottom flask equipped with a reflux condenser and stir bar was charged with 2.0 g taurine (Alfa Aesar; 99%), 3.1 g D-glucose (Acros Organics; ACS reagent), and 80 mL of dry methanol. This mixture was sonicated under an inert atmosphere for 30 minutes before being returned to the manifold for the reaction. To the fine-suspension of taurine and glucose in dry methanol at room temperature, 4.0 mL 5.4 M sodium methoxide in methanol (Acros Organics) was added via glass syringe. At this point, the suspension began to dissolve and after 30 minutes, gave a clear and colorless solution. The solution was stirred vigorously under an inert atmosphere for 72 hours, which resulted in a faint peach-colored solution. This solution was chilled to 0 ˚C, and ~200 mL of absolute ethanol (200 proof) was added and precipitation was allowed to occur at this temperature for 30 minutes. Solvent was then removed by filtration over a glass filter (medium porosity), and washed with ~100 mL of absolute ethanol, affording a fine pale-yellow powder (2.4 g; crude material). NMR experiment was carried out to validate the structure of synthesized N-glucosyl-taurine. Selective TOCSY experiments using DIPSI2 spin-lock and with added chemical shift filter34 were run on a Bruker Avance III HD NMR spectrometer equipped with a custom-made QCI-F cryoprobe (Bruker, Billerica, MA) at 800 MHz and at 295.2K controlled temperature. The sample was dissolved in DMSO- .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ d6. The spectra shown on the plots are results of 200 ms SL mixing, 8 scans each. Data processing (MNova v.14, Mestrelab Research S.L., Santiago de Compostela, Spain) included zero filling, 1 Hz Gaussian apodization, phase- and baseline correction. NMR analysis suggests that the final crude material contains 5.2% N-glucosyl-taurine and unreacted substrates (Supplementary Figure 5). NetID algorithm I. Data preparation and input LC-MS raw data files (.raw) were converted to mzXML format using ProteoWizard35 (version 3.0.11392). El-MAVEN (version 7.0) was used to generate a peak table containing m/z, retention time, intensity for peaks. Parameters for peak picking were the defaults except for the following: mass domain resolution is 10 ppm; time domain resolution is 15 scans; minimum intensity is 1000; minimum peak width is 5 scans. The resulting peak table was exported to a .csv file. Redundant peak entries due to imperfect peak picking process are removed if two peaks are within 0.1 min and their m/z difference are within 2 ppm. Background peaks are removed if its intensity in procedure blank sample is > 0.5-fold of that in biological sample. The m/z of the remaining peaks are recalibrated by applying an absolute m/z adjustment factor εabsolute (independent of measured m/z) and a relative m/z adjustment factor εrelative (linearly dependent on measured m/z). For each peak i the recalibrated values im/z, adjusted are computed as 𝑖 / , = 𝑖 / , × (1 + 𝜀 ) + 𝜀 (1) The εrelative and εabsolute values are fit via linear regression using measured m/z values of selected known metabolite ion peaks and their calculated m/z. That is, for each of these known metabolite k, we have equations 𝑘 / , = 𝑘 / , × (1 + 𝜀 ) + 𝜀 (2) LC-MS/MS data were extracted from the mzXML files using lab-developed Matlab code. MS2 spectra may contain interfering product ions from co-eluting isobaric parent ions. These interfering product ions were removed by examining the extracted ion chromatogram (EIC) similarity between the product ions in MS2 data and the parent ion in MS1 data. A Pearson correlation coefficient of 0.8 was used as a cutoff to retain those product ions that has similar EIC as the parent ion. The cleaned MS2 data were exported to Excel files for further processing. Structures, formulae, m/z and MS2 spectra of metabolites were obtained from the Human Metabolome Database (HMDB, version 4.0), and retention times of selected metabolites were determined through running authentic standards using the above-mentioned LC-MS method. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ NetID algorithm requires three types of input files: a peak table (in .csv format) recording m/z, retention time, intensity for peaks; an atom difference rule table (in .csv format) containing a list of 25 biochemical atom differences and 59 abiotic atom differences which together define all connections in the network (Supplementary Table 1, 2), and metabolite information files containing structure, formula, m/z and MS2 spectra of HMDB metabolites and retention time of selected metabolites under different LC conditions. Exemplary peak table from the yeast dataset, atom difference rule table and HMDB metabolite information file are provided in Supplementary Data 2. II. Initial annotation of nodes and edges in the network The first step of NetID algorithm is to make an initial annotation for seed nodes, determine possible annotations for other nodes, and determine edges in the network. Each peak is a node in the network. We compare the experimentally measured m/z for each node to those of all metabolite formulae in the HMDB database. When the m/z difference is within 10 ppm, candidate formulae and HMDB IDs are assigned to the node, and this node is defined as a primary seed node. A primary seed node can contain more than one candidate formulae and HMDB IDs if all are within the m/z difference range. Edges connect two nodes via gain or loss of specific atoms. We assembled a list of 25 biochemical atom differences and 59 abiotic atom differences which together define all connections in the network (Supplementary Table 1, 2). Let each of these differences be denoted by Di. For each node u, if there is a node v such that the difference in the measured m/z of the nodes matches one of the those in the list of atom mass differences, we add an edge between u and v. That is, if um/z and vm/z are the experimentally measured m/z for the peaks corresponding to nodes u and v respectively (assuming vm/z > um/z for simplicity), then there is an edge between these nodes if there is some difference Di such that | 𝑣 / − 𝑢 / − 𝐷 | < 𝑣 / × 10 ppm (3) If Di is an abiotic difference, in order to add an edge, it is additionally required that the retention time between two nodes should be within 0.2 min. That is, if uRT and vRT are the retention times for u and v respectively, then it is required that | 𝑣 − 𝑢 | < 0.2 min (4) For each node, its candidate formulae set will expand due to propagating formulae from its neighboring nodes through edge atom differences. For example, when applying the atom difference of edge (u, v) on the formula assigned to primary seed node u, we can derive a new candidate formula for the connected node v. If the derived formula’s calculated m/z is within 5 ppm of node v’s measured m/z, then a new candidate formula is added for node v. Iterating the process to all candidate formulae of node u through edge (u, v) will further expand candidate formulae for node v. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ We apply the above extension process to formulae of all primary seed nodes through atom difference edges, and these new candidate formulae can themselves be used for another round of extension. Note that a primary seed node will be treated as the rest of nodes during the subsequent rounds of extension, and may as well be assigned with new formulae. To avoid duplicated efforts in the extension process, we allow formulae of primary seed nodes and biotransformed formulae thereof to be extended through both biotransformation and abiotic atom difference edges, and do not allow abiotic candidate formulae be further extended through biotransformation atom difference edges. The default extension process includes two rounds of biotransformation edge extensions and three rounds of abiotic edge extensions. III. Scoring node annotations NetID then scores every candidate node and edge annotation assigned in the initial annotation step. The node scoring system aims to assign high scores to annotations that align observed ion peaks with known metabolites based on m/z, retention time, MS/MS, and/or isotope abundances. Let the set of candidate annotation for node u be denoted as {𝑎 … 𝑎 … 𝑎 }. For each node u and each of its candidate annotation 𝑎 , let S(u, 𝑎 ) denotes the score of candidate annotation 𝑎 for node u. Different scoring components for candidate node annotations are defined as below: (a) Sm/z(u, 𝑎 ) is negative when measured m/z differs from the calculated m/z of assigned molecular formula. A larger ppm difference between calculated formula m/z and measurement m/z results to lower scores. The default scale factor is -0.5. Let 𝑎 , / be the calculated formula m/z of annotation 𝑎 , and 𝑢 / be the measured m/z of node u, then S / (𝑢, 𝑎 ) = −0.5 × 𝑢 / − 𝑎 , / / 𝑢 / × 10 (5) (b) SRT(u, 𝑎 ) is positive if the measured RT for the peak corresponding to node u matches to a known standard. A smaller difference between known and measured RT results in a higher score. Let 𝑎 , is the known RT of annotation 𝑎 , and 𝑢 be the measured RT of node u, then S (𝑢, 𝑎 ) = 1 − 𝑢 − 𝑎 , , if 𝑢 − 𝑎 , < 0.5 min Otherwise, S (𝑢, 𝑎 ) = 0 (6) (c) SMS2(u, 𝑎 ) is positive if the measured MS2 spectrum of node u matches the database MS2 spectrum of annotation 𝑎 . A dot product scoring function is used to score the MS2 spectra similarity24. The intensities of the fragment ions in the MS2 spectra are rescaled so that the highest fragment ion is set to 1. MS2 spectra are represented as W = [relative intensity of MS2 ions]n[m/z value]m, with n = 1, m = 0. Dot product (DP) and score for MS2 match (SMS2(u, 𝑎 )) are defined as below. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ 𝐷𝑃 = ∑ ∑ × ∑ (7) S (𝑢, 𝑎 ) = DP, if DP > 0.5 Otherwise S (𝑢, 𝑎 ) = 0 (8) (d) Sdatabase(u, 𝑎 ) is positive if the annotated formula 𝑎 exists in HMDB. We give a positive score to a primary seed node annotation if that annotated formula exists in HMDB. S (𝑢, 𝑎 ) = 0.5, if 𝑎 in HMDB Otherwise, S (𝑢, 𝑎 ) = 0 (9) (e) Smissing_isotope(u, 𝑎 ) is negative if an isotopic peak is missing. We penalize a formula annotation if it passes the intensity threshold (default at 5x104) but does not have isotopic peaks of specified elements. The default isotope being evaluated is 37Cl. Any other elements, such as 13C or 18O, can be included by users. S _ (𝑢, 𝑎 ) = −1, if isotopic peak is missing Otherwise S _ (𝑢, 𝑎 ) = 0 (10) (f) Srule(u, 𝑎 ) is negative if annotation 𝑎 violates basic chemical rules. We strongly penalize formulae that violate basic chemical rules, including a negative RDBE (ring and double bond equivalents), and unlikely element ratios in metabolites (O/P < 3, O/Si < 2). S (𝑢, 𝑎 ) = −10, if chemical rules are violated Otherwise, S (𝑢, 𝑎 ) = 0 (11) (g) Sderivative(u, 𝑎 ) is positive if the annotation 𝑎 is derived from a parent peak p with an annotation h that has high score Sparent(p, h), which is calculated by summing up scores in (a)-(f) for S(p, h). S (𝑢, 𝑎 ) = S (𝑝, ℎ) − 0.5 (12) S (𝑝, ℎ) = S / (𝑝, ℎ) + S (𝑝, ℎ) + S (𝑝, ℎ) + S (𝑝, ℎ) + S _ (𝑝, ℎ) + S (𝑝, ℎ) (13) This is particularly helpful in annotating abiotic peaks. For example, annotation of glutamate sodium adduct will be given a positive Sderivative when its parent node is annotated as glutamate with high Sparent score. A final score S(u, 𝑎 ) for each candidate annotation 𝑎 of node u is calculated by summing scores in (a)-(g). .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ S(𝑢, 𝑎 ) = S / (𝑢, 𝑎 ) + S (𝑢, 𝑎 ) + S (𝑢, 𝑎 ) + S (𝑢, 𝑎 ) + S _ (𝑢, 𝑎 ) + S (𝑢, 𝑎 ) + S (𝑢, 𝑎 ) (14) Note that for each node u, we have one of candidate “annotations” that corresponds to no annotation being chosen for that node. The node score for this null annotation is 0 at default, and can be set at a negative value to promote choosing actual annotations. IV. Scoring edge annotations (biological, adduct, isotope) The edge scoring system aims to assign high scores to edge annotations that correctly capture biochemical connections between metabolites (based on MS2 spectra similarity) and abiotic connections between metabolites and their mass spectrometry phenomena derivatives, such as isotopes and adducts. Biochemical, isotope, and adduct edge annotations are the most common types, and other less common abiotic connection types are then described in the subsequent section. Suppose we consider two nodes u and v that are connected by an edge (u, v). For each pair of nodes u and v such that there is an edge (u, v), let the set of candidate formula for node u and v be denoted as {𝑎 … 𝑎 … 𝑎 } and {𝑏 … 𝑏 … 𝑏 }, respectively, and let the set of candidate atom differences for edge (u, v) be {𝐷 … 𝐷 … 𝐷 }. Let S(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) be the score of choosing candidate formula 𝑎 for node u, candidate formula 𝑏 for node v and candidate atom difference 𝐷 for edge (u, v). Note that S(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) is set to be 0 if atom difference 𝐷 does not represent the formula difference of 𝑎 and 𝑏 . S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0, if 𝑎 − 𝑏 ≠ 𝐷 Different scoring components for candidate edge annotations are defined as below: (h) When node u and v have experimental measured MS2 spectra, SMS2_similarity( 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) is defined for a biochemical edge, and is a positive score if two connected nodes u and v have MS2 similarity, given the formula difference of 𝑎 and 𝑏 matches the atom difference defined by 𝐷 . SMS2_similarity is determined using the dot product (DP), as described in previous section, and reverse dot product (DP_R), which evaluates the neutral ion loss similarity in the MS2 spectra24. A reverse MS2 spectrum is represented as R = [relative intensity of MS2 ions]n[parent m/z – measured m/z value]m, with n = 1, m = 0. DP = ∑ ∑ × ∑ (15) DP_R = ∑ ∑ × ∑ (16) S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = max (DP, DP_R), if max(DP, DP_R) > 0.3 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Otherwise, S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0 (17) (i) Sco_elution(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) is defined for an abiotic edge, and is a negative score if the RT of two connected nodes differ more than a threshold (0.05 min), given the formula difference of 𝑎 and 𝑏 matches the atom difference defined by 𝐷 . S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = −5 × |𝑢 − 𝑣 |, if |𝑢 − 𝑣 | ≥ 0.05 min Otherwise, S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0 (18) (j) Stype(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) is defined for all edges, given the formula difference of 𝑎 and 𝑏 matches the atom difference defined by 𝐷 , and is a non-negative score depending on the connection type of edge, which is defined by 𝐷 , including biotransformation, adduct, isotope and fragment (Supplementary Table 1, 2). The magnitude of scores reflects the empirical confidence in the annotation type when certain atom differences occur, and can be adjusted based on personal use. S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0, if 𝐷 ϵ biotransformation S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0.5, if 𝐷 ϵ adduct S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 2, if 𝐷 ϵ isotope S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0.3, if 𝐷 ϵ fragment (19) (k) For each 𝐷 ϵ isotope, Sisotope_intensity(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) is defined for isotope edge (u, v) where 𝑏 is the isotopic derivative of 𝑎 with atom difference of 𝐷 , and is a negative score if the measured isotope peaks deviate from expected natural abundance. The score for an isotope edge depends on how likely the ratio of measured and expected isotopic intensity (Ratioisotope) is observed in an empirical normal distribution N 1, σ . Isotopes of all elements included in the atom difference table are evaluated. Ratio = / ( , , ) (20) S (𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) = 𝑙𝑜𝑔 𝜇 = Ratio N 1, σ 𝜇 = 1 N 1, σ (21) .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ σisotope is empirically defined as below, so that when measured isotope intensity is close to detection limit, a larger σisotope (a widened distribution, which is more tolerant to discrepancy) will be used. σ = 0.2 + 10 ( ) (22) A final edge annotation score S(𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) for choosing candidate formula 𝑎 for node u, candidate formula 𝑏 for node v and candidate atom difference 𝐷 for edge (u, v) is calculated by summing scores in (h)-(k), if other less common abiotic connection types are not considered (see next section). S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 (23) V. Additional abiotic edge types LC-MS metabolomics may include additional abiotic relationships. In orbitrap data, these include oligomers, multi-charge species, heterodimers, in-source fragments of known or unknown metabolites36, and ringing artifact peaks surrounding high intensity ions20,37. These relationships were included in NetID as additional edge types, which are evaluated for all m/z pairs within a predefined RT range (0.2 min). (l) Oligomer and multi-charge species. An oligomer/multi-charge edge is assigned between two nodes u and v, if their m/z satisfy |𝑣 / − n × 𝑢 / | < 𝑢 / × 10 ppm, n ϵ {positive integers} (23) (m) Heterodimer. Heterodimer peak (node v) may be observed when one abundant metabolite (node u) forms ion cluster with other ion species (node t). We examine nodes that have intensity above 105, and assign a heterodimer edge between two nodes u and v if their m/z difference satisfy |( 𝑣 / − 𝑢 / ) − 𝑡 / | < 𝑢 / × 10 ppm (24) (n) In-source fragments. Fragmentation peaks may be observed when one abundant metabolite breaks up into fragments during the ionization process. Database MS2 of known metabolites can be used to identify known ion fragmentation peaks36. If candidate annotation 𝑏 of node v is annotated with a HMDB ID associated with database MS2 spectrum, and m/z of node u matches to a fragment m/z in 𝑏 ’s MS2 spectrum, then a database fragment edge will connect such two nodes. That is, .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ 𝑢 / ϵ Database MS2 spectrum of candidate annotation 𝑏 of node v (25) Measured MS2 spectra can be used to identify unknown ion fragmentation peaks. If node v is associated with a measured MS2 spectrum, and m/z of another node u matches to a fragment m/z in the MS2 spectra, then an experiment fragment edge will connect such two nodes. That is, 𝑢 / ϵ Measured MS2 spectrum of node v (26) (o) Ringing artifacts. Ringing peaks are artifact peaks (node v) often observed on both sides of the m/z of an intense ion peak (node u) in Fourier-transformed MS instrument including orbitrap. We examine nodes that have intensity above 106, and assign a ringing artifact edge between two nodes if two nodes satisfy 50 ppm < | 𝑣 / − 𝑢 / | / 𝑢 / < 1000 ppm 𝑢 / 𝑣 > 50 (27) Scoring of these additional abiotic edges follow the same rules described in the “Scoring edge annotations” section with additional Stype defined as below. S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0.5, if 𝐷 ϵ oligomer or multi-charge S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0, if 𝐷 ϵ heterodimer S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 0.3, if 𝐷 ϵ database MS2 fragment S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 1, if 𝐷 ϵ measured MS2 fragment S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = 2, if 𝐷 ϵ ringing artifacts (28) A final edge annotation score S( 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) for choosing candidate formula 𝑎 for node u, candidate formula 𝑏 for node v and candidate atom difference 𝐷 for edge (u, v) is calculated by summing scores in (h)-(o). S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 = S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 + S _ 𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 (29) VI. Global network optimization using linear programing .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Using scores assigned for each candidate node and edge annotation, our goal is to find annotations for each node so as to maximize the sum of the scores across the network under the constraints that each node is assigned a single annotation, and that the network annotation is consistent. We use linear programming to solve this optimization problem optimally, as described next. For each node u and each of its candidate formula 𝑎 , we define a node binary decision variable 𝑥 , to denote whether candidate formula 𝑎 is selected as the annotation for node u. That is, 𝑥 , = 1, if node u is annotated with formula 𝑎 Otherwise, 𝑥 , = 0 (28) We define a binary decision variable 𝑐 , , , , to denote whether candidate formulae 𝑎 and 𝑏 are chosen for nodes u and v , and the candidate atom difference 𝐷 corresponds to the formula difference of candidate formulae 𝑎 and 𝑏 of the connected nodes u and v. That is, 𝑐 , , , , = 1, if 𝑎 and 𝑏 are chosen for nodes u and v respectively, and 𝑎 − 𝑏 = 𝐷 Otherwise, 𝑐 , , , , = 0 (29) We constrain the optimization so that each node has a single annotation, and an edge exists and only exist if the atom difference of that edge annotation matches the formula difference of nodes. As a result, the node and edge binary variables should satisfy ∑ 𝑥 , = 1 (30) 𝑐 , , , , ≤ 𝑥 , , 𝑐 , , , , ≤ 𝑥 , (31) 𝑐 , , , , ≥ 𝑥 , + 𝑥 , − 1 (32) For all variables defined above, we add the constraints that they are either 0 or 1. With each candidate node and edge annotation being scored, the objective for the optimization is to find values for all variables 𝑥 , and 𝑐 , , , , so as to maximize the sum of all node scores and edge scores in a network while satisfying the constraints. Maximize: ∑ 𝑥 , × S(𝑢, 𝑎) + ∑ 𝑐 , , , , × S(𝑢, 𝑣, 𝑎, 𝑏, 𝐷) (32) The optimization result provides a string of binary numbers that denote if a candidate node or edge annotation is selected for the global optimal network. IBM ILOG CPLEX Optimization Studio (Version 12.8.0 or later) is used to solve the linear programing problem. A cplexAPI package for R is used to .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ call CPLEX optimization function in an R environment. For the yeast datasets and using the above scoring parameters, optimization finishes within an hour on a standard laptop. Depending on the number of peaks in data tables, the entries in the atom difference tables, and the parameters involved in scoring, runtimes during internal testing ranged from minutes to 48 h. Code availability NetID was developed mainly in R, and used a mixture of IBM ILOG CPLEX Optimization Studio, Matlab and Python. NetID code is available for non-commercial use in github at https://github.com/LiChenPU/NetID, under the GNU General Public License v3.0. A ShinyR app is provided to visualize the network results from NetID in a local environment, along with a detailed user guide and example files (Supplementary Note 1, Supplementary Data 2). Acknowledgement This work was supported by a Department of Energy (DOE) grant (no. DE-SC0012461 to J.D.R.), the Center for Advanced Bioenergy and Bioproducts Innovation (grant no. DE-SC0018420, subcontract to J.D.R.) and NIH grant R50CA211437 to W.L. M.R.M is funded by the Howard Hughes Medical Institute and Burroughs Wellcome Fund via the PDEP and Hanna H. Gray Fellows Programs. We thank Istvan Pelczer at NMR facility of Department of Chemistry, Princeton University for the NMR analysis, and X. Su for scientific discussion and help. The Center for Advanced Bioenergy and Bioproducts Innovation and the Center for Bioenergy Innovation are both U.S. Department of Energy Bioenergy Research Centers supported by the Office of Biological and Environmental Research in the DOE Office of Science. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the U.S. Department of Energy. Competing interests The authors declare no competing interests. Author contributions L.C., M.S. and J.D.R. conceived the project. L.C. developed the NetID algorithm. W.L., L.W., X.Z., A.C. M.M. performed experiments on mouse. L.W., W.L. and L.C. performed experiments on yeast. L.C., W.L., L.W. and X. X. analyze LC-MS and LC-MS/MS data. X.T., A.M. and Y.S. contributed to coding development. B.K., A.M.L., and S.R.C. provided chemical synthesis of taurine-related compounds. L.C. and J.D.R. wrote the manuscript. All authors discussed the results and commented on the manuscript. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure legends Figure 1. A global network optimization approach for untargeted metabolomics data annotation (NetID). The input data are LC-MS peaks with m/z, retention times, intensities and optional MS2 spectra. The output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). Peaks are classified as “metabolite” (M+H or M-H peak of formula found in HMDB), “putative metabolite” (formula not found in HMDB but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). NetID algorithm involves three steps. Initial annotation first matches peaks to HMDB formulae. These seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. Each node and edge annotation are then scored based on match to known masses, retention times, and MS/MS fragmentation patterns. Global network optimization maximizes sum of node scores and edge scores, while enforcing a unique formula for each node and unique transformation relationship for each edge. Figure 2. Utility of global network optimization. (A) An example network demonstrating the value of the global optimization step in NetID. Node a and node b match HMDB formulae and are connected by an edge of phosphate (HPO3). Node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. The table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred for more good node and edge annotations. (B) Visualization of the optimal network obtained from negative mode LC-MS analysis of Baker’s yeast, containing 4851 nodes and 9699 connections. Metabolite and putative metabolite peaks are in green and artifact peaks in purple. (C) Summary table of NetID annotations of negative and positive mode LC-MS data from Baker's yeast and mouse liver. Figure 3. NetID reveals thiamine-derived metabolites in yeast. (A) Subnetwork surrounding thiamine. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) MS2 spectra of thiamine, thiamine+C2H2O, and thiamine+C2H4O, with proposed structures of the major fragments. (C) Labeling fraction of thiamine and its derivatives, in [U- 13C]glucose with and without unlabeled thiamine in the medium. (D) The thiamine derivatives are also found in mouse tissues and urine. (E) Proposed mechanism for formation of thiamine+C2H4O. Pyruvate dehydrogenase (PDH) decarboxylates pyruvate, and adds the resulting [C2H4O] unit (in red) to thiamine. (F) The same enzymatic mechanism occurs in oxoglutarate dehydrogenase (OGDH) and branched-chain α-ketoacid dehydrogenase complex (BCKDC), and generates thiamine+C4H6O3 and thiamine+C4H8O respectively. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 4. NetID discovers mammalian taurine derivatives. (A) Subnetwork surrounding taurine from mouse liver extract data. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) LC-MS chromatogram of N-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (C) MS2 spectrum of glucosyl-taurine peak from liver extract (top), and synthetic N-glucosyl-taurine standard (bottom). (D) Isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for 2 h with [U-13C]glucose. (E) Absolute N-glucosyl-taurine concentration in murine serum and tissues. Figure 5. NetID applies global optimization for metabolomics data annotation and metabolite discovery. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Reference 1. DiNardo, C. D. et al. Durable Remissions with Ivosidenib in IDH1-Mutated Relapsed or Refractory AML. N. Engl. J. Med. 378, 2386–2398 (2018). 2. Dang, L. et al. Cancer-associated IDH1 mutations produce 2-hydroxyglutarate. Nature 462, 739 (2009). 3. Doroghazi, J. R. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nature Chemical Biology 10, 963–968 (2014). 4. Aron, A. T. et al. Reproducible molecular networking of untargeted mass spectrometry data using GNPS. Nature Protocols 15, 1954–1991 (2020). 5. Johnson, C. H., Ivanisevic, J. & Siuzdak, G. Metabolomics: beyond biomarkers and towards mechanisms. Nature Reviews Molecular Cell Biology 17, 451–459 (2016). 6. Guijas, C. et al. METLIN: A Technology Platform for Identifying Knowns and Unknowns. Anal. Chem. 90, 3156–3164 (2018). 7. Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46, D608–D617 (2018). 8. Tsugawa, H. et al. Hydrogen Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS-FINDER Software. Anal. Chem. 88, 7946–7958 (2016). 9. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44, D457–D462 (2016). 10. Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47, D1102–D1109 (2019). .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ 11. Hastings, J. et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res 44, D1214–D1219 (2016). 12. sherena.johnson@nist.gov. NIST Standard Reference Database 1A. NIST https://www.nist.gov/srd/nist-standard-reference-database-1a (2014). 13. Tautenhahn, R., Patti, G. J., Rinehart, D. & Siuzdak, G. XCMS Online: A Web-Based Platform to Process Untargeted Metabolomic Data. Anal. Chem. 84, 5035–5039 (2012). 14. Forsberg, E. M. et al. Data processing, multi-omic pathway mapping, and metabolite activity analysis using XCMS Online. Nature Protocols 13, 633–651 (2018). 15. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology 34, 828–837 (2016). 16. Tsugawa, H. et al. A cheminformatics approach to characterize metabolomes in stable-isotope- labeled organisms. Nature Methods 16, 295 (2019). 17. Pluskal, T., Castillo, S., Villar-Briones, A. & Orešič, M. MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010). 18. Kuhl, C., Tautenhahn, R., Böttcher, C., Larson, T. R. & Neumann, S. CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets. Anal. Chem. 84, 283–289 (2012). 19. Sindelar, M. & Patti, G. J. Chemical Discovery in the Era of Metabolomics. J. Am. Chem. Soc. 142, 9097–9105 (2020). .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ 20. Wang, L. et al. Peak Annotation and Verification Engine for Untargeted LC–MS Metabolomics. Anal. Chem. 91, 1838–1846 (2019). 21. Schmid, R. et al. Ion Identity Molecular Networking in the GNPS Environment. http://biorxiv.org/lookup/doi/10.1101/2020.05.11.088948 (2020) doi:10.1101/2020.05.11.088948. 22. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat Methods 17, 905–908 (2020). 23. Senan, O. et al. CliqueMS: A computational tool for annotating in-source metabolite ions from LC-MS untargeted metabolomics data based on a coelution similarity network. 8. 24. Shen, X. et al. Metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. Nature Communications 10, 1516 (2019). 25. Alden, N. et al. Biologically Consistent Annotation of Metabolomics Data. Anal. Chem. 89, 13097–13104 (2017). 26. Del Carratore, F. et al. Integrated Probabilistic Annotation: A Bayesian-Based Annotation Method for Metabolomic Profiles Integrating Biochemical Connections, Isotope Patterns, and Adduct Relationships. Anal. Chem. (2019) doi:10.1021/acs.analchem.9b02354. 27. Kind, T. & Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics 8, 105 (2007). 28. Dührkop, K. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol (2020) doi:10.1038/s41587-020-0740-8. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ 29. Bonini, P., Kind, T., Tsugawa, H., Barupal, D. K. & Fiehn, O. Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics. Anal. Chem. 92, 7515–7522 (2020). 30. Xu, Y.-F. et al. Discovery and Functional Characterization of a Yeast Sugar Alcohol Phosphatase. ACS Chem. Biol. 13, 3011–3020 (2018). 31. Hui, S. et al. Glucose feeds the TCA cycle via circulating lactate. Nature 551, 115–118 (2017). 32. Lu, W. et al. Improved Annotation of Untargeted Metabolomics Data through Buffer Modifications That Shift Adduct Mass and Intensity. Anal. Chem. 92, 11573–11581 (2020). 33. Cho, H. J., You, J. S., Chang, K. J., Kim, K. S. & Kim, S. H. Anti-adipogenic Effect of Taurine- Carbohydrate Derivatives. Bulletin of the Korean Chemical Society 35, 1863–1866 (2014). 34. Robinson, P. T., Pham, T. N. & Uhrıń, D. In phase selective excitation of overlapping multiplets by gradient-enhanced chemical shift selective filters. Journal of Magnetic Resonance 170, 97– 103 (2004). 35. Chambers, M. C. et al. A Cross-platform Toolkit for Mass Spectrometry and Proteomics. Nat Biotechnol 30, 918–920 (2012). 36. Xue, J. et al. Enhanced in-Source Fragmentation Annotation Enables Novel Data Independent Acquisition and Autonomous METLIN Molecular Identification. Anal. Chem. 92, 6051–6059 (2020). 37. Mitchell, J. M. et al. New methods to identify high peak density artifacts in Fourier transform mass spectra and to mitigate their effects on high-throughput metabolomic data analysis. Metabolomics 14, 125 (2018). .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 1. A global network optimization approach for untargeted metabolomics data annotation (NetID). The input data are LC-MS peaks with m/z, retention times, intensities and optional MS2 spectra. The output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). Peaks are classified as “metabolite” (M+H or M-H peak of formula found in HMDB), “putative metabolite” (formula not found in HMDB but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). NetID algorithm involves three steps. Initial annotation first matches peaks to HMDB formulae. These seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. Each node and edge annotation are then scored based on match to known masses, retention times, and MS/MS fragmentation patterns. Global network optimization maximizes sum of node scores and edge scores, while enforcing a unique formula for each node and unique transformation relationship for each edge. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 2. Utility of global network optimization. (A) An example network demonstrating the value of the global optimization step in NetID. Node a and node b match HMDB formulae and are connected by an edge of phosphate (HPO3). Node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. The table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred for more good node and edge annotations. (B) Visualization of the optimal network obtained from negative mode LC-MS analysis of Baker’s yeast, containing 4851 nodes and 9699 connections. Metabolite and putative metabolite peaks are in green and artifact peaks in purple. (C) Summary table of NetID annotations of negative and positive mode LC-MS data from Baker's yeast and mouse liver. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 3. NetID reveals thiamine-derived metabolites in yeast. (A) Subnetwork surrounding thiamine. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) MS2 spectra of thiamine, thiamine+C2H2O, and thiamine+C2H4O, with proposed structures of the major fragments. (C) Labeling fraction of thiamine and its derivatives, in [U-13C]glucose with and without unlabeled thiamine in the medium. (D) The thiamine derivatives are also found in mouse tissues and urine. (E) Proposed mechanism for formation of thiamine+C2H4O. Pyruvate dehydrogenase (PDH) decarboxylates pyruvate, and adds the resulting [C2H4O] unit (in red) to thiamine. (F) The same enzymatic mechanism occurs in oxoglutarate dehydrogenase (OGDH) and branched-chain α-ketoacid dehydrogenase complex (BCKDC), and generates thiamine+C4H6O3 and thiamine+C4H8O respectively. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 4. NetID discovers mammalian taurine derivatives. (A) Subnetwork surrounding taurine from mouse liver extract data. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) LC-MS chromatogram of N-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (C) MS2 spectrum of glucosyl- taurine peak from liver extract (top), and synthetic N-glucosyl-taurine standard (bottom). (D) Isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for 2 h with [U-13C]glucose. (E) Absolute N-glucosyl-taurine concentration in murine serum and tissues. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 5. NetID applies global optimization for metabolomics data annotation and metabolite discovery. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425569doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425569 http://creativecommons.org/licenses/by-nc-nd/4.0/