10_1101-2020_05_15_090266 ---- 34320483

SpacePHARER: Sensitive identification of phages from CRISPR spacers in prokaryotic hosts

Zhang R.,1 Mirdita M.,1 Levy Karin E.,1 Norroy C.,1 Galiez C.,1,2 and Söding J.1,3

1 Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
2 Univ. Grenoble Alpes, CNRS, Grenoble INP/Institute of Engineering Univ. Grenoble Alpes, Grenoble, France
3 soeding@mpibpc.mpg.de

Summary: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes.

Availability and implementation: SpacePHARER is available as open-source (GPLv3), user-friendly command-line software for Linux and macOS: spacepharer.soedinglab.org.

I. INTRODUCTION

Viruses of bacteria and archaea (phages) are the most abundant biological entities in nature. However, little is known about their roles in the microbial ecosystem and how they interact with their hosts, as cultivating most phages and hosts in the lab is challenging. Many prokaryotes (40% of bacteria and 81% of archaea) possess an adaptive immune system against phages, the Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) system [6]. After surviving a phage infection, they can incorporate a short DNA fragment (28-42 nt) as a spacer in a CRISPR array. The transcribed spacer will be used with other Cas components for a targeted destruction of future invaders.
Some CRISPR-Cas systems require a 2-6 nucleotide long, highly conserved protospacer-adjacent motif (PAM) flanking the viral target to prevent autoimmunity. Multiple spacers targeting the same invader are not uncommon, due to either multiple infection events or the primed spacer acquisition mechanism identified in some CRISPR subtypes.

CRISPR spacers have previously been exploited to identify phage-host relationships [3, 11, 12, 15]. These methods compare individual CRISPR spacers with phage genomes using BLASTN [1] and apply stringent filtering criteria, e.g. allowing only up to two mismatches. They are thus limited to identifying very close matches. However, a higher sensitivity is crucial because phage reference databases are very incomplete and often will not contain phages highly similar to those to be identified. To increase sensitivity, (1) we compare protein coding sequences, because phage genomes are mostly coding and, to evade the CRISPR immune response, are under pressure to mutate their genome with minimal changes at the amino acid level; (2) we optimized a substitution matrix and gap penalties for short, highly similar protein fragments; (3) we combine evidence from multiple spacers matching the same phage genome.

II. METHODS

Input. SpacePHARER accepts spacer sequences as multiple FASTA files, each containing spacers from a single prokaryotic genome, or as multiple output files from the CRISPR detection tools PILER-CR [7], CRT [5], MinCED [13] or CRISPRDetect [4]. Phage genomes are supplied as separate FASTA files or can be downloaded by SpacePHARER from NCBI GenBank [2]. Optionally, additional taxonomic labels can be provided for spacers or phages to be included in the final report.

Algorithm. SpacePHARER is divided into a preprocessing step (0) and five main steps (1-5) (Figure 1A, Supp. Materials).
(0) Preprocess input: scan the phage genomes and CRISPR spacers in six reading frames, then extract and translate all putative coding fragments of at least 27 nt, with user-definable translation tables. Each query set Q consists of the translated ORFs q of the CRISPR spacers extracted from one prokaryotic genome, and each target set T comprises the putative protein sequences t from a single phage. We refer to a pair of similar q and t as a hit, and to an identified host-phage relationship Q−T as a match.
(1) Search all q against all t using the fast, sensitive MMseqs2 protein search [14], with the VTML40 substitution matrix [10], a gap open cost of 16 and a gap extension cost of 2 (Figure S1). We optimized a short, spaced k-mer pattern for the prefilter stage (10111011) with six informative ('1') positions. In addition, align all q−t hits reported by this search at the nucleotide level and prioritize near-perfect nucleotide hits (Supp. Materials).
(2) For each q−T pair, compute the P-value of the best hit, p_bh, from first-order statistics.
(3) Compute a combined score S_comb from the best-hit P-values of multiple hits between Q and T using a modified truncated-product method (Supp. Materials).
(4) Compute the false discovery rate (FDR = FP/(TP + FP)) and retain only matches with FDR < 0.05. For that purpose, SpacePHARER is run on a null model database, and the fraction of null matches with S_comb below a cutoff (the empirical P-value) is used to estimate the FDR.
(5) Scan 10 nt upstream and downstream of the phage's protospacer for a possible PAM.

Output is a tab-separated text file. Each host-phage match spans two or more lines. The first line starts with '#' and lists: prokaryote accession, phage accession, S_comb, number of hits in the match. Each following line describes an individual hit: spacer accession, phage accession, p_bh, spacer start and end, phage start and end, possible 5' PAM|3' PAM, possible 5' PAM|3' PAM on the reverse strand. If requested, the spacer–phage alignments are included. If taxonomic labels are provided, taxonomic reports based on the weighted lowest common ancestor (LCA) procedure described in [9] are created for host LCAs of each phage genome or phage LCAs of each spacer as additional tab-separated text files.

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.15.090266; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

[Figure 1. Panel A: workflow diagram from extracted spacer sets (FASTA, PILER-CR, CRT, MinCED, CRISPRDetect) and phage genomes through steps (1) to (5). Panel B: false discovery rate versus number of true positives for BLASTN and SpacePHARER, each with the eukaryotic viral control and the inverted phage control. Panel C: frequency of correct and incorrect predictions at the ranks species, genus, family, order, class and phylum for BLASTN and SpacePHARER.]

FIG. 1. (A) SpacePHARER algorithm. A query set Q consists of 6-frame translated ORFs (q) from CRISPR spacers, and a target set T consists of 6-frame translated ORFs (t) of phage proteins. (1) Search all q against all t using MMseqs2. Align the q−t hits at the nucleotide level and prioritize near-perfect nucleotide hits. (2) For each q−T pair, compute the P-value of the best hit from first-order statistics. (3) Compute the score S_comb by combining the best-hit P-values from multiple hits between Q and T using a modified truncated-product method. (4) Estimate the FDR by searching a null database. (5) Scan for a possible protospacer-adjacent motif (PAM). (B) Performance comparison between SpacePHARER (blue) and BLASTN (red), using inverted phage sequences (solid lines) or eukaryotic viral ORFs (dashed lines) as the null set, shown as the expected number of true positive (TP) predictions at different false discovery rates (FDRs). (C) Performance comparison between BLASTN (left) and SpacePHARER using the weighted lowest common ancestor procedure (LCA, right) at FDR = 0.02, evaluated by the number of correct (blue) and incorrect (red) predictions, for all host predictions made at each taxonomic rank or below.

III. RESULTS

Datasets. We randomly split a previously published spacer dataset [12] of 363,460 unique spacers from 30,389 prokaryotic genomes into an optimization set (20%, 6,067 genomes) and a test set (80%, 24,322 genomes). The performance of SpacePHARER was evaluated on the spacer test set against a target database of 7,824 phage genomes. We used two null databases: 11,304 eukaryotic viral genomes and the inverted translated sequences of the target database. Viral genomes were downloaded from GenBank in 09/2018.

The performance of SpacePHARER in Figure 1C was evaluated on a validation dataset of spacers from 1,066 bacterial genomes against 809 phage genomes with annotated host taxonomy [8]. For each phage, we predicted the host based on the host LCA.

Prediction quality. At FDR = 0.05, SpacePHARER predicted 3 to 4 times more prokaryote-phage matches than BLASTN (Figure 1B, Figure S2).
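The FDR threshold used in this comparison rests on step (4) of the algorithm, where matches against a null database stand in for false positives. The following is a minimal sketch of that idea, not SpacePHARER's actual implementation: the function names, the `null_scale` factor (a hypothetical correction for a null database of different size than the target database) and the convention that smaller scores are more significant are all assumptions made for illustration.

```python
def empirical_fdr(target_scores, null_scores, cutoff, null_scale=1.0):
    """Estimate the FDR among target matches with score <= cutoff.

    Assumption: smaller scores are more significant, and the number of
    null matches passing the cutoff (times null_scale) approximates the
    number of false positives among the target matches.
    """
    n_target = sum(1 for s in target_scores if s <= cutoff)
    if n_target == 0:
        return 0.0  # nothing is reported, so there are no false discoveries
    n_null = sum(1 for s in null_scores if s <= cutoff)
    return min(1.0, null_scale * n_null / n_target)


def largest_cutoff_below(target_scores, null_scores, max_fdr=0.05, null_scale=1.0):
    """Return the largest observed target score whose estimated FDR is < max_fdr."""
    best = None
    for cutoff in sorted(set(target_scores)):
        if empirical_fdr(target_scores, null_scores, cutoff, null_scale) < max_fdr:
            best = cutoff
    return best


# Toy scores, invented for illustration only.
target = [0.001, 0.002, 0.003, 0.004, 0.2, 0.3]
null = [0.25, 0.5, 0.6, 0.9]
fdr_at_03 = empirical_fdr(target, null, 0.3)   # one null match vs. six target matches
chosen = largest_cutoff_below(target, null)    # cutoff kept at FDR < 0.05
```

With these toy numbers, one of the four null scores falls below 0.3 while all six target scores do, so the estimate at that cutoff is 1/6 and the cutoff is rejected at FDR < 0.05.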
SpacePHARER predicted the correct host for more phages than BLASTN at all taxonomic ranks, while including most of the BLASTN predictions, at better precision (Figure 1C, Figure S3). If the host or a close relative of a phage is absent from the database (either because the host is unidentified or because the host lacks a CRISPR-Cas system), the predicted host may be correct only at a higher rank than species.

Run time. SpacePHARER took 12 minutes to process the test dataset on 2×6-core 2.40 GHz CPUs, 47 times faster than BLASTN (575 minutes).

IV. CONCLUSION

SpacePHARER is 1.4 to 4× more sensitive than BLASTN in detecting phage-host pairs, due to searching with protein sequences, optimizing short sequence comparisons, and combining statistical evidence, and it is fast enough to analyze large-scale genomic and metagenomic datasets.

FUNDING

ELK is a FEBS long-term fellowship recipient. The work was supported by the ERC's Horizon 2020 Framework Programme ['Virus-X', project no. 685778] and the BMBF CompLifeSci project horizontal4meta.

Conflict of Interest: none declared.

REFERENCES

[1] Altschul, S.F. et al. (1990). Basic local alignment search tool. J. Mol. Biol., 215(3), 403–410.
[2] Benson, D.A. et al. (2013). GenBank. Nucleic Acids Res., 41(D1), D36–D42.
[3] Biswas, A. et al. (2013). CRISPRTarget: bioinformatic prediction and analysis of crRNA targets. RNA Biol., 10(5), 817–827.
[4] Biswas, A. et al. (2016). CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genom., 17(1), 356.
[5] Bland, C. et al. (2007). CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinform., 8(1), 209.
[6] Burstein, D. et al. (2016). Major bacterial lineages are essentially devoid of CRISPR-Cas viral defence systems. Nature Communications, 7(1), 10613.
[7] Edgar, R.C. (2007). PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinform., 8(1), 18.
[8] Edwards, R.A. et al. (2015). Computational approaches to predict bacteriophage–host relationships. FEMS Microbiol. Rev., 40(2), 258–272.
[9] Mirdita, M. et al. (2020). Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv. doi:10.1101/2020.11.27.401018.
[10] Müller, T. et al. (2002). Estimating amino acid substitution models: A comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol., 19(1), 8–13.
[11] Paez-Espino, D. et al. (2016). Uncovering Earth's virome. Nature, 536(7617), 425–430.
[12] Shmakov, S.A. et al. (2017). The CRISPR spacer space is dominated by sequences from species-specific mobilomes. mBio, 8(5), e01397–17.
[13] Skennerton, C. (2016). MinCED - mining CRISPRs in environmental datasets. https://github.com/ctSkennerton/minced.
[14] Steinegger, M. and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol., 35(11), 1026–1028.
[15] Stern, A. et al. (2012). CRISPR targeting reveals a reservoir of common phages associated with the human gut microbiome. Genome Res., 22(10), 1985–1994.

10_1101-2020_02_04_934216 ----

EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning

Kathryn E. Kirchoff1 and Shawn M. Gomez2,3

1 Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
2 Department of Pharmacology, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
3 Joint Department of Biomedical Engineering at the University of North Carolina at Chapel Hill and North Carolina State University, Chapel Hill, NC 27599, USA

Abstract

Kinase-catalyzed phosphorylation of proteins forms the backbone of signal transduction within the cell, enabling the coordination of numerous processes such as the cell cycle, apoptosis, and differentiation. While on the order of 10^5 phosphorylation events have been described, we know the specific kinase performing these functions for less than 5% of cases. The ability to predict which kinases initiate specific individual phosphorylation events has the potential to greatly enhance the design of downstream experimental studies, while simultaneously creating a preliminary map of the broader phosphorylation network that controls cellular signaling. To this end, we describe EMBER, a deep learning method that integrates kinase-phylogeny information and motif-dissimilarity information into a multi-label classification model for the prediction of kinase-motif phosphorylation events. Unlike previous deep learning methods that perform single-label classification, we restate the task of kinase-motif phosphorylation prediction as a multi-label problem, allowing us to train a single unified model rather than a separate model for each of the 134 kinase families. We utilize a Siamese network to generate novel vector representations, or an embedding, of motif sequences, and we compare our novel embedding to a previously proposed peptide embedding.
Our motif vector representations are used, along with one-hot encoded motif sequences, as input to a classification network, and we incorporate kinase phylogenetic relationships into our model via a kinase phylogeny-weighted loss function. Results suggest that this approach holds significant promise for improving our map of phosphorylation relations that underlie kinome signaling.

Availability: https://github.com/gomezlab/EMBER
Correspondence: smgomez@unc.edu

Introduction

Phosphorylation is the most abundant post-translational modification of protein structure, affecting from one-third to two-thirds of eukaryotic proteins. In humans, the number of kinases catalyzing this reaction hints at its importance, with kinases being one of the largest gene families, with roughly 520 members distributed among 134 families (1–3). During phosphorylation, a kinase facilitates the addition of a phosphate group at serine, threonine, tyrosine, or histidine residues, though other sites exist. Phosphorylation of a substrate at any of these residues occurs within the context of specific consensus phosphorylation sequences, which we refer to here as "motifs". Additional substrate binding sequences within the kinase or substrate, as well as protein scaffolds that facilitate structural orientation and downstream catalysis of the reaction, modify the efficacy of motif phosphorylation. Typically, the net effect of kinase phosphorylation is to switch the downstream target into an "on" or "off" state, enabling the transmission of information throughout the cell. Kinase activity touches nearly all aspects of cellular behavior, and the alteration of kinase behavior underlies many diseases while simultaneously establishing the basis for therapeutic interventions (4–11).
Although the importance of phosphorylation in cell information processing and its dysregulation as a driver of disease is well-recognized, the map of kinase-motif phosphorylation interactions is mostly unknown. While upwards of 100,000 motifs are known to be phosphorylated, less than 5% of these have an associated kinase identified as the catalyzing agent (12). This knowledge gap provides a considerable impetus for the development of methods aimed at predicting kinase-motif phosphorylation events that, at a minimum, could help focus experimental efforts.

As a result, a number of computational tools have been developed, spanning a variety of methodological approaches including random forests (13), support vector machines (14), logistic regression (15), and Bayesian decision theory (16). Advances in deep learning have similarly spawned new approaches, with two methods recently described. The first, MusiteDeep, utilizes a convolutional neural network (CNN) with attention to generate single predictions (17). The second deep learning method, DeepPhos, exploits densely connected CNN (DC-CNN) blocks for its predictions (18). Both of these approaches train individual models for each kinase family, requiring a separate model for each of the 134 kinase families. In addition to the practical challenge of training many individual models, a further disadvantage of these two deep learning approaches is the lost opportunity to benefit from transfer learning, as models trained independently do not directly incorporate knowledge of motif phosphorylation by kinases from different kinase families.

Here, we describe EMBER (Embedding-based multi-label prediction of phosphorylation events), a deep learning approach for predicting multi-label kinase-motif phosphorylation relationships. In our approach, we utilize a Siamese neural network, modified for our multi-label prediction task, to generate a high-dimensional embedding of motif vectors. We further utilize one-hot encoded motif sequences. These two representations are leveraged together as a dual input into our classifier, improving learning and prediction. We also find that our Siamese embedding generally outperforms a previously proposed protein embedding, ProtVec, which is trained on significantly more data (19). We further integrate information regarding evolutionary relationships between kinases into our classification network loss function, informing predictions in light of the sparsity associated with these data, and we find that this information improves prediction accuracy. As EMBER utilizes transfer learning across families, we expect its accuracy to improve more than that of other deep learning approaches as more data describing kinase-substrate relationships are collected. Together, these results suggest that EMBER holds significant promise for improving our map of phosphorylation relationships that underlie the kinome and broader cellular signaling.

Kirchoff et al. | bioRxiv | February 10, 2021 | 1–10

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.04.934216; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Methods

Kinase-motif interaction data. As documented kinase-motif interactions are sparse in relation to the total number of known phosphorylation events, we attempted to maximize the number of examples of such interactions for training. To do this, we integrated multiple datasets describing kinase-motif relationships across multiple vertebrate species.
Our data were sourced from PhosphoSitePlus, PhosphoNetworks, and Phospho.ELM, all of which are collections of annotated and experimentally verified kinase-motif relationships (20–22). From these data sources, non-redundant kinase-motif relationships were extracted and integrated into a single set of interactions. We used the standard single-letter amino acid code for representation of amino acids, with an additional 'X' symbol to represent an ambiguous amino acid. We defined our motifs as peptides composed of a central phosphorylatable amino acid, either serine (S), threonine (T), or tyrosine (Y), flanked by seven amino acids on either side. Therefore, each motif is a 15-amino acid peptide or "15-mer". As a phosphorylatable amino acid may not have seven flanking amino acids to either side if it is located near the end of a substrate sequence, we used '-' to represent the absence of an amino acid in order to maintain a consistent motif length of 15 amino acids across all instances.

Deep learning models are known to generally require large numbers of examples per class in order to achieve adequate performance. Our original dataset was considerably imbalanced in that all positive labels (verified kinase-motif interactions) had a very low positive-to-negative label ratio. For example, the TLK kinase family has only nine positive labels (verified TLK-motif interactions) and more than 10,000 negative labels (lack of evidence for a TLK-motif interaction). To maximize our ability to learn from our data, we utilized only kinases that had a relatively large number of experimentally validated motif interactions, reducing the number of kinase-motif relationships to be used as input for our model. This filtering also served to considerably mitigate the label imbalances in our data. From the 7531 remaining motifs, we set aside 853 motifs for the independent test set, leaving 6678 for the training set. Then we removed any sequences from the training set that met a 60% similarity threshold with any sequence in the test set, based on Hamming distance scores. This process removed 229 motifs from the training set. Kinase labels were then grouped into their respective kinase families based on data collected from the RegPhos database (1), resulting in eight kinase families. Our resulting dataset comprises 7302 phosphorylatable motifs and their reaction-associated kinase families (Table 1). Furthermore, our data are multi-label in that a single motif may be phosphorylated by multiple kinases, including those from other families, resulting in a data point with potentially multiple positive labels.

Table 1. Summary of our kinase-motif phosphorylation dataset. Shown are the number of kinases per family along with the number of motifs phosphorylated by each kinase family in the training and test sets.

Family   Kinases   Training motifs   Testing motifs
Akt      3         382               63
CDK      21        752               116
CK2      2         775               97
MAPK     14        1275              187
PIKK     7         497               63
PKA      5         1235              231
PKC      10        1497              251
Src      11        869               99

Motif embeddings.

ProtVec embedding. We chose to investigate two methods to achieve our motif embedding. First, we considered ProtVec, a learned embedding of amino acids, originally intended for protein function classification (19). ProtVec is the result of a Word2Vec algorithm trained on a corpus of 546,790 sequences obtained from Swiss-Prot, which were broken up into 3 amino acid-long subsequences, or "3-grams". As a result of this approach, ProtVec provides a 100-dimensional distributed representation, analogous to a natural language "word embedding", that establishes coordinates for each possible amino acid 3-gram. This results in a 9048 × 100 matrix of coordinates, one 100-dimensional coordinate for each 3-gram. In a preliminary investigation, we found that averaging the ProtVec coordinates resulted in a higher-quality embedding compared to the original ProtVec coordinates.
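The per-symbol averaging mentioned above (for each amino-acid symbol, average the raw vectors of every 3-gram that contains it, as the equations that follow formalize) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name, the `unknown_token` string and the toy 2-dimensional vectors are assumptions, and the only behaviour taken from the text is that the gap symbol '-' is tied to ProtVec's "unknown" entry.

```python
import numpy as np


def average_protvec(trigrams, raw, alphabet, unknown_token="<unk>", gap_symbol="-"):
    """Average raw 3-gram vectors per alphabet symbol.

    trigrams : list of 3-gram strings, one per row of `raw`
    raw      : array-like of shape (len(trigrams), d) with raw coordinates
    alphabet : list of single-character symbols (rows of the result)
    """
    raw = np.asarray(raw, dtype=float)
    averaged = np.zeros((len(alphabet), raw.shape[1]))
    for i, symbol in enumerate(alphabet):
        if symbol == gap_symbol:
            # Assumption: the gap symbol maps to the single "unknown" row.
            rows = [k for k, gram in enumerate(trigrams) if gram == unknown_token]
        else:
            # Q_i: indices of all 3-grams that contain the symbol A_i.
            rows = [k for k, gram in enumerate(trigrams)
                    if gram != unknown_token and symbol in gram]
        if rows:  # symbols with no matching 3-gram keep a zero vector
            averaged[i] = raw[rows].mean(axis=0)
    return averaged


# Toy data, invented for illustration (real ProtVec has 9048 rows, 100 columns).
trigrams = ["AAA", "ALA", "LAA", "<unk>"]
raw = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
alphabet = ["A", "L", "-"]
averaged = average_protvec(trigrams, raw, alphabet)
```

In the toy example, 'A' occurs in the first three 3-grams and 'L' in the second and third, so their rows are the means of those raw vectors, while '-' simply inherits the "unknown" vector.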
Comparisons between the two embeddings are provided in Supplemental Material. We averaged the embedding coordinates, per amino acid, in the following fashion: We define T = [AAA, ALA, LAA, ..., unknown], the vector of 9048 amino acid 3-grams provided by the authors of ProtVec. We also define A = [A, L, S, ..., -], the alphabet comprising the 22 amino acid symbols. We equate "-" to the "unknown" character defined by ProtVec. Then, we compute the matrix of averaged ProtVec coordinates, $C^{(avg)}$, which will be 22 × 100 dimensions:

$$
C^{(avg)} =
\begin{bmatrix}
c_{0,0} & c_{0,1} & c_{0,2} & \cdots & c_{0,99} \\
c_{1,0} & c_{1,1} & c_{1,2} & \cdots & c_{1,99} \\
c_{2,0} & c_{2,1} & c_{2,2} & \cdots & c_{2,99} \\
\vdots & & & \ddots & \vdots \\
c_{21,0} & c_{21,1} & c_{21,2} & \cdots & c_{21,99}
\end{bmatrix}
\tag{1}
$$

We solve for each element of $C^{(avg)}$ based on the values of $C^{(raw)}$, the original (9048 × 100) ProtVec matrix:

$$
c^{(avg)}_{ij} = \frac{1}{|Q_i|} \sum_{k \in Q_i} c^{(raw)}_{kj}
\tag{2}
$$

where $c^{(avg)}_{ij}$ belongs to $C^{(avg)}$, $c^{(raw)}_{kj}$ belongs to $C^{(raw)}$, and

$$
Q_i = \{\, q : A_i \in T_q \,\}
\tag{3}
$$

Note that the original ProtVec matrix was 9048 × 100 dimensions; thus each j corresponds to the index of one of the 100 original ProtVec dimensions along the second tensor dimension.

Siamese embedding. We aimed to produce a final model, composed of an embedding technique and a classification method, that was specific to our motif dataset. To this end, we implemented a Siamese network to provide a novel learned representation of our motifs (Figure 1).
The Siamese network is composed of two identical "twin" networks, so called because of their identical hyperparameters as well as their identical learned weights and biases (23). During training, each twin network receives a separate motif sequence that is represented as a one-hot encoding, denoted either as a or b in Figure 1. Motifs are processed through the network until reaching the final fully-connected layers, $h_a$ and $h_b$, which provide the resultant embeddings for the original motif sequences. Next, the layers are joined by calculating the pairwise Euclidean distance, $D_w$, between $h_a$ and $h_b$. $D_w$ can be interpreted as the overall dissimilarity between the original motif sequences, a and b. The loss function operates on the final layer, striving to embed relatively more similar data points closer to each other, and relatively more different data points farther away from each other. In this way, the network amplifies the similarities and differences between motifs, and it translates such relationships into a semantically meaningful vector representation for each motif in the embedding space.

Fig. 1. Siamese network architecture, composed of twin convolutional neural networks (CNNs). The twin networks are joined at the final layer. a and b represent a pair of motifs from the training set, while $h_a$ and $h_b$ represent the respective hidden layers output by either CNN. The difference between the hidden layers is calculated to obtain the distance layer, $D_w$. $D_w$ is input into the loss along with Y, a variable indicating the dissimilarity, regarding kinase interactions, between a and b. After training is complete, the "twin" architecture is no longer necessary; each motif is input into a single twin and the output of the embedding layer gives the resultant representation of the given motif.

We utilized a contrastive loss as described in Hadsell et al. (24), but we sought to modify the function to account for the multi-label aspect of our task. The canonical Siamese loss between a pair of samples, a and b, is defined as

$$
L(a, b, Y) = (1 - Y)\,\tfrac{1}{2}\,(D_w)^2 + (Y)\,\tfrac{1}{2}\,[\max(0,\, m - D_w)]^2,
\tag{4}
$$

where $D_w$ is the Euclidean distance between the outputs of the embedding layer, m is the margin, a hyperparameter defined prior to training, and $Y \in \{0, 1\}$. The value of Y is determined by the label of each data point in the pair. If a pair of samples has identical labels, they are declared "same" (Y = 0). Conversely, if a pair of samples has different labels, they are declared "different" (Y = 1). This definition relies on the assumption that each sample may only have one true label. To adapt the original Siamese loss to account for the multi-label aspect of our task, we replaced the discrete variable Y with a continuous variable, namely, the Jaccard distance between kinase-label set pairs. Thus, our modified loss function is defined as

$$
L_J(a, b, Y) = (1 - J_{a,b})\,\tfrac{1}{2}\,(D_w)^2 + (J_{a,b})\,\tfrac{1}{2}\,[\max(0,\, m - D_w)]^2,
\tag{5}
$$

where $J_{a,b}$ is shorthand for $J(K_a, K_b)$, which is the Jaccard distance between the kinase-label set $K_a$ and the kinase-label set $K_b$, associated with motif sample a and motif sample b, respectively. Formally,

$$
J(K_a, K_b) = 1 - \frac{|K_a \cap K_b|}{|K_a \cup K_b|}
\tag{6}
$$

and consequently,

$$
0 \le J(K_a, K_b) \le 1.
\tag{7}
$$

In this way we have defined a continuous metric by which to compare a pair of motifs, rather than the usual "0" or "1" distinction.

The Siamese network was trained for 10,000 iterations on the training set, excluding the data points in the independent test set. When composing a mini-batch, we alternated between "similar" and "dissimilar" motif pairs during training. Similar pairs were defined as motifs whose $J(K_a, K_b) > 0.5$, and dissimilar pairs were defined as motifs whose $J(K_a, K_b) \le 0.5$. After training, we must produce the final embedding space to be used in training of our subsequent classification network. To obtain the final embedding, we input each motif into a single arbitrary twin of the original network (because both twins learn the same weights and biases), producing a high-dimensional (100-dimensional) vector representation of the original motif sequence. The resultant motif embedding produced by the single Siamese twin is further discussed in the Results section. We used k-nearest neighbors (k-NN) classification on each family to quantitatively compare the predictive capabilities of ProtVec and Siamese embeddings in the coordinate-only space. For our k-NN computation, we used a k of 85.

Fig. 2. EMBER model architecture. Here, the previously-trained Siamese network is colored pink, and the classifier architecture is colored orange. The 15 amino acid-length motif, a, is converted into a one-hot encoded matrix, V. The one-hot encoded matrix is then fed into a single twin from the Siamese network. The 100-d embedding, e, is output by the Siamese network. Here, we reduce e to a 2-dimensional space for illustrative purposes using UMAP. Then, e is fed into a multilayer perceptron (MLP) alongside V, which is fed into a convolutional neural network (CNN). Then, the last layers of the separate networks are concatenated, followed by a series of fully-connected layers. The final output is a vector, k, of length eight, where each value corresponds to the probability that the motif a was phosphorylated by one of the kinase families indicated in k.

Predictive model framework.
EMBER architecture. An overview of the architecture of EMBER is shown in Figure 2. EMBER takes as input raw motif sequences and the coordinates of each respective motif in the embedding space. We use one-hot encoded motifs as the second type of input into our model. Each motif sequence is represented by a 15 × 22 matrix. In addition, we utilize the embedding provided by our Siamese network, which creates a latent space of dimensions $m \times 90$, where $m$ is the number of motifs. The inputs into our classifier network, one-hot sequences and embeddings, are fed through a convolutional neural network (CNN) and a multilayer perceptron (MLP), respectively. The outputs of the two networks are then concatenated, and the concatenated layer is fed through a series of fully-connected layers (an MLP), followed by a sigmoid activation function. We performed 5-fold cross validation to assess the accuracy of our model when trained on different training-validation folds. We averaged the performance on the independent test set across the five folds to compute our final performance on the classification task.

Evaluation metrics. To quantify the performance of our models, we computed the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). These metrics were evaluated per kinase family. We also show the micro-average and macro-average for both AUROC and AUPRC. We define $\Lambda = \{\lambda_j : j = 0, \ldots, q\}$ as the set of all labels. The micro-average, $E_{micro}$, aggregates the label-wise contributions of each class:

$$E_{micro} = E\left(\sum_{\lambda=0}^{q} tp_\lambda,\ \sum_{\lambda=0}^{q} tn_\lambda,\ \sum_{\lambda=0}^{q} fp_\lambda,\ \sum_{\lambda=0}^{q} fn_\lambda\right), \tag{8}$$

where $E$ is an evaluation metric, in our case, either AUROC or AUPRC.
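The pooling step in Eq. 8 can be illustrated with a simple count-based metric. The following is a minimal sketch under stated assumptions: precision stands in for the metric $E$ (AUROC and AUPRC are computed from ranked scores, not from a single set of counts), the counts are invented for illustration, and the function names are hypothetical. For contrast, it also computes the unweighted mean of the per-label scores.

```python
from typing import Dict, List

Counts = Dict[str, int]  # keys: "tp", "tn", "fp", "fn" for one label


def precision(c: Counts) -> float:
    """Stand-in metric E (assumption: the paper uses AUROC/AUPRC instead)."""
    denom = c["tp"] + c["fp"]
    return c["tp"] / denom if denom else 0.0


def micro_average(per_label: List[Counts]) -> float:
    """Pool the counts over all labels first, then apply the metric once."""
    pooled = {k: sum(c[k] for c in per_label) for k in ("tp", "tn", "fp", "fn")}
    return precision(pooled)


def per_label_mean(per_label: List[Counts]) -> float:
    """Apply the metric to each label, then take the unweighted mean."""
    return sum(precision(c) for c in per_label) / len(per_label)


# Invented counts: one frequent label and one rare label.
counts = [
    {"tp": 90, "tn": 5, "fp": 10, "fn": 5},
    {"tp": 1, "tn": 95, "fp": 3, "fn": 1},
]
print(micro_average(counts), per_label_mean(counts))
```

Because the pooled counts are dominated by the frequent label, the micro-average in this toy case sits much closer to that label's score than the unweighted mean does.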
Alternatively, the macro-average, $E_{macro}$, takes into account the score for each respective class and averages those scores together, thus treating all classes equally:

$$E_{macro} = \frac{1}{q} \sum_{\lambda=0}^{q} E(tp_\lambda,\ tn_\lambda,\ fp_\lambda,\ fn_\lambda), \tag{9}$$

where $E$ is once again an evaluation metric, in our case, either AUROC or AUPRC. Both $E_{macro}$ and $E_{micro}$ are calculated based on $tp_\lambda$, $tn_\lambda$, $fp_\lambda$, and $fn_\lambda$, which are, respectively, the number of true positives, true negatives, false positives, and false negatives of label $\lambda$.

Kinase phylogenetic distances. We sought to leverage the phylogenetic relationships between kinases to improve predictions of kinase-motif interactions. Specifically, we considered the dissimilarity of a pair of kinase families in conjunction with the dissimilarity of the two respective groups of motifs that either kinase family phosphorylates (i.e., “kinase-family dissimilarity” vs. “motif-group dissimilarity”). Note that the terms “distance” and “dissimilarity” are interchangeable. As the phylogenetic distances given by Manning et al. (2) do not provide distances between typical and atypical kinase families, we established a proxy phylogenetic distance that allows us to define distances between these two families. We define this proxy phylogenetic distance through the Levenshtein edit distance, $\mathrm{Lev}(k_a, k_b)$, between kinase-domain sequences.
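As a rough illustration of an edit-distance-based family dissimilarity, the sketch below averages a pairwise distance over all cross-family sequence pairs. It rests on assumptions that should be read loudly: it uses the plain unit-cost Levenshtein distance rather than the BLOSUM62-weighted alignment the text goes on to describe, the sequences are invented fragments rather than real kinase domains, and the function names are hypothetical.

```python
from itertools import product
from typing import Sequence


def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (assumption: no substitution-matrix weighting)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def family_dissimilarity(fam_a: Sequence[str], fam_b: Sequence[str]) -> float:
    """Mean pairwise distance over every (k_a, k_b) pair across two families."""
    total = sum(levenshtein(ka, kb) for ka, kb in product(fam_a, fam_b))
    return total / (len(fam_a) * len(fam_b))


# Invented toy "domain" fragments, for illustration only.
family_a = ["GSGAFG", "GSGSFG"]
family_b = ["GTGSFG"]
print(family_dissimilarity(family_a, family_b))
```

Applying `family_dissimilarity` to every pair of families would fill a family-by-family dissimilarity matrix of the kind the text describes.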
Kinase-domain sequences are the specific subsequences of kinases that are directly involved in phosphorylation. These kinase-domain sequences were obtained from an online source provided by Manning et al. (2). Distances between kinase-domain sequences were calculated by performing local alignment, utilizing the BLOSUM62 substitution matrix to weight indels and substitutions. To calculate overall kinase-family dissimilarity, we took the average of the Levenshtein edit distances between each kinase domain pair, per family,

$$d(f_a, f_b) = \frac{\sum_{k_a \in f_a} \sum_{k_b \in f_b} \mathrm{Lev}(k_a, k_b)}{|f_a| \cdot |f_b|}, \tag{10}$$

where $d(f_a, f_b)$ is the dissimilarity metric (distance) between kinase family $a$ and kinase family $b$, $k_a$ is the kinase-domain sequence of a kinase belonging to family $a$, $k_b$ is the kinase-domain sequence of a kinase belonging to family $b$, and the Levenshtein distance between kinase domain $k_a$ and kinase domain $k_b$ is given by $\mathrm{Lev}(k_a, k_b)$. This formula was applied per kinase family pair and stored in an $a \times b$ kinase-family dissimilarity matrix. We will refer to this proxy metric for evolutionary dissimilarity between kinase families as the “phylogenetic distance” between kinase families.

Kinase-family dissimilarity vs. motif-group dissimilarity. For our (kinase-family dissimilarity)-(motif-group dissimilarity) correlation, we defined motif-group dissimilarity in the same manner as kinase-family dissimilarity, finding the Levenshtein distance between motifs based on local alignment using BLOSUM62. Then, we sought to find the correlation between kinase-family dissimilarity and motif-group dissimilarity. Therefore, calculation of motif-group dissimilarity, per kinase family pair, was defined identically as in Equation 10, but based on the motifs specific to each kinase family, resulting in an $a \times b$ motif-group dissimilarity matrix.

Kinase phylogenetic loss.
To leverage evolutionary relationships between kinase families in our predictions, we weighted the original binary cross entropy (BCE) loss by a kinase phylogenetic metric. Specifically, our weighted BCE loss per mini-batch is defined as

$$\mathrm{PBCE}(\hat{y}, y) = -\frac{1}{n} \sum_{i}^{n} P_i^{T}\, y_i \log(\hat{y}_i), \tag{11}$$

where $n$ is the size of the mini-batch, $y_i$ is the one-hot actual label vector for sample $i$, $\hat{y}_i$ is the predicted label vector for sample $i$, and $P_i$ is the phylogenetic weight vector for sample $i$ given by

$$P_i = \left[ w_{0,i}, \ldots, w_{|K|,i} \right]^{T}, \tag{12}$$

with $w_{k,i}$ being the average phylogenetic weight scalar of label $k$ for sample $i$:

$$w_{k,i} = \frac{1}{|L_i|} \sum_{j \in L_i} F_{k,j}, \tag{13}$$

and $F_{k,j}$ is the vector of family weights of label $k$. Finally, $L_i$ is the set of indices corresponding to positive labels for sample $i$,

$$L_i = \{\, i \in [0, \ldots, m - 1] : y_i = 1 \,\}, \tag{14}$$

where $m$ is the length of the one-hot true label vector for sample $i$.

Fig. 3. Heatmap matrix depicting pairwise kinase-domain distances. Levenshtein distances were normalized, with the yellow end of the color bar representing far distances (less similar) and the pink end representing close distances (more similar).

Results

Correlation between kinase phylogenetic dissimilarity and phosphorylated motif dissimilarity. We sought to illuminate the relationship between kinase-family dissimilarity and phosphorylated motif-group dissimilarity described in the Methods section. That is, we wanted to determine if “similar” kinases tend to phosphorylate “similar” motifs based on some quantitative metric. To this end, we calculated the correlation between average kinase-family dissimilarities and motif-group dissimilarities based on normalized pairwise alignment scores. From this, we found a Pearson correlation of 0.667, indicating a moderate positive relationship between kinase dissimilarity and that of their respective phosphorylated motifs. While moderate, this correlation between kinase dissimilarity and motif dissimilarity suggests a potential signal in the phylogenetic relationships that could be leveraged to improve predictions.

Using our normalized distances as a proxy for phylogenetic distance (see Methods), the dissimilarity between kinases is displayed as a heatmap in Figure 3. The Akt and PKC families have the greatest similarity (lowest dissimilarity) of all pairwise comparisons, with PKA-Akt and MAPK-CDK following as the next most similar family pairs. Together, these results provide motivation to incorporate both motif dissimilarity and kinase relatedness into the predictive model, as achieved through our custom phylogenetic loss function described in Methods. The effects of this approach are described later in Results.

Motif embedding via Siamese network. We sought to develop a novel learned representation of motifs using a Siamese neural network. Siamese networks were first introduced in the early 1990s as a method to solve signature verification, posed as an image-to-image matching problem (23). Siamese networks perform metric learning by exploiting the dissimilarity between a pair of data points. Training a Siamese network yields a function that produces a meaningful embedding, capturing semantic similarity in the form of a distance metric.
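The Jaccard-weighted contrastive loss from Methods (Eqs. 5 and 6) can be sketched for a single pair of embeddings as follows. This is a minimal illustration and not the authors' training code: the embeddings are short invented vectors rather than network outputs, the margin value is arbitrary, the handling of two empty label sets is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def jaccard_distance(ka: Set[str], kb: Set[str]) -> float:
    """J(Ka, Kb) = 1 - |Ka & Kb| / |Ka | Kb| for two kinase-label sets."""
    union = ka | kb
    if not union:
        return 0.0  # assumption: two empty label sets are treated as identical
    return 1.0 - len(ka & kb) / len(union)


def euclidean(ha: Sequence[float], hb: Sequence[float]) -> float:
    """D_w: Euclidean distance between the two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ha, hb)))


def jaccard_contrastive_loss(ha: Sequence[float], hb: Sequence[float],
                             ka: Set[str], kb: Set[str],
                             margin: float = 1.0) -> float:
    """(1 - J) * 0.5 * D_w^2 + J * 0.5 * max(0, margin - D_w)^2."""
    d = euclidean(ha, hb)
    j = jaccard_distance(ka, kb)
    return (1.0 - j) * 0.5 * d ** 2 + j * 0.5 * max(0.0, margin - d) ** 2


# Invented 2-d embeddings and label sets; the real embeddings are learned.
h_a, h_b = [0.0, 0.0], [3.0, 4.0]
print(jaccard_contrastive_loss(h_a, h_b, {"PKA", "Akt"}, {"PKA"}, margin=6.0))
```

With identical label sets ($J = 0$) only the first term is active and the pair is pulled together; with disjoint label sets ($J = 1$) only the margin term is active and the pair is pushed apart until its distance reaches the margin. Partially overlapping label sets blend the two terms.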
We hypothesized that incorporating high-dimensional vector representations of motifs (i.e., an embedding) into the input of a classification network would provide more predictive power than methods that do not utilize such information. In our Siamese model, we opted to use convolutional layers as described in Methods. We performed k-NN on both the ProtVec and Siamese embeddings of motifs and found that the Siamese embedding produced better predictions, on average, than the ProtVec embedding (see Table 2). More specifically, the Siamese embedding resulted in a macro-average AUROC of 0.903 compared to ProtVec’s 0.898 and a micro-average AUROC of 0.924 compared to ProtVec’s 0.902. Likewise, the Siamese embedding had better AUPRC, with a macro-average AUPRC of 0.692 compared to ProtVec’s 0.670 and a micro-average AUPRC of 0.747 compared to ProtVec’s 0.643. Furthermore, we calculated the silhouette scores of both embeddings and found our Siamese embedding to have a significantly better mean silhouette score of 0.114 compared to ProtVec’s 0.005.

We performed dimensionality reduction for visualization of the Siamese embeddings using uniform manifold approximation and projection (UMAP) (25). For our UMAP implementation, we used 200 neighbors, a minimum distance of 0.1, and Euclidean distance for our metric. The resulting 2-dimensional UMAP motif embeddings derived from the Siamese network are shown in Figure 4. As can be seen, the motifs phosphorylated by a given kinase family have a distinctive distribution in the embedding space, with some distributions being highly unique, and with some significant overlap between certain families. More specifically, our Siamese embedding shows that motifs phosphorylated by either PKC, PKA, or Akt appear to occupy a similar latent space. Similarly, motifs phosphorylated by either CDK or MAPK also occupy a similar space. These observations mirror the phylogenetic relationships shown in Figure 3, where the MAPK and CDK families have a relatively short mean evolutionary distance between them, and the PKC-PKA distance is shorter still. In addition to these overlapping families, we also observe that Src-phosphorylated motifs form a distinct cluster.

Fig. 4. Siamese embedding of motifs. Each point represents one of the 7302 motifs, and each panel displays kinase family-specific phosphorylation patterns. Each colored point corresponds to a motif in the test set phosphorylated by a member of the specified kinase family. Highlighted points are slightly enlarged in size to enhance readability.

Table 2. Area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores on independent test set prediction, given by k-NN performed on the ProtVec and Siamese embeddings.

Family          AUROC               AUPRC
                ProtVec   Siamese   ProtVec   Siamese
Akt             0.908     0.897     0.462     0.513
CDK             0.889     0.892     0.511     0.538
CK2             0.906     0.893     0.665     0.714
MAPK            0.907     0.908     0.739     0.720
PIKK            0.845     0.900     0.579     0.663
PKA             0.865     0.852     0.716     0.659
PKC             0.865     0.885     0.697     0.741
Src             0.998     0.995     0.993     0.991
macro-average   0.898     0.903     0.670     0.692
micro-average   0.902     0.924     0.643     0.747
This is likely driven by the fact that Src is the only tyrosine kinase family among the eight kinase families we investigated, with its motifs invariably having a tyrosine (Y) at the eighth position in the 15-amino acid sequence, compared to the other seven families, whose motifs have either a serine (S) or a threonine (T) in this position. This produces a significant sequence discrepancy between Src-phosphorylated motifs and the remaining motifs. The fact that Src-phosphorylated motifs cluster so precisely serves as a sanity check that our Siamese embedding is capturing sequence (dis)similarity information despite being trained through comparison of kinase-motif phosphorylation events in lieu of motif sequence comparisons. We note that the embedding produced by our Siamese network is qualitatively similar to the ProtVec embedding in terms of the kinase-label clusters indicated in the UMAP projections. The UMAP projections of the ProtVec embeddings are included in Supplementary Material.

Prediction of phosphorylation events. Following training of EMBER on both motif sequences and motif vector representations as input, we conducted an ablation test in which we removed the motif vector representation (or coordinate) input along with its respective MLP; this was achieved by applying a dropout rate of 1.00 on the final layer of the coordinate-associated MLP. This ablation test allowed us to observe how our novel motif sequence-coordinate model compares to a canonical deep learning model whose input consists solely of one-hot encoded motif sequences (such as in the methods utilized by Wang et al. (17) and Luo et al. (18)). We also compared EMBER trained with the standard BCE loss to EMBER trained with our kinase phylogenetic loss. All predictive models, as described in Table 3, were trained on identical training-validation splits and evaluated on the same independent test set.

Fig. 5. Confusion matrix for EMBER predictions on the test set. The numbers inside each box represent the raw number of predictions per box. The color scale is based on the ratio of predictions (in the corresponding box) to total predictions, per label. A lighter color corresponds to a larger ratio of predictions to total predictions.

Comparisons between the predictive capability of the models described here are quantified by AUROC and AUPRC, and these metrics are presented for each of the three models in Table 3. As indicated by Table 3, EMBER, utilizing both sequence and coordinate information, outperforms the canonical sequence model in both AUROC and AUPRC. In addition, integration of phylogenetic information into the loss provides a generally small but consistent additional boost in performance, showing the best overall results out of the three models for AUROC and AUPRC. Individual performance metric curves for each kinase label, produced by EMBER trained via the phylogenetic loss, are shown in Figure 6. A confusion matrix providing greater detail and illustrating the relative effectiveness of our model for prediction of different kinase families is shown in Figure 5. In order to compute the confusion matrix, we set a prediction threshold of 0.5, declaring any prediction above 0.5 as "positive" and any prediction equal to or less than 0.5 as "negative". As indicated by the confusion matrix, the model often confounds motifs that are phosphorylated by closely related kinase families, for example, MAPK and CDK. This is presumably due to the close phylogenetic relationship between MAPK and CDK, as indicated by their relatively low phylogenetic distance of 0.75 (Figure 3). Furthermore, our Siamese network embeds motifs of these respective families into the same relative space, as shown in Figure 4, further illustrating the confounding nature of these motifs. A similar trend is found for motifs phosphorylated by PKC, PKA, and Akt.
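The thresholding rule described above (strictly greater than 0.5 is "positive") can be sketched as follows, together with one plausible way to tally a multi-label confusion matrix. The tallying rule is an assumption, since the text does not spell it out: for every true label of a motif, the sketch counts each label predicted positive. The probabilities are invented and the function names are hypothetical.

```python
from typing import List

FAMILIES = ["Akt", "CDK", "CK2", "MAPK", "PIKK", "PKA", "PKC", "Src"]


def binarize(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Strictly above the threshold is positive; equal or below is negative."""
    return [1 if p > threshold else 0 for p in probs]


def confusion_counts(y_true: List[List[int]], y_prob: List[List[float]],
                     threshold: float = 0.5) -> List[List[int]]:
    """matrix[t][p] = number of motifs with true label t and predicted label p.

    Assumption: a motif with several true labels contributes to several rows.
    """
    n = len(FAMILIES)
    matrix = [[0] * n for _ in range(n)]
    for truth, probs in zip(y_true, y_prob):
        pred = binarize(probs, threshold)
        for t in range(n):
            if truth[t]:
                for p in range(n):
                    matrix[t][p] += pred[p]
    return matrix


# One invented motif whose true label is MAPK but which also scores high for CDK.
truth = [[0, 0, 0, 1, 0, 0, 0, 0]]
probs = [[0.1, 0.6, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1]]
m = confusion_counts(truth, probs)
print(m[FAMILIES.index("MAPK")])
```

In this toy case the MAPK row receives one count in the MAPK column and one in the CDK column, which is the kind of off-diagonal mass the text attributes to closely related families.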
The PKC, PKA, and Akt trio is also shown to be closely related, as indicated by the pairwise distances in Figure 3 and the embeddings in Figure 4.

Comparison to existing methods. We sought to compare EMBER’s performance to the two existing deep learning methods, MusiteDeep and DeepPhos, which adopt single-label models. However, this is not a straightforward comparison because EMBER was trained on sequences 15 amino acids in length, while MusiteDeep and DeepPhos were trained on sequences of 33 and 51 amino acids in length, respectively. Thus, we must elongate our 15-mers to lengths of 33 and 51 in order for MusiteDeep and DeepPhos to accept those sequences as input. To accomplish this, we queried the UniProt database to find complete protein sequences of which our test set motifs were subsequences. For instances in which a motif was a subsequence of multiple proteins, we chose a protein at random from the set. By referencing the original complete protein sequence, we were able to elongate our motifs by adding nine (in the case of MusiteDeep) or 18 (in the case of DeepPhos) amino acids to each flank of the original 15-mer motif. This resulted in a test set of 33-mers for MusiteDeep and 51-mers for DeepPhos.

We note that of the eight kinase families for which our model produces predictions, DeepPhos has functioning models for only four of the families (CDK, CK2, MAPK, and PKC), and MusiteDeep has models for only five of the families (CDK, CK2, MAPK, PKA, and PKC).
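The elongation step described above can be sketched as a substring lookup followed by slicing out the flanks. This is a minimal sketch under stated assumptions: the protein sequence is invented rather than retrieved from UniProt, only the first occurrence of the motif is used, and padding with "-" when the motif lies near a terminus is an assumption (the text does not say how such cases were handled). The function name is hypothetical.

```python
from typing import Optional


def elongate_motif(motif: str, protein: str, flank: int,
                   pad: str = "-") -> Optional[str]:
    """Extend a motif by `flank` residues on each side using its parent protein.

    Returns None if the motif is not a subsequence of the protein.
    """
    start = protein.find(motif)  # first occurrence only
    if start == -1:
        return None
    end = start + len(motif)
    left = protein[max(0, start - flank):start]
    right = protein[end:end + flank]
    # Assumption: pad with `pad` when the protein ends before the flank does.
    return left.rjust(flank, pad) + motif + right.ljust(flank, pad)


# Invented protein containing a 15-mer with a serine at the eighth position.
motif = "AAAAAAASAAAAAAA"
protein = "G" * 20 + motif + "K" * 20
print(len(elongate_motif(motif, protein, flank=9)))   # 33-mer input length
print(len(elongate_motif(motif, protein, flank=18)))  # 51-mer input length
```

Symmetric flanks keep the phosphorylated residue centered in the longer window, which is why nine and 18 residues per side yield the 33-mer and 51-mer lengths named in the text.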
We show AUROC and AUPRC results per kinase label from each of the three methods in Figure 6. EMBER outperforms MusiteDeep and DeepPhos on all four averaged metrics, indicating that our multi-label approach may be better equipped to solve the problem of kinase-motif prediction compared to the single-label approaches.

Fig. 6. AUROC and AUPRC results achieved on the independent test set by DeepPhos, MusiteDeep, and EMBER. The AUROC and AUPRC of each kinase family label is shown in the respective legends.

Table 3. AUROC and AUPRC results achieved on the independent test set across deep learning classification models. The AUROC and AUPRC are presented per kinase family for each model. From left to right, we include results for the ablated sequence-only CNN, EMBER trained using a canonical BCE loss, and EMBER trained using the kinase phylogeny-weighted loss as described in Methods.

Family          AUROC                                   AUPRC
                Seq-CNN   EMBER (BCE)   EMBER (PBCE)    Seq-CNN   EMBER (BCE)   EMBER (PBCE)
Akt             0.844     0.865         0.889           0.377     0.379         0.483
CDK             0.882     0.891         0.902           0.552     0.582         0.600
CK2             0.902     0.915         0.923           0.681     0.750         0.755
MAPK            0.898     0.908         0.907           0.704     0.729         0.730
PIKK            0.850     0.864         0.889           0.534     0.557         0.610
PKA             0.838     0.856         0.867           0.657     0.689         0.718
PKC             0.857     0.877         0.888           0.704     0.732         0.763
Src             0.997     0.995         0.996           0.993     0.993         0.994
macro-average   0.884     0.897         0.908           0.650     0.676         0.706
micro-average   0.909     0.921         0.928           0.715     0.745         0.765

Discussion

Illuminating the map of kinase-substrate interactions has the potential to enhance our understanding of basic cellular signaling as well as drive health applications, for example, by facilitating the development of novel kinase inhibitor-based therapies that disrupt kinase signaling pathways. Here, we have presented a deep learning-based approach that aims to predict which substrates are likely to be phosphorylated by a specific kinase family. In particular, our multi-label approach establishes a unified model that utilizes all available kinase-motif data to learn broader structures within the data and improve predictions across all kinase families in tandem. This approach avoids challenges in hyperparameter tuning inherent in the development of an individual model for each kinase. We believe that this approach will enable continuing improvement in predictions, as newly generated data describing any kinase-motif phosphorylation event can assist in improving predictions for all kinases. That is, a kinase-motif interaction discovered for PKA will improve the predictions not just for PKA, but also for Akt, PKC, MAPK, etc., through the transfer learning capabilities inherent in our multi-label model.

We showed that incorporation of a learned vector representation of motifs, namely the motifs’ coordinates in the Siamese embedding space, serves to improve performance over a model that utilizes only one-hot encoded motif sequences as input. Not only did the Siamese embedding improve prediction of phosphorylation events through a neural network architecture, but it also outperformed ProtVec, a previously developed embedding, in a coordinate-based k-NN task. This improvement over ProtVec was in spite of the fact that the Siamese network utilized fewer than 7,000 training sequences of 15 amino acids in length compared to ProtVec’s 500,000 sequences of approximately 300 amino acids in average length.
The Siamese embedding was further generated through direct comparison of kinase-motif phos- phorylation events rather than simply the sequence-derived data used by ProtVec. Furthermore, ProtVec is a generalized protein embedding while the Siamese embedding described here has the potential to be customized. For example, the use of the Jaccard distance in the Siamese loss allows the network to be trained on any number of multi-label datasets such acetylation, methylation, and carbonylation reactions. We also found that there is a small though meaningful rela- tionship between the evolutionary distance between kinases and the motifs they phosphorylate, supporting the concept that closely related kinases will tend to phosphorylate similar motifs. When encoded in the form of our phylogenetic loss function, this relationship was able to slightly improve pre- diction accuracies. Together, these results suggest that EM- BER holds significant promise towards the task of illuminat- ing the currently unknown relationships between kinases and the substrates they act on. ACKNOWLEDGEMENTS We would like to acknowledge members of the GomezLab for helpful comments and feedback. This work was supported by grants through the National Institutes of Health (Grant #s CA177993, CA233811, CA238475, DK116204). Bibliography 1. Tzong-Yi Lee, Justin Bo-Kai Hsu, Wen-Chi Chang, and Hsien-Da Huang. RegPhos: a system to explore the protein kinase-substrate phosphorylation network in humans. Nucleic Acids Res., 39(Database issue):D777–87, January 2011. 2. G Manning, D B Whyte, R Martinez, T Hunter, and S Sudarsanam. The protein kinase complement of the human genome. Science, 298(5600):1912–1934, December 2002. 3. Panayotis Vlastaridis, Pelagia Kyriakidou, Anargyros Chaliotis, Yves Van de Peer, Stephen G Oliver, and Grigoris D Amoutzias. Estimating the total number of phosphopro- teins and phosphorylation sites in eukaryotic proteomes. Gigascience, 6(2):1–11, February 2017. 4. 
Leah J Wilson, Adam Linley, Dean E Hammond, Fiona E Hood, Judy M Coulson, David J MacEwan, Sarah J Ross, Joseph R Slupsky, Paul D Smith, Patrick A Eyers, and Ian A Prior. New perspectives, opportunities, and challenges in exploring the human protein kinome. Cancer Res., December 2017. 5. G L Johnson and Razvan Lapadat. Mitogen-Activated protein kinase pathways mediated by ERK , JNK , and p38 protein kinases. Science, 298(5600):1911, 2002. 6. Gayathri K Perera, Chrysanthi Ainali, Ekaterina Semenova, Christian Hundhausen, Guillermo Barinaga, Deepika Kassen, Andrew E Williams, Muddassar M Mirza, Mercedesz Balazs, Xiaoting Wang, Robert Sanchez Rodriguez, Andrej Alendar, Jonathan Barker, Sophia Tsoka, Wenjun Ouyang, and Frank O Nestle. Integrative biology approach iden- tifies cytokine targeting strategies for psoriasis. Sci. Transl. Med., 6(223):223ra22, February 2014. 7. Nicole Tegtmeyer, Matthias Neddermann, Carmen Isabell Asche, and Steffen Backert. Sub- version of host kinases: a key network in cellular signaling hijacked by helicobacter pylori CagA. Mol. Microbiol., May 2017. 8. Amandine Charras, Pinelopi Arvaniti, Christelle Le Dantec, Marina I Arleevskaya, Kaliopi Zachou, George N Dalekos, Anne Bordon, and Yves Renaudineau. JAK inhibitors suppress innate epigenetic reprogramming: a promise for patients with sjögren’s syndrome. Clin. Rev. Allergy Immunol., June 2019. 9. Alessia Alunno, Ivan Padjen, Antonis Fanouriakis, and Dimitrios T Boumpas. Pathogenic and therapeutic relevance of JAK/STAT signaling in systemic lupus erythematosus: Integra- tion of distinct inflammatory pathways and the prospect of their inhibition with an oral agent. Cells, 8(8), August 2019. 10. Ya Nan Deng, Joseph A Bellanti, and Song Guo Zheng. Essential kinases and transcrip- tional regulators and their roles in autoimmunity. Biomolecules, 9(4), April 2019. 11. 
Kyla A L Collins, Timothy J Stuhlmiller, Jon S Zawistowski, Michael P East, Trang T Pham, Claire R Hall, Daniel R Goulet, Samantha M Bevill, Steven P Angus, Sara H Velarde, Noah Sciaky, Tudor I Oprea, Lee M Graves, Gary L Johnson, and Shawn M Gomez. Proteomic analysis defines kinase taxonomies specific for subtypes of breast cancer. Oncotarget, 9 (21):15480–15497, March 2018. 12. Elise J Needham, Benjamin L Parker, Timur Burykin, David E James, and Sean J Humphrey. Illuminating the dark phosphoproteome. Sci. Signal., 12(565), January 2019. 13. Wenwen Fan, Xiaoyi Xu, Yi Shen, Huanqing Feng, Ao Li, and Minghui Wang. Prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional infor- mation and random forest. Amino Acids, 46(4):1069–1078, April 2014. 14. Shu-Yun Huang, Shao-Ping Shi, Jian-Ding Qiu, and Ming-Chu Liu. Using support vector machines to identify protein phosphorylation sites in viruses. J. Mol. Graph. Model., 56: 84–90, March 2015. 15. Fuyi Li, Chen Li, Tatiana T Marquez-Lago, André Leier, Tatsuya Akutsu, Anthony W Purcell, A Ian Smith, Trevor Lithgow, Roger J Daly, Jiangning Song, and Kuo-Chen Chou. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphory- lation sites in the human proteome. Bioinformatics, 34(24):4223–4231, December 2018. 16. Yu Xue, Ao Li, Lirong Wang, Huanqing Feng, and Xuebiao Yao. PPSP: prediction of PK- specific phosphorylation site with bayesian decision theory. BMC Bioinformatics, 7:163, March 2006. 17. Duolin Wang, Shuai Zeng, Chunhui Xu, Wangren Qiu, Yanchun Liang, Trupti Joshi, and Dong Xu. MusiteDeep: a deep-learning framework for general and kinase-specific phos- phorylation site prediction. Bioinformatics, 33(24):3909–3916, December 2017. 18. Fenglin Luo, Minghui Wang, Yu Liu, Xing-Ming Zhao, and Ao Li. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics, 35(16):2766–2773, August 2019. 19. 
Ehsaneddin Asgari and Mohammad R K Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10(11):e0141287, November 2015. 20. Peter V Hornbeck, Jon M Kornhauser, Sasha Tkachev, Bin Zhang, Elzbieta Skrzypek, Beth Murray, Vaughan Latham, and Michael Sullivan. PhosphoSitePlus: a comprehen- sive resource for investigating the structure and function of experimentally determined post- translational modifications in man and mouse. Nucleic Acids Res., 40(Database issue): D261–70, January 2012. 21. Jianfei Hu, Hee-Sool Rho, Robert H Newman, Jin Zhang, Heng Zhu, and Jiang Qian. Phos- phoNetworks: a database for human phosphorylation networks. Bioinformatics, 30(1):141– 142, January 2014. 22. Holger Dinkel, Claudia Chica, Allegra Via, Cathryn M Gould, Lars J Jensen, Toby J Gibson, and Francesca Diella. Phospho.ELM: a database of phosphorylation sites–update 2011. Nucleic Acids Res., 39(Database issue):D261–7, January 2011. 23. Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signa- ture verification using a “siamese” time delay neural network. In J D Cowan, G Tesauro, and J Alspector, editors, Advances in Neural Information Processing Systems 6, pages 737–744. Morgan-Kaufmann, 1994. 24. R Hadsell, S Chopra, and Y LeCun. Dimensionality reduction by learning an invariant map- ping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recog- nition (CVPR’06), volume 2, pages 1735–1742, June 2006. Kirchoff et al. | EMBER: kinase-substrate multi-label prediction bioR‰iv | 9 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. 
25. Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. February 2018.
Supplementary Material
EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning
1. Metrics. Definitions of the metrics underlying the area under the receiver operator curve (AUROC) and the area under the precision-recall curve (AUPRC), as described in the main manuscript: The AUROC is the integral of the receiver operator curve, which is found by plotting the true positive rate (TPR) against the false positive rate (FPR) at various decision thresholds. The TPR is defined as TP / (TP + FN), where TP are the true positive predictions and FN are the false negative predictions. The FPR is defined as FP / (FP + TN), where FP are the false positive predictions and TN are the true negative predictions. The AUPRC is the integral of the precision-recall curve, which is found by plotting the precision against the recall (i.e. the TPR) at various decision thresholds. Precision is defined as TP / (TP + FP), where TP are the true positive predictions and FP are the false positive predictions.
2. ProtVec embedding figures. Here, we show a qualitative comparison, via a UMAP reduction, between the original ProtVec embedding and the averaged ProtVec embedding.
Original: [figure: UMAP reduction of the original ProtVec embedding]
Averaged: [figure: UMAP reduction of the averaged ProtVec embedding]
3. ProtVec embedding kNN results. In the table below we show the AUROC and AUPRC results of the kNN classification task on the original ProtVec embedding and the averaged ProtVec embedding. For our kNN calculation, we used k = 85.
Kinase   AUROC original ProtVec   AUROC averaged ProtVec   AUPRC original ProtVec   AUPRC averaged ProtVec
Akt      0.832                    0.908                    0.421                    0.462
CDK      0.857                    0.889                    0.410                    0.511
CK2      0.872                    0.906                    0.590                    0.665
MAPK     0.888                    0.907                    0.688                    0.739
PIKK     0.810                    0.845                    0.396                    0.579
PKA      0.816                    0.865                    0.625                    0.716
PKC      0.830                    0.865                    0.638                    0.697
Src      0.979                    0.998                    0.906                    0.993
macro    0.861                    0.898                    0.584                    0.670
micro    0.851                    0.902                    0.551                    0.643
4. Hardware. Training and testing of EMBER occurred on a Linux system with the following configuration:
- Pop!_OS Linux 20.04
- Intel Xeon E5-2620 v4 with 32 cores @ 3.0 GHz
- 128 GB RAM
- Nvidia Titan Xp
On this system, the Siamese network took around 12 minutes to train, and the classification network took around 6 minutes to train.
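The metric definitions in Section 1 of this supplement can be sketched in a few lines of Python. This is only an illustrative sketch, not code from EMBER: the function name `rates_at_threshold`, the toy scores and labels, and the convention of reporting a precision of 1.0 when nothing is predicted positive are all assumptions made here. Sweeping the threshold and integrating the resulting (FPR, TPR) or (recall, precision) points would give the AUROC or AUPRC, respectively.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR, precision) when score >= threshold is called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # TP / (TP + FN), i.e. recall
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # FP / (FP + TN)
    # Assumed convention: precision is 1.0 when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # TP / (TP + FP)
    return tpr, fpr, precision


# Hypothetical toy data: five predictions, two of which are true positives.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
tpr, fpr, precision = rates_at_threshold(scores, labels, threshold=0.5)
print(tpr, fpr, precision)
```

At the threshold of 0.5 used here, one of the two positives is recovered and one of the three negatives is called positive.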
10_1101-2020_01_28_923532 ---- The Landscape of Precision Cancer Combination Therapy: A Single-Cell Perspective Saba Ahmadi1,7^, Pattara Sukprasert2,8^, Rahulsimham Vegesna3, Sanju Sinha3, Fiorella Schischlik3, Natalie Artzi4,5,6, Samir Khuller2,8, Alejandro A. Schäffer3*, Eytan Ruppin3* 1 Dept. of Computer Science, University of Maryland, College Park MD 20742 USA 2 Dept. of Computer Science, Northwestern University, Evanston IL 60208 USA 3 Cancer Data Science Laboratory, National Cancer Institute, Bethesda, MD 20892 USA 4 Dept. of Medicine, Engineering in Medicine Division, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02139 USA 5 Broad Institute of Harvard and MIT, Cambridge, MA 02139 USA 6 Institute for Medical Engineering and Science, MIT, Cambridge, MA 02139 USA 7 Part of this research was done while at Dept. of Computer Science, Northwestern University, Evanston IL 60208 USA 8 Part of this research was done while at Dept. of Computer Science, University of Maryland, College Park MD 20742 USA ^ Equally contributing first authors * Equally contributing corresponding authors Correspondence should be addressed to alejandro.schaffer@nih.gov and eytan.ruppin@nih.gov. Physical address: Cancer Data Science Laboratory, National Cancer Institute, Bldg. 15-C1, Bethesda, MD 20892 USA Keywords: targeted cancer therapy, combination therapy, personalized medicine, combinatorial optimization, hitting set, single-cell transcriptomics 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work.
It is not subject to copyright under 17 USC The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint
Abbreviations:
CTS: cohort target set, synonym of global hitting set
GEO: Gene Expression Omnibus
GHS: global hitting set, synonym of cohort target set
GTEx: Genotype-Tissue Expression (project or consortium)
HPA: Human Protein Atlas
HUGO: Human Genome Organization
IHS: individual hitting set, synonym of individual target set
ILP: integer linear programming
ITS: individual target set
lb: lower bound on fraction of tumor cells killed
RME: receptor-mediated endocytosis
TPM: transcripts per million
ub: upper bound on fraction of non-tumor cells killed
Abstract The availability of single-cell transcriptomics data opens new opportunities for rational design of combination cancer treatments in a systematic manner. Mining such data, we employed combinatorial optimization techniques to explore the landscape of optimal combination therapies in solid tumors, including brain, head and neck, melanoma, lung, breast and colon cancers. We assume that each individual therapy can target any one of 1269 genes encoding cell surface receptors, which may be targets of CAR-T, conjugated antibodies or coated nanoparticle therapies.
In most cancer types, personalized combinations composed of at most four targets are sufficient to kill at least 80% of the tumor cells while killing at most 10% of the non-tumor cells in each patient. The number of distinct targets needed to achieve this for all patients in 8 of the 9 cohorts we studied is at most 11, while one larger melanoma cohort requires over 30 distinct targets. Further requiring that the target genes be lowly expressed across many different healthy tissues uncovers qualitatively similar trends. However, as one requires either more stringent killing thresholds or more stringent sparing of non-cancerous tissues beyond these baseline values, the number of targets needed rises rapidly. Emerging promising targets include the gene PTPRZ1, which is frequently found in the optimal combinations for brain and head and neck cancers, and EGFR, a recurring target in multiple tumor types. In sum, this is the first systematic single-cell based characterization of the landscape of combinatorial receptor-mediated cancer treatments, identifying promising targets for future development. Introduction Personalized oncology offers hope that each patient's cancer can be treated based on its genomic characteristics1,2. Several trials have suggested that it is possible to collect genomics data fast enough to inform treatment decisions3-5. A meta-analysis of Phase I clinical trials completed during 2011-2013 showed that, overall, trials that used molecular biomarker information to influence treatment plans gave better results than trials that did not6.
However, most precision oncology treatments utilize only one or two medicines, and resistant clones frequently emerge, emphasizing the need to deliver personalized medicine as combinations of multiple agents6-11. Important opportunities to combine systems biology and the design of nanomaterials have been recognized as a way to deliver medicines in combination, overcome drug resistance and combine biological effects12. Here, we propose and rigorously study a new conceptual framework for designing future precision oncology treatments. It is motivated by the growing recognition that tumors typically have considerable intra-tumor heterogeneity (ITH)13,14 and thus need to be targeted with a combination of medicines such that as many tumor cells as possible are hit by at least one medicine. Our analysis is based on two recently emerging technologies: (1) the advancement of single-cell transcriptomics and proteomics measurements from patients’ tumors, which is anticipated to gradually enter into clinical use15, and (2) the introduction of “modular” treatments that target specific overexpressed genes/proteins to recognize cells in a specific manner and then use either the T cell immune response or a lethal toxin to kill the tumor cells preferentially. Based on these two foundations, we formulate and systematically answer two basic questions. First, how many targeted treatments are needed to selectively kill most tumor cells while sparing most of the non-tumor cells in a given patient? And second, given a cohort of patients to treat, how many distinct single-target treatments need to be prepared beforehand so that there is a combination that kills at least a specified proportion of the tumor cells of each patient?
We focus our analysis on genes that encode cell surface receptor proteins, as these may be precisely targeted by any one of at least six technologies: CAR-T therapy16, immunotoxins ligated to antibodies17-18, immunotoxins ligated to mimicking peptides19, conventional chemotherapy ligated to nanoparticles20, degraders associated with ubiquitin E3 ligases21 and designed ankyrin repeat proteins (DARPins)22,23. These treatments are all “modular”, including one part that specifically targets the tumor cell via one gene/protein and another part, the cytotoxic mechanism, that kills the cells. Two recent genome-wide analyses of modular therapies have focused on CAR-T therapy24,25, so we focus first on this technology to put our work in context. In the original formulation, CAR-T therapy used one cell surface target that marks the cells of interest, such as CD19 as a marker for B cells. To date, CAR-T therapy has been effective in achieving remissions for some blood cancers16,26, but less effective for solid tumors. MacKay et al.25 focused primarily on single targets, also looked at combinations of two targets, and did all analysis in silico. Dannenfelser et al.24 focused on predicting combinations of two and three targets and did most of their work in silico, with in vitro validation of two high-scoring predicted combinations in renal cancer. Importantly, these studies have analyzed bulk tumor and normal expression data to identify likely targets.
Here we present the first analysis that aims to identify modular targets based on the analysis of tumor single-cell transcriptomics. This enables us to study the research questions at a higher resolution but presents new analytical challenges that need to be addressed. Two related difficulties with CAR-T therapy are i) toxicity to non-cancer cells27,28 and ii) difficulty in finding single targets that are sufficiently selective25. To address the toxicity problem, MacKay et al.25 selected 533 targets that had low expression in most tissues in the Genotype-Tissue Expression (GTEx) data; however, their analysis did not require that the targets be cell surface proteins. We proceed in a stepwise manner; we start with a formal analysis of a space of 1269 candidate cell surface receptors. Then, we add a low-expression requirement like that of MacKay et al.25, parameterized by a transcripts per million (TPM) expression threshold. For completeness, we also tested their set of 533 genes. To address the selectivity problem, various groups have engineered composite forms of CAR-T treatments that implement Boolean AND, OR, and NOT gates that have been tested for combinations of up to three target proteins29-33. Both MacKay et al. and Dannenfelser et al. presented in silico methods focusing on AND gates and pairs or trios of targets; Dannenfelser et al. analyzed 2538 likely cell surface proteins that are not necessarily receptors. We have chosen
to focus on the simpler logical OR construction because that can be achieved not only by CAR-T technology30,31, but can also be implemented via other modular treatment technologies by combining multiple single-target treatments, assuming that the composite treatment kills a cell if any one of the single treatments kills the cell. Conceptually, such a logical OR combination treatment can still achieve selectivity by choosing targets, each of which is expressed on a much higher proportion of cancer cells than non-cancer cells. One of our key contributions is to show that by using techniques from combinatorial optimization, one can find such effective combinations involving a large number of targets, while previous studies were limited to at most three targets. Beyond CAR-T, our analysis applies to several additional types of modular treatment technologies that rely instead on receptor-mediated endocytosis (RME), in which a toxin enters the cell via a targeted receptor34,35. Like CAR-T, these RME-based technologies do not downregulate the target receptor. For RME technologies and other technologies that work intracellularly, we anticipate combining modular treatments from one technology such that all treatments use the same toxin or mechanism of cell killing, thereby mitigating the need to test for interaction effects between pairs of different treatments. To address these research questions, we designed and implemented a computational approach named MadHitter (after the Mad Hatter from Alice in Wonderland) to identify optimal precision combination treatments that target membrane receptors (Figure 1, A-C). We define three key parameters related to the stringency of killing the tumor and protecting the non-tumor cells and explore how the optimal treatments vary with those parameters (Figure 1B, C).
Solving this problem is analogous to solving the classical “hitting set problem” in combinatorial algorithms36, which is formally defined in the Methods (see also Supplementary Materials 1). Unlike the previous studies on CAR-T targets, we define the problem in a personalized manner, intending that each patient will get optimal treatments for her or his tumor from among a collection of available treatments. Figure 1. Conceptual schematic example of MadHitter analysis of single-cell transcriptomics data from three cancer patients. (A) A cohort of patients (three in this example) arrives for a study in which single-cell tumor microenvironment (TME) transcriptomics data are collected from each patient; the data are analyzed with MadHitter and each patient receives an optimal personalized combination of targeted therapies from a pre-specified set (pill bottle). MadHitter is aimed at optimizing combinations of targeted therapies that are modular, that is, having a recognition unit that is gene/protein-specific, and a joint killing subunit (similar for all gene targets). Icons of four such modular therapies are shown; for three of these, the target protein must be on the cell surface and for two it must be a receptor, so we focus our analyses on cell surface receptors. Three main algorithm parameters are denoted near the MadHitter icon in panel A and explained in the later panels. (B) The
single-cell TME data are represented in two matrices with the genes as rows and cells as columns, partitioned into tumor (T) and non-tumor (N) cells. The expression ratio r determines by how much a gene must be overexpressed in a cell for that cell to be considered targeted. A gene is considered ‘overexpressed’ in either a non-tumor cell or a tumor cell if its expression is at least r times the mean reference level; e.g., the reference level for FLT1 is (7+11+9)/3 = 9 and only cell T3 has FLT1 expression above 9×2 = 18. The matrices on the right side show a Boolean representation of which targets kill which cells, based on the expression values presented in this toy problem in matrix B and taking r=2. Accordingly, the combination of EGFR and KDR would kill all tumor cells and would spare all non-tumor cells. (C) The main algorithm in MadHitter seeks a combination of targets that is as small as possible and would kill many tumor cells and few non-tumor cells, in a patient-specific manner. The 𝑙𝑏 and 𝑢𝑏 parameters are the lower bound on the fraction of tumor cells killed and the upper bound on the fraction of non-tumor cells whose killing is tolerated, respectively. Baseline settings used in our analyses are 𝑟 = 2, 𝑙𝑏 = 0.8 and 𝑢𝑏 = 0.1, and are varied in some of the analyses. The right side of the panel shows a hypothetical example of the tradeoff between killing tumor cells and sparing non-tumor cells. While target set A could kill a larger fraction of tumor cells than target set B, MadHitter would select target set B, since only B satisfies both baseline settings, including killing at most a 0.1 fraction of the non-tumor cells.
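The thresholding rule described in the Figure 1B caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the MadHitter implementation: the function name `kill_matrix`, the gene label `GENE_X`, and every expression value other than the FLT1 reference values 7, 11 and 9 are hypothetical, and we assume those three values belong to the non-tumor cells. The reference level is taken as the mean over non-tumor cells with non-zero expression, following the definition of E(g) given later in the Results.

```python
import numpy as np


def kill_matrix(expr, non_tumor_cols, r=2.0):
    """Boolean genes-by-cells matrix: True where targeting the gene kills the cell.

    A cell counts as killed by gene g if its expression of g is at least
    r times the mean non-zero expression of g in the non-tumor cells.
    """
    expr = np.asarray(expr, dtype=float)
    ref = expr[:, non_tumor_cols]
    nonzero = ref > 0
    counts = nonzero.sum(axis=1)
    sums = np.where(nonzero, ref, 0.0).sum(axis=1)
    # Genes never expressed in non-tumor cells get an infinite reference
    # level here, so they kill nothing (an assumption of this sketch).
    reference = np.full(expr.shape[0], np.inf)
    np.divide(sums, counts, out=reference, where=counts > 0)
    return expr >= r * reference[:, None]


# Columns: three non-tumor cells (N1-N3), then three tumor cells (T1-T3).
genes = ["FLT1", "GENE_X"]       # GENE_X is a hypothetical second target
expr = [
    [7, 11, 9, 10, 12, 20],      # FLT1: reference (7+11+9)/3 = 9, threshold 18
    [1, 1, 1, 4, 5, 0],          # GENE_X: reference 1, threshold 2
]
kills = kill_matrix(expr, non_tumor_cols=[0, 1, 2], r=2.0)
tumor_killed = kills[:, 3:].any(axis=0)      # logical OR over the chosen targets
non_tumor_killed = kills[:, :3].any(axis=0)
print(kills.astype(int))
```

In this toy matrix FLT1 kills only T3, and combining it with the hypothetical second target kills every tumor cell while sparing every non-tumor cell, which is the logical-OR behavior the caption describes.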
Results The Data and the Combinatorial Optimization Framework We focused our search for optimal treatment combinations on nine single-cell RNAseq data sets that include tumor cells and non-tumor cells from at least three patients and that were publicly available at the onset of our investigation (Methods; Table 1). Those data sets include four brain cancer data sets and one each from head and neck, melanoma, lung, breast and colon cancers. Most analyses were done for all data sets, but for clarity of exposition, the main text focuses on four data sets from four different cancer types (brain, head and neck, melanoma, lung) that are larger than the other five and hence make the optimization problems more challenging. Results on the other five data sets are provided in Supplementary Materials 2. Analyzing each of these data sets separately, we ask how many targets are needed to kill most cells of a given tumor and what the tradeoff is between cancer cells killed and non-cancer cells spared. Figure 2 shows a small schematic example in which there are alternative target sets of sizes two and three. One would prefer the target set of size two because the patients would need to receive only two distinct treatments rather than three. Figure 2. A small schematic example of killing four tumor cells, illustrating why choosing a minimum-size combination of targets may be non-trivial. The schematic tumor has four cancer cells (A, B, C, D in separate columns), which may express any of five cell-surface receptor genes (rows) that may be targeted selectively by modular treatments (pills).
If one targets {APP, KDR, MET}, all cancer cells will be killed (left panel). However, if one instead targets {CR2, TEK}, then all cancer cells in the given example tumor will be killed with just two targets (right panel) instead of three, providing a smaller solution. To formalize our questions as combinatorial optimization hitting set problems, we define the following parameters and baseline values and explore how the optimal answers vary as functions of these parameters: We specify a lower bound on the fraction of tumor cells that should be killed, 𝑙𝑏, which ranges from 0 to 1. Similarly, we define an upper bound on the fraction of non-tumor cells killed, 𝑢𝑏, which also ranges from 0 to 1. Our baseline settings are 𝑙𝑏 = 0.8 and 𝑢𝑏 = 0.1. To represent the concept that only cells that overexpress the target are killed, we introduce an additional parameter 𝑟. The expression ratio 𝑟 defines which cells are killed, as follows (Figure 1B): Denote the mean expression of a gene 𝑔 in non-cancer cells that have non-zero expression by 𝐸(𝑔). A given cell is considered killed if gene 𝑔 is targeted and its expression level in that cell is at least 𝑟 × 𝐸(𝑔). Higher values of 𝑟 thus model more selective killing. Having 𝑟 as a modifiable parameter anticipates that in the future one could experimentally tune the overexpression level at which cell killing occurs24.
In this respect, technologies that rely on RME to get a toxin into the cell are particularly tunable because there is known to be a non-linear relationship between the number of protein copies on the cell surface and the probability that RME occurs successfully37. In these technologies, the toxin or other therapy delivered by the modular treatment enters cells in a gene-specific manner38, while CAR-T therapy activates T-cell killing against cells in a gene-specific manner24,25. For most of our analyses, the expression ratio 𝑟 is varied from 1.5 to 3.0, with a baseline of 2.0, based on experiments in the lab of N.A. and related to combinatorial chemistry modeling37; in one analysis, we varied 𝑟 up to 5.0 (Supplementary Table S1). Given these definitions, we solve the following combinatorial optimization hitting set problem (Methods): Given an input of a single-cell transcriptomics sample of non-tumor and tumor cells for each patient in a cohort of multiple patients, bounds 𝑢𝑏 and 𝑙𝑏, ratio 𝑟, and a set of target genes, we seek a solution that uses a minimum-size combination of targets in each individual patient, while additionally minimizing the size of the set of all targets given to the patient cohort. The latter is termed the global minimum-size hitting set (GHS) in computer science terminology or the cohort target set (CTS) in terminology specific to our problem, while the
optimal hitting set of genes targeting one patient is termed the individual target set (ITS). This optimum hitting set problem with constraints can be solved to optimality using integer linear programming (ILP) (Methods). We solve different optimization problem instances, each of which considers a different set of candidate target genes: 1269 genes encoding cell surface receptor proteins, a subset of 58 out of these 1269 genes that already have published ligand-mimicking peptides, and a nested collection of sets of 424-900 out of the 1269 genes that are lowly expressed below a series of decreasing gene expression thresholds25. From a computational standpoint, there is no inherent limit on the size of the candidate gene set. Our formulation is personalized, as each patient receives the minimum possible number of treatments. The global optimization comes into play only when there are multiple solutions of the same size to treat a patient. For example, suppose we have two patients such that patient A could be treated by targeting either {EGFR, FGFR2} or {MET, FGFR2} and patient B could be treated by targeting either {EGFR, CD44} or {ANPEP, CD44}. Then we prefer the CTS {EGFR, FGFR2, CD44} of size 3, and we treat patient A by targeting {EGFR, FGFR2} and patient B by targeting {EGFR, CD44}. As the number of cells per patient varies by three orders of magnitude across data sets, we use random sampling to obtain hitting set instances that are of comparable sizes and yet adequately capture tumor heterogeneity. We found that sampling hundreds of cells from the tumor is sufficient to get enough data to represent all cells. In most of the experiments shown, the number of cells sampled, which we denote by 𝑐, was 500. In some smaller data sets, we had to sample smaller numbers of cells (Methods). As shown in Supplementary Materials 2 (Figures S1-S2), 500 cells, when available, are roughly sufficient for CTS size to plateau for our baseline parameter settings, 𝑙𝑏 = 0.8, 𝑢𝑏 = 0.1, 𝑟 = 2.0.
For each individual within a data set, we performed independent sampling of 𝑐 cells 20 times and summarized the results. Cohort and Individual Target Set Sizes as Functions of Tumor Killing and Non-Tumor Sparing Goals Given the single-cell tumor data sets and the ILP optimization framework described above, we first studied how the resulting optimal cohort target set (CTS) may vary as a function of the parameters defining the optimization objectives in different cancer types. Figures 3 and S3-S7 in Supplementary Materials 3 show heatmaps of CTS sizes when varying 𝑙𝑏, 𝑢𝑏, and 𝑟 around the baseline values of 0.8, 0.1, and 2.0, respectively. The CTS sizes for melanoma were largest, partly due to the larger number of patients in that data set (Table 1). Indeed, when we sampled subsets of 5 or 10 patients uniformly, we observed that the mean CTS sizes grew from 7.9 (5-patient subsets) to 12.3 (10-patient subsets) to 31.0 (all patients, as shown in Figure 3). Encouragingly, for most data sets and parameter settings, the optimal CTS sizes are in the single digits. However, in several data sets, we observe a sharp increase in CTS size as 𝑙𝑏 values are increased above 0.8 and/or as 𝑢𝑏 is decreased below 0.1, with a more pronounced effect of varying 𝑙𝑏. This transition is more discernible at the lowest value of 𝑟 (1.5), probably because when 𝑟 is lower, it becomes harder to find genes that are individually selective in killing tumor cells and sparing non-tumor cells (Supplementary Figures S3-S7).
The qualitative transition observed in CTS sizes occurs robustly regardless of the threshold for filtering out low-expressing cells when preprocessing the data (Supplementary Materials 4, Figures S8-S10). Figure 3. Heat maps showing how the cohort target set (CTS) size varies as a function of 𝒍𝒃, 𝒖𝒃, 𝒓 and across data sets. For each plot the x-axis and y-axis represent 𝑙𝑏 and 𝑢𝑏 parameter values, respectively. The scale on the right shows the cohort target set sizes by color scale. We show separate plots for 𝑟 = 2.0, 3.0 here and a larger set {1.5, 2.0, 2.5, 3.0} in Supplementary Materials 3. Individual values are not necessarily integers because each value represents the mean of 20 replicates of sampling 𝑐 cells, with 𝑐 = 500 for each of the data sets shown here (Figure S1). We next examined the individual target set (ITS) sizes obtained in the optimal combinations under the same conditions. In all data sets, the mean ITS sizes are in the single digits for most values of 𝑙𝑏 and 𝑢𝑏. The distributions of ITS sizes are shown for four data sets and two combinations of (𝑙𝑏, 𝑢𝑏) (Figure 4) and for additional data sets in Supplementary Materials 5, Figure S11. Overall, the mean ITS sizes with the baseline parameter values (𝑟 = 2.0, 𝑙𝑏 = 0.8, 𝑢𝑏 = 0.1) range from 1.0 to 3.91 among the nine data sets studied (Supplementary Table S2); hence, on average, 4 targets per patient should suffice if enough single-target treatments are available in the cohort target set. However, there is considerable variability across patients.
Evidently, as we make the treatment requirements more stringent (by increasing 𝑙𝑏 from 0.8 to 0.9 and decreasing 𝑢𝑏 from 0.1 to 0.05), the variability in ITS size across patients becomes larger. Importantly, this analysis provides rigorous quantifiable evidence grounding the prevailing observation that among tumors of the same type, some individual tumors may be much harder to treat than others. Taken together, these results show that we can compute precise estimates of the number of targets needed for cohorts (in the tens) and individual patients (in the single digits usually) and that these estimates are sensitive to the killing stringency, especially when 𝑙𝑏 increases above 0.8. The variation for more aggressive killing regimes, with values of 𝑙𝑏 up to 0.99 for the baseline 𝑟 = 2.0, is displayed in Figures S12-S13 in Supplementary Materials 6. For fixed 𝑙𝑏 = 0.8, 𝑢𝑏 = 0.1 and varying 𝑟, the smallest CTS sizes are typically obtained for 𝑟 values close to 2.0, further motivating our choice of 𝑟 = 2 as the default value (Supplementary Materials 7, Figures S14-S15, Supplementary Table S1). Finally, we show that, as expected, a ‘control’ greedy heuristic algorithm searching for small and effective target combinations finds ITS sizes substantially larger than the optimal ITS sizes identified using our optimization algorithm (Figure 4). The greedy CTS size is greater than the ILP optimal CTS size for eight out of nine data sets (Table S2 in Supplementary Materials 8, Methods).
Figure 4. The distribution of optimal and greedy individual treatment combination (ITS) sizes in four different cancer types. We study both our baseline parameter setting (upper row panels) and a markedly more stringent one (middle row plots). For the more stringent parameter setting, we compare the ITS sizes obtained using MadHitter (middle row plots) and a greedy algorithm that tries to add pairs of genes at a time (bottom row plots). In each plot, the patients are sorted from left to right according to their mean ITS values in the optimal stringent regime. Additional comparisons between ITS sizes at different parameter settings can be found in Supplementary Materials 5. A description of the greedy algorithm and more comparisons between the optimal and greedy algorithms are provided in Supplementary Materials 8.

The Landscape of Combinations Achievable with Receptors Currently Targetable by Published Ligand-Mimicking Peptides

To get a view of the combination treatments that are possible with receptor targets for which there are already existing modular targeting reagents, we conducted a literature search identifying 58 out of the 1269 genes with published ligand-mimicking peptides that have already been tested in in vitro models, usually cancer models (Methods; Tables 3 and 4). We asked whether we could find feasible optimal combinations in this case and, if so, how the optimal CTS and ITS sizes compare with those computed for all 1269 genes.

Figure 5. Comparison of Individual Target Set Sizes with 1269 or 58 targets for three out of the six data sets that have feasible solutions. We attempted to find feasible solutions for all patients using 58 cell surface receptors that have published ligand-mimicking peptides that have been tested in vitro or in pre-clinical models. There are feasible solutions for all patients in six data sets, but not for the brain (GSE84465), melanoma (GSE115978), and lung (E-MTAB-6149) data sets, which were displayed in previous figures. Instead, we show here results for breast and colorectal cancers, for which other analyses, such as those in Figures 3 and 4, are in the Supplementary Materials. Some of the optimal solutions obtained on the restricted set of 58 receptors are of the same size as those obtained on the whole receptor set and some are larger.

Computing the optimal CTS and ITS solutions for this basket of 58 targets, we found feasible solutions for six of the data sets across all parameter combinations we surveyed and
three of these six are illustrated for each patient in Figure 5. However, for three data sets, in numerous parameter combinations we could not find optimal solutions that satisfy the optimization constraints (Supplementary Materials 9, Figures S16-S18). That is, the currently available targets do not allow one to design treatments that may achieve the specified selective killing objectives, underscoring the need to develop new targeted cancer therapies, to make personalized medicine more effective for more patients.

Overall, comparing the optimal solutions obtained with 58 targets to those we have obtained with the 1269 targets, three qualitatively different behaviors are observed (Supplementary Materials 9, Figures S16-S18): (1) In some data sets, it is only slightly more difficult to find optimal ITS and CTS solutions with the 58-gene pool, while in others, the restriction to a smaller pool can be a severe constraint making the optimization problem infeasible. (2) The smaller basket of gene targets may force more patients to receive similar individual treatment sets and thereby reduces the size of the CTS. (3) Unlike the CTS size, the ITS size must stay the same or increase when the pool of genes is reduced, because we find the optimal ITS size for each patient. Overall, the average ITS sizes across each cohort using the pool of 58 genes for baseline settings range from 1.16 to 4.0. Among cases that have any solution, the average increases in the ITS sizes at baseline settings in the 58-gene case versus the 1269-gene case were moderate, ranging from 0.16 to 1.33.
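
Behavior (3) above, and the possibility of infeasibility under a restricted pool, can be illustrated with a small brute-force sketch. This is not the MadHitter ILP: it enumerates subsets exhaustively and is only practical for toy inputs. It assumes that a cell counts as killed when it selectively overexpresses at least one chosen target; the receptor names `A` to `D`, the toy cells, and the function name `min_its` are hypothetical.

```python
from itertools import combinations


def min_its(tumor_cells, normal_cells, pool, lb=0.8, ub=0.1):
    """Return a smallest target set from `pool` that hits at least a fraction
    `lb` of tumor cells and at most a fraction `ub` of non-tumor cells,
    or None if no subset of the pool is feasible (brute force, toy sizes only)."""

    def frac_hit(cells, targets):
        # Assumption: a cell is hit if it overexpresses any chosen target.
        return sum(1 for cell in cells if cell & targets) / len(cells)

    pool = sorted(pool)
    for k in range(1, len(pool) + 1):
        for combo in combinations(pool, k):
            targets = frozenset(combo)
            if (frac_hit(tumor_cells, targets) >= lb
                    and frac_hit(normal_cells, targets) <= ub):
                return targets
    return None


# Hypothetical patient: each cell is the set of receptors it overexpresses.
tumor = [{"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "C"}, {"D"}]
normal = [{"A"}, {"D"}] + [set() for _ in range(8)]

full = min_its(tumor, normal, {"A", "B", "C", "D"})   # single target suffices
restricted = min_its(tumor, normal, {"B", "C", "D"})  # needs two targets
tiny = min_its(tumor, normal, {"D"})                  # infeasible
print(full, restricted, tiny)
```

In this toy patient, removing receptor `A` from the pool raises the minimum ITS size from 1 to 2, and shrinking the pool to `D` alone leaves no feasible solution, mirroring the qualitative behaviors described above.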

Optimal Fairness-Based Combination Therapies for a Given Cohort of Patients

Until now we have adhered to a patient-centered approach that aims to find the minimum-size ITS for each patient, first and foremost. We now study a different, cohort-centered approach, where given a cohort of patients, we seek to minimize the overall CTS size, while allowing for some increase in the ITS sizes. The key question is how much larger the resulting ITS sizes are if we optimize for the cohort (CTS size) rather than for the individuals (ITS size). This challenge is motivated by a ‘fairness’ perspective (Supplementary Materials 1), where we seek solutions that are beneficial for the entire community from a social or economic perspective (in terms of CTS size) even if they are potentially sub-optimal at the individual level (in terms of ITS sizes). Here, the potential benefit is economic since running a basket trial would be less expensive if one reduces the size of the basket of available treatments (Figure 6A-B).

We formalized this ‘fair CTS problem’ by adding a cost parameter 𝛼 that specifies the limit on the excess number of (ITS) targets selected for any individual patient, compared to the number selected in the individual-based approach that was studied up until now (formally, the latter corresponds to setting 𝛼 = 0). We formulated and solved via ILP this fair CTS problem for up to 1269 possible targets on all nine data sets (Methods). We fixed 𝑟 = 2 and 𝑢𝑏 = 0.1 while varying 𝛼 and 𝑙𝑏.
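
The role of 𝛼 in the fair CTS problem can be sketched on the schematic cohort of Figure 6. The snippet below is a brute-force illustration, not the ILP used in the paper. It makes two simplifying assumptions: every tumor cell must be hit (that is, 𝑙𝑏 = 1), and non-tumor cells are ignored. The per-cell receptor layout is a hypothetical one chosen to be consistent with the Figure 6 caption, and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical layout: three patients, two tumor cells each; every cell is the
# set of receptors it overexpresses.
patients = {
    "P1": [{"APP", "Target1"}, {"MET", "Target1"}],
    "P2": [{"APP", "Target2"}, {"MET", "Target2"}],
    "P3": [{"APP", "Target3"}, {"MET", "Target3"}],
}
targets = sorted({t for cells in patients.values() for cell in cells for t in cell})


def covers(subset, cells):
    """True if every cell overexpresses at least one target in `subset`."""
    return all(cell & subset for cell in cells)


def min_cover_size(cells, pool):
    """Size of the smallest subset of `pool` hitting all cells, or None."""
    for k in range(1, len(pool) + 1):
        if any(covers(set(c), cells) for c in combinations(pool, k)):
            return k
    return None


def fair_cts(patients, targets, alpha):
    """Smallest cohort target set such that each patient can be treated with
    at most alpha targets more than that patient's individual optimum."""
    best = {p: min_cover_size(cells, targets) for p, cells in patients.items()}

    def within_budget(p, cells, cts):
        m = min_cover_size(cells, cts)
        return m is not None and m <= best[p] + alpha

    for size in range(1, len(targets) + 1):
        for cts in combinations(targets, size):
            if all(within_budget(p, cells, cts) for p, cells in patients.items()):
                return set(cts)
    return None


print(fair_cts(patients, targets, alpha=0))  # patient-specific targets, CTS size 3
print(fair_cts(patients, targets, alpha=1))  # shared pair, CTS size 2
```

With 𝛼 = 0 each patient keeps an ITS of size 1 and the cohort needs three targets; allowing one extra target per patient (𝛼 = 1) lets the whole cohort share {APP, MET}, which is the tradeoff the fairness formulation is meant to expose.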
Figure 6C and Figures S19-S23 in Supplementary Materials 10 show the optimal CTS and ITS sizes for 𝛼 = 0, ..., 5.

Figure 6. A schematic example demonstrating the rationale and workings of fairness-based solutions. (A, B) Let us assume that each of three patients has two tumor cells (columns), each displaying five membrane receptors that are highly expressed only on the tumor cells and not on the non-tumor ones (rows). If we target {APP, MET} (panel A, 𝛼 = 1) in all patients, then this achieves a CTS size of 2, which is the minimum possible. Employing the original individual-based optimizing objective, each patient could instead be treated by an ITS of size 1 by targeting the distinct receptors called Target 1 (specific to Patient 1), Target 2 and Target 3, respectively, but this would result in an optimal CTS of size 3 (panel B, 𝛼 = 0). The solution in panel A has an unfairness value 𝛼 = 1 because the worst difference among all patients is that a patient receives 1 more treatment than necessary. (C) Heatmaps showing how the CTS size varies as 𝛼 increases (y-axis), starting from its baseline value of 0 where each patient is assigned a minimum-size individual treatment set (top row).
The lower bound on tumor cells killed (x-axis) is also varied while the upper bound on non-tumor cells killed is kept fixed at 0.1. We are particularly interested in finding the smallest value on the y-axis at which the CTS size reaches its minimum value, which is circled for the baseline 𝑙𝑏 = 0.8, because this bounds the tradeoff between the achievable reduction in the number of targets needed to treat the whole cohort and the number of extra targets above the ITS minimum that any patient might need to receive.

For 8 out of 9 data sets, we encouragingly find that the unfairness cost parameter 𝛼 is bounded by a constant of 3; i.e., it is sufficient to increase 𝛼 by no more than 3 to obtain the smallest CTS sizes in the optimally fair solutions. For the largest data set (melanoma), 𝛼 = 4. As we show in Supplementary Materials 10, empirically, even if one requires lower 𝛼 values, then as those approach 0, the size of the fairness-based CTS grows fairly moderately and remains in the lower double digits, and the mean number of treatments given to each patient (their ITS size) remains below 5. Theoretically, we show that one can design instances for which 𝛼 would need to be at least √𝑛 − 1 to get a CTS of size less than the overall number of targets 𝑛 (Supplementary Materials 10). However, in practice, we find that given the current tumor single-cell expression data, fairness-based treatment strategies are likely to be a reasonable economic option in the future.

The Landscape of Optimal Solutions Targeting Receptors that are Lowly Expressed Across Many Healthy Tissues

We turn to examine the space of optimal solutions when restricting the set of eligible surface receptor gene targets to those that have lower expression across many noncancerous human tissues (Methods), aiming to mitigate potential damage to tissues unrelated to the tumor site.
To this end, we selected subsets of the 1269 cell surface receptor targets in which the genes have overall lower expression across multiple normal tissues, by mining GTEx and the Human Protein Atlas (HPA) (Methods). Varying the selectivity expression thresholds (expressed in transcripts per million (TPM)) used to filter out genes whose mean expression across the normal adult tissues is above values of 10, 5, 2, 1, 0.5, and 0.25 (i.e., employing more and more extensive filtering as this threshold is decreased) decreases the size of the target cell surface receptor gene list by more than half (Table 2). As shown in Figures 7A, B (and Supplementary Figures S24-S26), MadHitter identifies very different cohort target sets (which are larger than the original optimal solutions, as expected) as the TPM selectivity threshold value is decreased. Furthermore, different ITS instances may become infeasible (Supplementary Figure S27). At an individual patient level, using lower selectivity threshold levels, which leads to a smaller space of membrane receptors to choose from, also leads to increased mean ITS sizes (Supplementary Figures S28, S29). Across the nine data sets, the selectivity threshold at which the CTS problem became infeasible varied (Supplementary Figure S27). The differences observed could be the result of expression heterogeneity of the cancer, the number of patients within the data set, the size of the target gene set, lack of expression of available gene targets, and other unknown factors.
In the future, further experimentation is required to identify tissue-specific optimal gene expression thresholds that will minimize side effects while allowing cancer cells to be killed by combinations of targeted therapies.

Finally, for completeness, we also tested MadHitter on the set of 533 lowly expressed genes suggested by MacKay et al.25. All instances with the default settings of 𝑟, 𝑙𝑏, 𝑢𝑏 have feasible solutions for all patients. Mean ITS sizes are below 4 for eight of nine data sets, but close to 10 for the brain cancer data set GSE84465. More details can be found in Supplementary Materials 11 and Table S3.

Figure 7. Variation in the CTS size and composition as a function of the magnitude of filtering of genes expressed in noncancerous human tissues, for different tumor types. (A-B) The number of times a gene (cell-surface receptor) is included in the CTS (out of 20 replicates, which is therefore the Max count in panels A-B), where each column presents the CTS solutions when the input target gene sets are filtered using a specific TPM filtering threshold (Methods), for (A) a breast cancer and (B) a brain cancer data set. These data sets were selected due to their relatively small cohort target set sizes, permitting their visualization. (C-F) Circos plots of the genes occurring most frequently in optimal CTS solutions (length of arc along the circumference) and their pairwise co-occurrence (thickness of the connecting edge) for the four main cancer types, in our original target space of 1269 genes encoding cell-surface receptors. For each data set, we sampled up to 50 optimal CTS solutions.
Network representations of the 12 most common target genes out of 1269 encoding cell-surface receptors (with greater than 5% frequency of occurrence) are shown in a cancer-specific manner for (C) brain cancer, (D) head and neck cancer (only seven genes have a frequency of 5% or more across optimal solutions), (E) melanoma, and (F) lung cancer. Genes and connections have distinct colors for improved visibility.

Key Targets Composing Optimal Solutions Across the Space of 1269 Membrane Receptors

To identify the genes that occur most often in optimal solutions for our baseline settings, since there may be multiple distinct optimal solutions composed of different target genes, we sampled up to 50 optimal solutions for each optimization instance solved and recorded how often each gene occurs and how often each pair of genes occurs together (Methods). We analyzed and visualized these gene (co-)occurrences in three ways. First, we constructed co-occurrence Circos plots in which arcs around the circle represent frequently occurring genes and edges connect targets that frequently co-occur in optimal CTS solutions. Figure 7C-F shows the co-occurrence visualizations for optimal CTS solutions obtained with the original, unfiltered target space of 1269 genes and in baseline parameter settings. The genes frequently occurring in optimal solutions are quite specific and distinct between different cancer types. In melanoma, the edges form a clique-like network because virtually all optimal solutions include the same clique of 12 genes (Figure 7E).
The head and neck cancer data set has only one commonly co-occurring pair {GPR87, CXADR} (Figure 7D). Of the cancer types not depicted in Figure 7, the breast cancer data set has a commonly co-occurring set of size 4, {CLDN4, INSR, P2RY8, SORL}, and the colorectal cancer data set has a different commonly co-occurring set of size 4, {GABRE, GPRR, LGR5, PTPRJ} (data not shown).

We next tabulated sums of how often each gene occurred in optimal solutions for all nine data sets (Supplementary Materials 12, Tables S4, S5 and S6), obtained when solving for either 58 gene targets or 1269 gene targets. Strikingly, one gene, PTPRZ1 (protein tyrosine phosphatase receptor zeta 1), appears far more frequently than others, especially in three brain cancer data sets (GSE70630, GSE89567, GSE102130, Supplementary Table S6). PTPRZ1 also occurs commonly in optimal solutions for the head and neck cancer data set (Figure 7D). The brain cancer finding coincides with previous reports that PTPRZ1 is overexpressed in glioblastoma (GBM)39,40. PTPRZ1 also forms a fusion with the nearby oncogene MET in some brain tumors that have an overexpression of the fused MET41. Notably, various cell line studies and mouse studies have shown that inhibiting PTPRZ1, for example by shRNAs, can slow glioblastoma tumor growth and migration42,43. There have been some attempts to inhibit PTPRZ1 pharmacologically in brain cancer and other brain disorders44,45.
In the four brain cancer data sets, PTPRZ1 is expressed selectively above the baseline 𝑟 = 2.0 in 0.99 (GSE89567), 0.84 (GSE70630), 0.96 (GSE102130) and 0.27 (GSE84465) proportion of cells in each cohort. The much lower relative level of PTPRZ1 expression in GSE84465 is likely due to the heterogeneity of brain cancer types in this data set46.

Among the 58 genes with known ligand-mimicking peptides, EGFR stands out as most common in optimal solutions (Supplementary Table S4). Even when all 1269 genes are available, EGFR is most commonly selected for the brain cancer data set (GSE84465) in which PTPRZ1 is not as highly overexpressed (Figure 7C).

PTPRZ1 was the fifth most frequently occurring gene in optimal solutions for the head and neck cancer data set (GSE103322). The two most common genes by a large margin are CXADR and GPR87. CXADR has been studied primarily by virologists and immunologists because it encodes a receptor for coxsackieviruses and adenoviruses47. In one breast cancer study, CXADR was found to play a role in regulating PTEN in the AKT pathway, but CXADR was underexpressed in breast cancer48 whereas it is overexpressed in the head and neck cancer data we analyzed. GPR87 is a rarely studied G protein-coupled receptor with an unknown natural ligand49. In the context of cancer, GPR87 has previously been reported as overexpressed in several tumor types including lung and liver49, and its overexpression may play an oncogenic role via either the p53 pathway50, the NFκB pathway51, or other pathways.

Finally, we analyzed the set of genes in optimal solutions via the STRING database and associated tools52 to perform several types of gene set and pathway enrichment analyses. Figures S30-S33 (Supplementary Materials 12) show STRING-derived protein-protein interaction networks for the 25 most common genes in the same four data sets for which we showed co-occurrence graphs in Figure 7C-F.
Again, EGFR stands out as being a highly connected protein node in the solution networks for both the brain cancer and head and neck cancer data sets. Among the 30 genes in the 1269-gene set most commonly occurring in optimal solutions (Supplementary Table S5), there are six kinases (out of 88 total human transmembrane kinases with a catalytic domain, STRING gene set enrichment 𝑝 < 1e−6), namely {EGFR, EPHB4, ERBB3, FGFR1, INSR, NTRK2} and two phosphatases {PTPRJ, PTPRZ1}. The KEGG pathways most significantly enriched, all at 𝐹𝐷𝑅 < 0.005, are “proteoglycans in cancer”, represented by {CD44, EGFR, ERBB3, FGFR1, PLAUR}, “adherens junction”, represented by {EGFR, FGFR1, INSR, PTPRJ}, and “calcium signaling pathway”, represented by {EDNRB, EGFR, ERBB3, GRPR, P2RX6}. The one gene in the intersection of all these pathways and functions is EGFR.

Discussion

In this multi-disciplinary study, we harnessed techniques from combinatorial optimization to analyze publicly available single-cell tumor transcriptomics data to chart the landscape of future personalized combinations that are based on ‘modular’ therapies, including CAR-T therapy. We showed that, for most tumors we studied, four modular medications targeting different overexpressed receptors may suffice to selectively kill most tumor cells, while sparing most of the non-cancerous cells (Figures 3 and 4 and Table S2).
For the more restricted sets of low-expression genes25 or the 58 receptors with validated ligand-mimicking peptides (Tables 3 and 4), some patients do not have feasible solutions, especially as we reduce the TPM expression threshold used for filtering the gene set to avoid targeting non-cancerous tissues. These findings indicate, on one hand, that researchers designing ligand-mimicking peptides have been astute in choosing targets relevant to cancer. On the other hand, these results suggest that there is a need for extending the set of cell surface receptors that can be targeted to enter tumor cells with ligated chemotherapy agents.

Remarkably, we found that if one designs the optimal set of treatments for an entire cohort adopting a fairness-based policy, then the sizes of the projected treatment combinations for individual patients are at most 3 targets larger, and in most data sets at most 1 target receptor larger, than the optimal solutions that would have been assigned for these patients based on an individual-centric policy (Figure 6, Supplementary Materials 10). This suggests that the concern that the personalized treatment for any individual will be suboptimal solely because that individual happens to have registered for a cohort trial appears to be tightly bounded.

Like the study of MacKay et al.25, our study is a conceptual computational investigation. We studied nine single-cell expression data sets for the first time, but it would be helpful to analyze more and larger data sets in the future.
Even among four data sets of the same (brain) cancer type, we observed considerable variability in CTS and ITS sizes. Since our approach is general and the software is freely available as source code, other researchers can test the method on new data sets or add new variables, constraints, and optimality criteria. These investigations may of course lead others to further improve our method and broaden its applicability. In future work, we plan to apply our approach to study ways for selectively killing specific populations of immune cells, such as myeloid-derived suppressor cells, because they inhibit tumor killing, while sparing most other non-cancer cells.

We compared gene expression levels between non-cancer cells and cancer cells sampled from the same patient, which avoids inter-patient expression variability53. However, we did little to account for “expression dropout” beyond the normalization performed by the providers of the data sets, aiming to preserve the public data as it was submitted to GEO or ArrayExpress. To achieve some uniformity and also to take as cautious an approach as possible, we added a step to filter low expressing cells because some data sets had already been filtered in this way. One could instead apply imputation methods such as MAGIC54 or scImpute55 to infer denser gene expression matrices and then apply our method to the adjusted input data. Therefore, our results about the sizes of optimal ITS should be viewed as estimated upper bounds that are likely to decrease if the dropout rate decreases or if cells expressing few genes are eliminated from the analysis more stringently. Another limitation of our method is that we viewed the measured gene expression as being valid over all time, even though gene expression is known to be a stochastic process56.
In the future, we would like to extend our approach to use the single-cell data to infer how stochastic the expression of each gene is57 and to prefer targets whose expression is more stable.

Even though the combinatorial optimization problems solved here are in the worst case exponentially hard (NP-complete36 in computer science terminology), the actual instances that arise from the single-cell data could be either formally solved to optimality or shown to be infeasible with modern optimization software. Of note, Delaney et al. have recently formalized a related optimization problem in analysis of single-cell clustered data for immunology38. Their optimization problem is also NP-complete in the worst case and they could solve sets of up to size four using heuristic methods38. We have shown that the optimal ILP solutions we obtained are often substantially smaller than solutions obtained via a greedy heuristic (Figure 4, Supplementary Materials 8 including Table S2).

On the cautionary side, experiments with target gene sets that were further filtered by low expression in normal tissues showed that the individual target set problem can become infeasible in many instances. Even when the instance remained feasible, optimal cohort treatment set sizes increased rapidly as the expression levels allowed decreased (Figure 7), pointing to potential inherent limitations of applying such combination approaches to patients in the clinic and the need to carefully monitor their putative safety and toxicity in future applications.
Finally, functional enrichment analysis of genes commonly occurring in the optimal target sets reinforced the central role of the widely studied oncogene EGFR and other transmembrane kinases. We also found that the less-studied phosphatase PTPRZ1 is a useful target, especially in brain cancer.

In summary, this study is the first to harness combinatorial optimization tools to analyze emerging single-cell data to portray the landscape of feasible personalized combinations in cancer medicine. Our findings uncover promising membranal targets for the development of future oncology medicines that may serve to optimize the treatment of cancer patient cohorts in several cancer types. The MadHitter approach presented and the accompanying software made public can be readily applied to address additional fundamental related research questions and analyze additional cancer data sets as they become available.

Methods

Data Sets

We retrieved and organized data sets from NCBI’s Gene Expression Omnibus (GEO)58, Ensembl’s ArrayExpress59, and the Broad Institute’s Single Cell Portal (https://portals.broadinstitute.org/single_cell). Nine data sets had sufficient tumor and non-tumor cells and were used in this study; an additional five data sets had sufficient tumor cells only and were used in testing early versions of MadHitter. Suitable data sets were identified by searching scRNASeqDB60, CancerSEA61, GEO, ArrayExpress, Google Scholar, and the 10x Genomics list of publications (https://www.10xgenomics.com/resources/publications/). We required that each
data set contain measurements of RNA expression on single cells from human primary solid tumors of at least two patients and that the metadata be consistent with the primary data. We are grateful to several of the depositing authors of data sets for resolving metadata inconsistencies by e-mail correspondence and by sending additional files not available at GEO or ArrayExpress. We excluded blood cancers and data sets with single patients. When it was easily possible to separate cancer cells from non-cancer cells of a similar type, we did so.

The main task in organizing each data set was to separate the cells from each sample or each patient into one or more single files. Representations of the expression as binary, as read counts, or as normalized quantities such as transcripts per million (TPM) were retained from the original data. When the data set included cell type assignments, we retained those to classify cells as “cancer” or “non-cancer”, except in the data set of Karaayvaz et al.62 where it was necessary to reapply filters described in the paper to exclude cells expressing few genes and to identify likely cancer and likely non-cancer cells. If cell types were not distinguished, all cells were treated as cancer cells.

To achieve partial consistency in the genes included, we filtered data sets to include only those gene labels recognized as valid by the HUGO Gene Nomenclature Committee (http://genenames.org), but otherwise we retained whatever recognized genes the data submitters chose to include. After filtering out the non-HUGO genes, but before reducing the set of genes to 1269 or 900 or 424 or 58, we filtered out cells as follows. Some data sets came with low expressing cells filtered out.
To achieve some homogeneity, we filtered out any cells expressing fewer than 10% of all genes before we reduced the number of genes. In Supplementary Materials 4, we tested the robustness of this 10% threshold. Finally, we retained either all available genes from among our set of 1269 genes encoding cell-surface receptors, or the subset of those genes that met additional criteria on low expression or on the availability of ligand-mimicking peptides.

Table 1. Summary descriptions of single-cell data sets from solid tumors used either for analysis (9) or preliminary testing (5 additional). Data sets are ordered so that those from the same or similar tumor types are on consecutive rows. The first 13 data sets were obtained either from GEO or the Broad Institute Single Cell Portal, but the GEO code is shown. The data set on the last row was obtained from ArrayExpress. In some data sets that have both cancer and non-cancer cells, there may be samples for which only one type or the other is provided. Hence, the numbers in parentheses in the third and fourth columns may differ. Data set GSE115978 (reference 63) supersedes and partly subsumes GSE7256 (reference 64).
Data set code | Cancer type(s) | Cancer cells (samples) | Non-cancer cells (samples) | Clinical follow-up | Reference(s)
GSE75688 | Breast | 441(11) | -- | Metastasis or not | 65
GSE118389 | Breast | 804(6) | 314(6) | Metastasis or not | 62
GSE89567 | Brain (glioma) | 5097(10) | 1146(9) | No | 66
GSE103224 | Brain (glioma) | 23793(8) | -- | No | 61
GSE70630 | Brain (glioma) | 4044(6) | 303(6) | No | 67
GSE57872 | Brain (glioma) | 440(6) | -- | No | 68
GSE102130 | Brain (6 glioma and 3 glioblastoma) | 2858(9) | 94(5) | No | 69
GSE84465 | Brain (glioblastoma) | 1091(4) | 651(4) | No | 46
GSE81861 | Colorectal | 272(10) | 160(6) | No | 70
GSE103322 | Head and Neck | 2093(13) | 3197(15) | No | 71
GSE115978 | Melanoma | 2018(23) | 4334(32) | Yes, immunotherapy | 63,64
GSE118828 | Ovarian | 1415(11 primary), 973(5 metastasis) | 578(2) | No | 72
GSE67980 | Prostate | 124(21) | -- | Metastasis or not | 73
E-MTAB-6149 | Lung | 7351(5) | 2730(5) | No | 74

Sampling Process to Generate Replicates of Data Sets

As shown in Table 1, the number of cells available in the different single-cell data sets varies by three orders of magnitude. To compare the findings across different data sets and cancer types on a more equal footing, we sampled from the larger sets to reduce this difference to one order of magnitude. Sampling also mirrors real-world data collection, in which measurements from different samples may be obtained at different times. Suppose that a data set has $n$ genes and $m$ cells, comprising tumor cells and non-tumor cells. We want to select a subset of $m' < m$ cells. We select a set of $m'$ cells uniformly at random without replacement from among all cells.
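The uniform draw described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the sampling program distributed with MadHitter: the function names `sample_replicate` and `make_replicates`, the label strings, and the seed handling are all hypothetical choices made for the example.

```python
import random


def sample_replicate(cell_labels, m_prime, seed=None):
    """Draw m_prime cells uniformly at random without replacement.

    cell_labels maps a cell identifier to "tumor" or "non-tumor"
    (hypothetical label strings chosen for this sketch).
    Returns the sampled cells split by label.
    """
    if not 0 < m_prime < len(cell_labels):
        raise ValueError("need 0 < m_prime < number of cells")
    rng = random.Random(seed)
    # Sorting the identifiers makes the draw reproducible for a given seed.
    chosen = rng.sample(sorted(cell_labels), m_prime)
    tumor = [c for c in chosen if cell_labels[c] == "tumor"]
    non_tumor = [c for c in chosen if cell_labels[c] != "tumor"]
    return tumor, non_tumor


def make_replicates(cell_labels, m_prime, n_replicates=20, seed=0):
    """Generate independent replicates, each with its own derived seed."""
    rng = random.Random(seed)
    return [
        sample_replicate(cell_labels, m_prime, seed=rng.randrange(2**32))
        for _ in range(n_replicates)
    ]
```

Each replicate is drawn from the full set of cells, so the numbers of tumor and non-tumor cells differ from one replicate to the next.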
Then we partition the selected cells into $m'_t$ tumor cells and $m'_n$ non-tumor cells to define one replicate. In most of the computational experiments shown we used 20 replicates, and we report either the arithmetic mean or the entire distribution of quantities such as the CTS size.

Considering a previously defined set of target genes and of HPA gene expression across different normal tissues

The general aim of our methods is to target the cancer cells while sparing the adjacent non-cancer cells as much as possible. A related concern is that genes within the target set could be expressed at high levels in other normal tissues that are not part of the non-cancer cells from the tumor microenvironment included in the input data sets. One way to address this problem is to identify genes that have low expression in the majority of tissues and to use them to obtain a target set. This approach was pioneered in a recent paper on selecting gene targets suitable for CAR-T therapy25. The authors selected 533 candidate genes that they judged could be reasonable targets for CAR-T. They made this selection based on expression data from the Human Protein Atlas75 and the Genotype-Tissue Expression consortium (GTEx)76, both of which provide expression information from multiple tissues that was used to identify lowly expressed target genes. MacKay et al.25 used a threshold of 15 TPM units of expression (written in their work as log2(TPM+1) ≤ 4), but they allowed a small number of tissues to exceed this threshold. Instead, we used quantitative levels of expression for finer granularity in analysis, as described in the next subsection. One clinical difference is that we looked only at adult tissues because we are analyzing adult tumors, while CAR-T therapy can be used for either childhood or adult tumors. The reason to focus on cell-surface receptors, as suggested by Dannenfelser et al.24, is that CAR-T therapy requires a cell-surface target that may or may not be a receptor, antibody technologies require a cell-surface receptor, and the ligand-mimicking peptide nanotechnology that we summarized in the Introduction also requires cell-surface receptor targets.

Construction of target gene sets that are lowly expressed in normal tissues

To analyze the tissue specificity of the 1269 candidate target genes, RNAseq-based multiple-tissue expression data were obtained from the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/about/download; date: May 29, 2020). The HPA database includes expression values (in units of transcripts per million (TPM)) for 37 tissues from HPA (rna_tissue_hpa.tsv.zip)75 and 36 tissues from the Genotype-Tissue Expression consortium (rna_tissue_gtex.tsv.zip)76. Next, to identify target genes with low or no expression in the majority of adult human tissues, we identified, among the 1269 candidate genes, those whose average expression across tissues is below a certain threshold value (0.25, 0.5, 1, 2, 5, and 10 TPM) in both the HPA and GTEx data sets. Using the intersection of the low-expression candidate genes from the HPA and GTEx data sets, we generated lists of high-confidence targets. The size of the resulting high-confidence target gene sets varied from 424 genes (average expression across tissues less than 0.25 TPM) to 900 genes (average expression across tissues less than 10 TPM) (Table 2). While the total number of genes decreases slowly as the threshold is lowered, the decrease is much steeper if one excludes olfactory receptors and taste receptors (Table 2).
These sensory receptors are not typically considered as cancer targets, although a few of these receptors are selected in optimal target sets when there are few alternatives (Figure 7). MadHitter was run on all nine data sets using the expression information from the high-confidence gene lists.

Table 2: Size of high-confidence target gene sets for different thresholds.

Threshold on expression across tissues (TPM) | Size of gene set | No. of genes which are NOT olfactory (OR*) and taste receptors (TAS*) | No. of genes with ligand-mimicking peptides (intersection with Tables 3 and 4)
0.25 | 424 | 97 | 1
0.5 | 494 | 141 | 3
1 | 547 | 187 | 3
2 | 632 | 269 | 6
5 | 762 | 398 | 10
10 | 900 | 536 | 19

Assembling Lists of Membrane Target Genes

We are interested in the set of genes $G$ that i) have the encoded protein expressed on the cell surface, ii) for which some biochemistry lab has found a small peptide (i.e., an amino acid sequence of 5-30 amino acids) that can attach itself to the target protein and get inside the cell carrying a tiny cargo of a toxic drug that will kill the cell, and iii) encode proteins that are receptors. The third condition is needed because many proteins that reside on the cell surface are not receptors that can undergo RME. The first condition can be reliably tested using a recently published list of 2799 genes encoding human predicted cell surface proteins77; we reduced the list to 1269 by requiring that the proteins be receptors, which is necessary for RME-based therapies but not for CAR-T therapy24.
For condition ii), we found two review articles in the chemistry literature19-20 that list targets effectively meeting this condition. Intersecting the lists meeting conditions i) and ii) gave us 38 genes/proteins that could be targeted (Table 3). Most of the data sets listed in Table 1 had expression data on 1200-1220 of the 1269 genes because the list of 1269 includes many olfactory receptor genes that may be omitted from standard genome-wide expression experiments. Of the 14 data sets, 13 include all 38 genes in Table 3, but GSE57872 was substantially filtered and has only 10 of the 38 genes; since GSE57872 also lacks non-tumor cells, we did not use this data set in any analyses shown.

Because the latter review20 was published in 2017, we expected that there would now be additional genes for which ligand-mimicking peptides are known. We found 20 additional genes, and those are listed in Table 4. Thus, our target set analyses restricted to genes with known ligand-mimicking peptides use 58 = 38 + 20 targets.

Table 3. Single proteins that can be targeted by peptides based on references 19 and 20 and are expressed on the cell surface77. For easier correspondence with the gene expression data, the entries are listed in alphabetical order by gene symbol. In this table, we follow the clinical genetics formatting convention that proteins are in Roman and gene symbols are in italics.
Protein | Gene Symbol
APN/CD13 | ANPEP
APP | APP
PD-L1 | CD274
CD44 | CD44
P32/gC1qR | CD93
E-cadherin | CDH1
N-cadherin | CDH2
CD21 | CR2
EGFR | EGFR
EphA2 | EPHA2
EphB4 | EPHB4
HER2 | ERBB2
FGFR1 | FGFR1
FGFR2 | FGFR2
FGFR3 | FGFR3
FGFR4 | FGFR4
VEGFR1 | FLT1
VEGFR3 | FLT4
PSMA | FOLH1
GPC3 | GPC3
IL-10RA | IL10RA
IL-11Rα | IL11RA
IL-13Rα2 | IL13RA2
IL-6Rα | IL6R
GP130 | IL6ST
VEGFR2 | KDR
MUC18 | MCAM
Met | MET
MMP9 | MMP9
Thomsen-Friedenreich carbohydrate antigen | MUC1
NRP-1 | NRP1
PDGFRβ | PDGFRB
CD133 | PROM1
PTPRJ | PTPRJ
HSPG | SDC2
E-selectin | SELE
Tie2 | TEK
VPAC1 | VIPR1

Table 4. Single proteins that can be targeted by ligand-mimicking peptides but are not included in the two principal reviews that we consulted19-20 and are among the 1269 cell surface receptors77. Since the evidence that these 20 genes have ligand-mimicking peptides is scattered in the literature, we include at least one PubMed ID of a paper describing a suitable peptide.
Protein | Gene Symbol | At Least One PubMed ID
ActRIIB | ACVR2B | 28955765
CD163 | CD163 | 27563889
CXCR4 | CXCR4 | 19482312, 22523575
ephrin A4 | EPHA4 | 15681844, 22523575
ephrin B1 | EPHB1 | 15722342, 22523575
ephrin B2 | EPHB2 | 15722342, 22523575
ephrin B3 | EPHB3 | 15722342, 22523575
gonadotrophin releasing hormone receptor | GNRHR | 20814857, 22523575
G protein-coupled receptor 55 | GPR55 | 28029647
bombesin receptor 2 | GRPR | 20814857, 22523575
IL4 receptor | IL4R | 19012727
low density lipoprotein receptor | LDLR | 27656777
leptin receptor | LEPR | 19233229, 26265355
LRP1 | LRP1 | 29090274
melanocortin 1 receptor | MC1R | 22964391
melanocortin 4 receptor | MC4R | 17591746
CD206 | MRC1 | 30768279
urokinase plasminogen activator receptor | PLAUR | 25080049
neurokinin-1 receptor | TACR1 | 29498264
VPAC2 | VIPR2 | 30077368

Definition of the Minimum Hitting Set Problem and Solution Feasibility

One of Karp’s original NP-complete problems is called “hitting set” and is defined as follows36. Let $U$ be a finite universal set of elements. Let $S_1, S_2, S_3, \ldots, S_k$ be subsets of $U$. Is there a small subset $H \subseteq U$ such that, for $i = 1, 2, 3, \ldots, k$, $S_i \cap H$ is non-empty? In our setting, $U$ is the set of target genes and the subsets $S_i$ are the single cells. In reference 78, numerous applications for hitting set and the closely related problems of set cover and dominating set are described; in addition, practical algorithms for hitting set are compared on real and synthetic data.
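To make the definition concrete, the sketch below checks whether a candidate gene set hits every cell and finds a smallest hitting set on a toy instance. Exhaustive search is swapped in here purely for illustration, since it needs no solver; it is exponential in the number of genes and is not the ILP-based method used in this work. The function names and the toy cells are assumptions made for the example.

```python
from itertools import combinations


def is_hitting_set(candidate, subsets):
    """True if the candidate shares at least one element with every subset."""
    candidate = set(candidate)
    return all(candidate & s for s in subsets)


def minimum_hitting_set(universe, subsets):
    """Return a smallest hitting set by exhaustive search, or None if infeasible.

    An instance is infeasible when some subset contains no element of the
    universe, because no choice of elements can then hit it.
    """
    universe = sorted(universe)
    restricted = [set(s) & set(universe) for s in subsets]
    if any(not s for s in restricted):
        return None
    # Try candidate sizes in increasing order, so the first hit is minimum.
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            if is_hitting_set(combo, restricted):
                return set(combo)
    return None


# Toy instance: each cell is the set of target genes it expresses.
cells = [{"EGFR", "MET"}, {"MET", "KDR"}, {"EGFR", "KDR"}, {"CD44"}]
genes = {"EGFR", "MET", "KDR", "CD44"}
best = minimum_hitting_set(genes, cells)
```

In the toy instance, the last cell expresses only CD44, so CD44 must be selected, and no single further gene hits the remaining three cells; a minimum hitting set therefore has three genes.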
Among the applications of hitting set and closely related NP-complete problems in biology and biochemistry are stability analysis of metabolic networks79-83, identification of critical paths in gene signaling and regulatory networks84-86, and selection of a set of drugs to treat cell lines87-88 or single patients89-90. More information about related work can be found in Supplementary Materials 1.

Two different difficulties arising in problems such as hitting set are that 1) an instance may be infeasible, meaning that there does not exist a solution satisfying all constraints, and 2) an instance may be intractable, meaning that in the time available, one cannot either i) prove that the instance is infeasible or ii) find an optimal feasible solution. All instances of minimum hitting set that we considered were tractable on the NIH Biowulf system. Many instances were provably infeasible; in almost all cases, we did not plot the infeasible parameter combinations. However, in Figure 4, the instance for the melanoma data set with the more stringent parameters was infeasible because of only one patient sample, so we omitted that patient for both parameter settings in Figure 4.

Basic Optimal Target Set Formulation

Given a collection $S = \{S_1, S_2, S_3, \ldots\}$ of subsets of a set $U$, the hitting set problem is to find the smallest subset $H \subseteq U$ that intersects every set in $S$. The hitting set problem is equivalent to the set cover problem and hence is NP-hard (its decision version is NP-complete). The following ILP formulates this target set problem:

$$\min \sum_{g \in U} x(g)$$
$$\sum_{g \in S_i} x(g) \ge 1 \quad \forall S_i \in S \qquad (1)$$

In this formulation, there is a binary variable $x(g)$ for each element $g \in U$ that denotes whether the element $g$ is selected or not. Constraint (1) makes sure that from each set $S_i$ in $S$, at least one element is selected.

For any data set of tumor cells, we begin by specifying the set of genes that can be targeted; this set is $U$. Each cell is represented by the subset of genes in $U$ whose expression is greater than zero. In biological terms, a cell is killed (hit) if it expresses, at any level, one of the genes selected as a target (i.e., a gene in the optimal target set) in the treatment. In this initial formulation, all tumor cells are combined as if they come from one patient because we model that the treatment goal is to kill (hit) all tumor cells (all subsets). In a later subsection, we consider a fair version of this problem, taking into account that each patient is part of a cohort. Before that, we model the oncologist’s intuition that we want to target genes that are overexpressed in the tumor.

Combining Data on Tumor Cells and Non-Tumor Cells

To make the hitting set formulation more realistic, we model that a cell (set) is killed (hit) only if one of its targets is overexpressed compared to typical expression in non-cancer cells. Such modeling can be applied in the nine single-cell data sets that have data on non-cancer cells, to reflect the principle that we would like the treatment to kill the tumor cells and spare the non-tumor cells.

Let $NT$ be the set of non-tumor cells. For each gene $g$, define its average expression $E(g)$ as the arithmetic mean of all the non-zero values of the expression level of $g$ in cells in $NT$. The zeroes are ignored because many of them likely represent dropouts in the expression measurement. Following the design of experiments in the lab of N. A., we define an expression ratio threshold factor $r$ whose baseline value is 2.0. We adjust the formulation of the previous subsection so that the set representing a cell (in the tumor cell set) contains only those genes $g$ such that the expression of $g$ is greater than $r \times E(g)$ instead of greater than zero. We keep the objective function limited to the tumor cells, but we also store a set to represent each non-tumor cell, and we tabulate which non-tumor cells (sets) would be killed (hit) because, for at least one of the genes in the optimal target set, the expression of that gene in that non-tumor cell exceeds the threshold $r \times E(g)$. We add parameters $lb$ and $ub$, each in the range [0,1], representing respectively a lower bound on the proportion of tumor cells killed and an upper bound on the proportion of non-tumor cells killed. The parameters $lb$ and $ub$ are used only in two constraints, and we do not favor optimal solutions that kill more tumor cells or fewer non-tumor cells, so long as the solutions obey the constraints.

The Fair Cohort Target Set Problem for a Multi-Patient Cohort

We want to formulate an integer linear program that selects a set of genes $S^*$ from the available genes in such a way that, for each patient, there exists an individual target set $H_i^{S^*} \subseteq S^*$ of relatively small size (compared to the optimal ITS of that patient alone, which is denoted by $H(i)$; in the constraints below, $H(i)$ stands for the size of that set). Let $U = \{g_1, g_2, \ldots, g_{|U|}\}$ be the set of genes. There are $n$ patients. For the $i$th patient, we denote by $S_{P(i)}$ the set of tumor cells related to patient $i$.
For each tumor cell $C \in S_{P(i)}$, we describe it as the set of genes that are known to be targetable in cell $C$. That is, $g \in C$ if and only if a drug targeting $g$ can hit the cell $C$. In the ILP, there is a variable $x(g)$ corresponding to each gene $g \in U$ that shows whether the gene $g$ is selected or not. There is a variable $x(g, P(i))$ that shows whether a gene $g$ is selected in the target set of patient $P(i)$. The objective function is to minimize the total number of genes selected, subject to having a target set of size at most $H(i) + \alpha$ for patient $P(i)$, where $1 \le i \le n$. Constraint (3) ensures that, for patient $P(i)$, we do not select any gene $g$ that is not selected in the global set. Constraint (4) ensures that all the sets corresponding to tumor cells of patient $P(i)$ are hit.

$$\min \sum_{g \in U} x(g) \qquad (1)$$
$$\sum_{g \in U} x(g, P(i)) \le H(i) + \alpha \quad \forall i \qquad (2)$$
$$x(g, P(i)) \le x(g) \quad \forall i, \forall g \in U \qquad (3)$$
$$\sum_{g \in C} x(g, P(i)) \ge 1 \quad \forall i, \forall C \in S_{P(i)} \qquad (4)$$

Parameterization of the Fair Cohort Target Set Problem

In the Fair Cohort Target Set ILP shown above, we give preference to minimizing the number of genes needed in the CTS. However, we do not take into account the number of non-tumor cells killed. Killing (covering) too many non-tumor cells potentially hurts patients. To avoid that, we add an additional constraint to both the ILP for the local instances and the ILP for the global instance. Intuitively, for patient $P(i)$, given an upper bound $UB$ on the proportion of non-tumor cells killed, we want to find the smallest individual target set $H(i)$ with the following properties:

1. $H(i)$ covers all the tumor cells of patient $P(i)$.
2. $H(i)$ covers at most $UB \cdot |NT_{P(i)}|$ non-tumor cells, where $NT_{P(i)}$ is the set of non-tumor cells known for patient $P(i)$; whether a non-tumor cell $C$ is killed is represented by the variable $y(C)$.

The ILP can be formulated as follows:

$$\min \sum_{g \in U} x(g) \qquad (1)$$
$$\sum_{g \in C} x(g) \ge 1 \quad \forall C \in S_{P(i)} \qquad (2)$$
$$y(C) \ge \max_{g \in C} x(g) \quad \forall C \in NT_{P(i)} \qquad (3)$$
$$\sum_{C \in NT_{P(i)}} y(C) \le UB \cdot |NT_{P(i)}| \qquad (4)$$

With this formulation, the existence of a feasible solution is not guaranteed. However, covering all tumor cells might not always be necessary either, because (1) measurements are not always accurate, and some tumor cells could be missing from the data, and (2) in some cases, it might be possible to handle uncovered tumor cells using different methods. Hence, we add another parameter $LB$ to let us model this scenario. At a high level, $LB$ is the fraction of the tumor cells that we want to cover. The ILP can be formulated as follows:

$$\min \sum_{g \in U} x(g) \qquad (1)$$
$$\sum_{C \in S_{P(i)}} y(C) \ge LB \cdot |S_{P(i)}| \qquad (2)$$
$$y(C) \ge \max_{g \in C} x(g) \quad \forall C \in S_{P(i)} \cup NT_{P(i)} \qquad (3)$$
$$\sum_{C \in NT_{P(i)}} y(C) \le UB \cdot |NT_{P(i)}| \qquad (4)$$

Notice that constraint (2) here is different from the one above, as we only care about the total number of tumor cells covered. Even with both $UB$ and $LB$, the feasibility of the ILP is still not guaranteed. However, modeling the ILP in this way allows us to parameterize the ILP for various other scenarios of interest. While the two ILPs above are designed for one patient, one can extend these ILPs to a multi-patient cohort.
$$\min \sum_{g \in U} x(g) \qquad (1)$$
$$\sum_{g \in U} x(g, P(i)) \le H(i) + \alpha \quad \forall i \qquad (2)$$
$$x(g, P(i)) \le x(g) \quad \forall i, \forall g \in U \qquad (3)$$
$$y(C, P(i)) \ge \max_{g \in C} x(g, P(i)) \quad \forall i, \forall C \in S_{P(i)} \cup NT_{P(i)} \qquad (4)$$
$$\sum_{C \in S_{P(i)}} y(C, P(i)) \ge LB \cdot |S_{P(i)}| \quad \forall i \qquad (5)$$
$$\sum_{C \in NT_{P(i)}} y(C, P(i)) \le UB \cdot |NT_{P(i)}| \quad \forall i \qquad (6)$$

Implementation Note, Accounting for Multiple Optima and Software Availability

We implemented the above fair cohort target set formulations in Python 3, with the expression ratio $r$ as an option when non-tumor cells are available. The parameters $\alpha$, $lb$, and $ub$ can be set by the user on the command line. To solve the ILPs to optimality we usually used the SCIP library and its Python interface891. To obtain multiple optimal solutions of equal size we used the Gurobi library (https://www.gurobi.com) and its Python interface. When evaluating multiple optima, for all feasible instances, we sampled 50 optimal solutions that may or may not be distinct, using the Gurobi function select_solution(). To determine how often each gene or pair of genes occurs in optimal solutions, we computed the arithmetic mean of gene frequencies and gene pair frequencies over all sampled optimal solutions.

The software package is called MadHitter. The main program is called hitting_set.py. We include in MadHitter a separate program to sample cells and generate replicates, called sample_columns.py. So long as one seeks only a single optimal solution for each instance, either one of SCIP or Gurobi is sufficient to use MadHitter. We verified that SCIP and Gurobi give optimal solutions of the same size.
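As a rough illustration of how the expression ratio r and the bounds lb and ub interact, the sketch below builds the thresholded gene set of each cell and then searches for a smallest target set that satisfies both bounds. It is not the MadHitter implementation: exhaustive search stands in for the ILP solvers so that the example needs no external library, all function names are hypothetical, and treating E(g) as zero for genes never detected in non-tumor cells is an assumption made only for this example.

```python
from itertools import combinations


def mean_nonzero_expression(non_tumor_cells):
    """E(g): arithmetic mean of the non-zero values of gene g over non-tumor cells."""
    totals, counts = {}, {}
    for cell in non_tumor_cells:
        for gene, value in cell.items():
            if value > 0:  # zeroes are ignored as likely dropouts
                totals[gene] = totals.get(gene, 0.0) + value
                counts[gene] = counts.get(gene, 0) + 1
    return {gene: totals[gene] / counts[gene] for gene in totals}


def cell_to_gene_set(cell, baseline, r=2.0):
    """Genes whose expression in this cell exceeds r * E(g).

    Assumption: a gene with no non-zero value in non-tumor cells gets E(g) = 0.
    """
    return {g for g, v in cell.items() if v > r * baseline.get(g, 0.0)}


def fraction_hit(targets, cell_sets):
    """Proportion of cells that contain at least one selected target gene."""
    if not cell_sets:
        return 0.0
    return sum(1 for s in cell_sets if s & targets) / len(cell_sets)


def smallest_target_set(genes, tumor_sets, non_tumor_sets, lb=1.0, ub=1.0):
    """Smallest gene set hitting >= lb of tumor and <= ub of non-tumor cells.

    Returns None when no gene set satisfies both bounds (infeasible instance).
    """
    genes = sorted(genes)
    for size in range(len(genes) + 1):
        for combo in combinations(genes, size):
            targets = set(combo)
            if (fraction_hit(targets, tumor_sets) >= lb
                    and fraction_hit(targets, non_tumor_sets) <= ub):
                return targets
    return None


# Toy expression values (made up): each cell maps gene symbol -> expression.
non_tumor = [{"EGFR": 1.0, "MET": 2.0}, {"EGFR": 3.0, "MET": 0.0, "KDR": 1.0}]
tumor = [{"EGFR": 10.0, "MET": 1.0}, {"EGFR": 0.0, "MET": 9.0},
         {"EGFR": 5.0, "KDR": 3.0}]

E = mean_nonzero_expression(non_tumor)
tumor_sets = [cell_to_gene_set(c, E, r=2.0) for c in tumor]
non_tumor_sets = [cell_to_gene_set(c, E, r=2.0) for c in non_tumor]
strict = smallest_target_set(E.keys(), tumor_sets, non_tumor_sets, lb=1.0, ub=0.0)
relaxed = smallest_target_set(E.keys(), tumor_sets, non_tumor_sets, lb=0.6, ub=0.0)
```

Lowering lb from 1.0 to 0.6 in this toy instance reduces the target set from two genes to one, because a single gene already hits two of the three tumor cells.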
If one wants to sample multiple optima, this can be done only with the Gurobi library. The choice between SCIP and Gurobi and the number of optima to sample are controlled by the command-line parameters use_gurobi and num_sol, respectively. The MadHitter software is available on GitHub at https://github.com/ruppinlab/madhitter

Acknowledgements

This research is supported in part by the Intramural Research Program of the National Institutes of Health, National Cancer Institute. This research is supported in part by the University of Maryland Year of Data Science Program. This research is supported in part by start-up funds from Northwestern University and a research award from Amazon to support the research of S.K. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). Thanks to E. Michael Gertz for technical assistance with SCIP, Gurobi, and Biowulf. Thanks to Allon Wagner, Keren Yizhak and Sushant Patkar for assistance in identifying and retrieving suitable single-cell RNAseq data sets. Thanks to Leandro Hermida for technical advice.

Competing Interests

The authors declare that they have no competing interests.

References

1. Von Hoff, D.D., et al. Pilot study using molecular profiling of patients’ tumors to find potential targets and select treatments for their refractory cancers. J. Clin. Oncol. 28(33), 4877-4883 (2010).
2. Schütte, M., et al. Cancer precision medicine: Why more is more and DNA is not enough. Pub. Health Genomics 20(2), 70-80 (2017).
3. Jameson, G.S., et al. A pilot study utilizing multi-omic molecular profiling to find potential targets and select individualized treatments for patients with previously treated metastatic breast cancer. Breast Cancer Res. Treat. 147(3), 579-588 (2014).
4. Saulnier Sholler, G.L., et al. Feasibility of implementing molecular-guided therapy for the treatment of patients with relapsed or refractory neuroblastoma. Cancer Med. 4(6), 871-886 (2015).
5. Byron, S.A., et al. Prospective feasibility trial for genomics-informed treatment in recurrent and progressive glioblastoma. Clin. Cancer Res. 24(2), 295-305 (2018).
6. Schwaederle, M., et al. Association of biomarker-based treatment strategies with response rates and progression-free survival in refractory malignant neoplasms: A meta-analysis. JAMA Oncology 2(11), 1452-1459 (2016).
7. Arnedos, M., Vielh, P., Soria, J.C. & Andre, F. The genetic complexity of common cancers and the promise of personalized medicine: is there any hope? J. Pathol. 232(2), 274-282 (2014).
8. Nikanjam, M., Liu, S., Yang, J. & Kurzrock, R. Dosing three-drug combinations that include targeted anti-cancer agents: Analysis of 37,763 patients. Oncologist 22(5), 576-584 (2017).
9. Rebollo, J., et al. Gene expression profiling of tumors from heavily pretreated patients with metastatic cancer for the selection of therapy: A pilot study. Am. J. Clin. Oncol. 40(2), 140-145 (2017).
10. Sureda, M., et al. Determining personalized treatment by gene expression profiling in metastatic breast carcinoma patients: a pilot study. Clin. Trans. Oncol. 20(6), 785-793 (2018).
11. Sicklick, J.K., et al. Molecular profiling of cancer patients enables personalized combination therapy: the I-PREDICT study. Nat. Med. 25(5), 744-750 (2019).
12. Joo, J.I., et al. Realizing cancer precision medicine by integrating systems biology and nanomaterial engineering. Adv. Mater. 32(35), e1906783 (2020).
13. Marusyk, A. & Polyak, K. Tumor heterogeneity: Causes and consequences. Biochimica et Biophysica Acta 1805(1), 105-117 (2010).
14. McGranahan, N. & Swanton, C. Biological and therapeutic impact of intratumor heterogeneity in cancer evolution. Cancer Cell 27(1), 15-26 (2015).
15. Yofe, I., Dahan, R. & Amit, I. Single-cell genomic approaches for developing the next generation of immunotherapies. Nat. Med. 26(2), 171-177 (2020).
16. Neelapu, S.S., et al. Axicabtagene ciloleucel CAR T-cell therapy in refractory large B-cell lymphoma. New Engl. J. Med. 377(26), 2531-2544 (2017).
17. Bjorn, M.J., Ring, D. & Frankel, A. Evaluation of monoclonal antibodies for the development of breast cancer immunotoxins. Cancer Res. 45(3), 1214-1221 (1985).
18. Pastan, I., Willingham, M.C. & FitzGerald, D.J.P. Immunotoxins. Cell 47(5), 641-648 (1986).
19. Gray, B.P. & Brown, K.C. Combinatorial peptide libraries: mining for cell-binding peptides. Chem. Rev. 114(2), 1020-1081 (2014).
20. Liu, R., Li, X., Xiao, W. & Lam, K.S. Tumor-targeting peptides from combinatorial libraries. Adv. Drug Delivery Rev. 110-111, 13-37 (2017).
21. Fisher, S.L. & Phillips, A.J. Targeted protein degradation and the enzymology of degraders. Curr. Opin. Chem. Biol. 44, 47-55 (2018).
22. Plückthun, A. Designed ankyrin repeat proteins (DARPins): binding proteins for research, diagnostics, and therapy. Annu. Rev. Pharmacol. Toxicol. 55, 489-511 (2015).
23. Sokolova, E.A., et al. HER2-specific targeted toxin DARPin-LoPE: Immunogenicity and antitumor effect on intraperitoneal ovarian cancer xenograft model. Int. J. Mol. Sci. 20(10), E2399 (2019).
24. Dannenfelser, R., et al. Discriminatory power of combinatorial antigen recognition in cancer T cell therapies. Cell Syst. 11(3), 215-228 (2020).
25. MacKay, M., et al. The therapeutic landscape for cells engineered with chimeric antigen receptors. Nat. Biotech. 38(2), 233-244 (2020).
26. Maude, S.L., et al. Tisagenlecleucel in children and young adults with B-cell lymphoblastic leukemia. New Engl. J. Med. 378(5), 439-448 (2018).
27. Lamers, C.H., et al. Treatment of metastatic renal cell carcinoma with CAIX CAR-engineered T cells: clinical evaluation and management of on-target toxicity. Mol. Ther. 21(4), 904-912 (2013).
28. Thistlethwaite, F.C., et al. The clinical efficacy of first-generation carcinoembryonic antigen (CEACAM5)-specific CAR T cells is limited by poor persistence and transient pre-conditioning-dependent respiratory toxicity. Cancer Immunol. Immunother. 66(11), 1425-1436 (2017).
29. Fedorov, V.D., Themeli, M. & Sadelain, M. PD-1- and CTLA-4-based inhibitory chimeric antigen receptors (iCARs) divert off-target immunotherapy responses. Sci. Transl. Med. 5(215), 215ra172 (2013).
30. Grada, Z., et al. TanCAR: A novel bispecific chimeric antigen receptor for cancer immunotherapy. Mol. Ther. Nucl. Acids 2, e105 (2013).
31. Hegde, M., et al. Combinational targeting offsets antigen escape and enhances effector function of adoptively transferred T cells in glioblastoma. Mol. Ther. 21(11), 2087-2101 (2013).
32. Roybal, K.T., et al. Engineering T cells with customized therapeutic response using synthetic Notch receptors. Cell 167(2), 419-43.e16 (2016).
33. Williams, J.Z., et al. Precise T cell recognition programs designed by transcriptionally linking multiple receptors. Science 370(6520), 1099-1104 (2020).
34. Říhová, B. Receptor-mediated targeted drug or toxin delivery. Adv. Drug Deliv. Rev. 29(3), 273-289 (1998).
35. Tortorella, S. & Karagiannis, T.C. Transferrin receptor-mediated endocytosis: a useful target for cancer therapy. J. Membr. Biol. 247(4), 291-307 (2014).
36. Karp, R.M. Reducibility among combinatorial problems. In Complexity of Computer Computations, pp. 85-103 (Plenum Press, New York, 1972).
37. Martinez-Veracoechea, F.J. & Frenkel, D. Designing super selectivity in multivalent nano-particle binding. Proc. Natl Acad. Sci. USA 108, 10963-10968 (2011).
38. Delaney, C., et al. Combinatorial prediction of marker panels from single-cell transcriptomic data. Mol. Syst. Biol. 15(10), e9005 (2019).
39. Müller, S., et al. A role for receptor tyrosine phosphatase zeta in glioma cell migration. Oncogene 22(43), 661-668 (2003).
40. Ulbricht, U., et al. Expression and function of the receptor protein tyrosine phosphatase zeta and its ligand pleiotrophin in human astrocytomas. J. Neuropathol. Exp. Neurol. 62(12), 1265-1275 (2003).
41. Chen, H.M., et al. Enhanced expression and phosphorylation of the MET oncoprotein by glioma-specific PTPRZ1-MET fusions. FEBS Lett. 589(13), 1437-1443 (2015).
42. Ulbricht, U., Eckerich, C., Fillbrandt, R., Westphal, M. & Lamszus, K. RNA interference targeting protein tyrosine phosphatase zeta/receptor-type protein tyrosine phosphatase beta suppresses glioblastoma growth in vitro and in vivo. J. Neurochem. 98, 1497-1506 (2006).
43. Bourgonje, A.M., et al. Intracellular and extracellular domains of protein tyrosine phosphatase PTPRZ-B differentially regulate glioma cell growth and motility. Oncotarget 5(18), 8690-8702 (2014).
44. Fujikawa, A., et al. Targeting PTPRZ inhibits stem cell-like properties and tumorigenicity in glioblastoma cells. Sci. Rep. 7, 5609 (2017).
45. Pastor, M., et al. Development of inhibitors of receptor protein tyrosine phosphatase β/ζ (PTPRZ1) as candidates for CNS disorders. Eur. J. Med. Chem. 144, 318-329 (2018).
46. Darmanis, S., et al. Single-cell RNA-Seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell Rep. 21, 1399-1410 (2017).
47. Bergelson, J.M., et al. Isolation of a common receptor for coxsackie B viruses and adenoviruses 2 and 5. Science 275(5304), 1320-1323 (1997).
48. Nilchian, A., et al. CXADR-mediated formation of an AKT inhibitory signalosome at tight junctions controls epithelial-mesenchymal plasticity in breast cancer. Cancer Res. 79(1), 47-60 (2019).
49. Arfelt, K.N., et al. Signaling via G proteins mediates tumorigenic effects of GPR87. Cell. Signal. 30, 9-18 (2017).
50. Zhang, Y., Qian, Y., Lu, W., Chen, X.
The G protein-coupled receptor 87 is necessary for p53-dependent cell survival in response to genotoxic stress. Cancer Res. 69(15):6049-6056 (2009). 51. Wang, L., et al. Overexpression of G protein-coupled receptor GPR87 promotes pancreatic cancer aggressiveness and activates NF-κB signaling pathway. Mol. Cancer 16:61 (2017). 52. Szklarczyk, D., et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019; 47(D1), D607-D613 (2019). 53. Seoane, J. & De Mattos-Arruda L. The challenge of intratumour heterogeneity in precision medicine. J. Intern. Med. 276(1), 41-51 (2014). 54. Van Dijk D, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174(3):716-729.e27 (2018). 55. Li, W.V. & Li, J.J. An accurate and robust imputation method scImpute for single-cell RNA- seq data. Nat. Comm. 9, 997 (2018). 56. Raj, A. & van Oudenaarden, A. Stochastic gene expression and its consequences. Cell 135(2): 216-226 (2008). 57. Kim, J.Y. & Marioni, J.C. Inferring the kinetics of stochastic gene expression from single- cell RNA-sequencing data. Genome Biol. 14, R7 (2013). 58. Clough, E. & Barrett, T. The Gene Expression Omnibus Database. Meth. Mol. Biol. 1418, 93-110 (2016). 59. Kolesnikov, N., et al. ArrayExpress update--simplifying data submissions. Nucleic Acids Res. 43(Database Issue), D1113-D1116 (2015). 60. Cao, Y., Zhu, J., Jia, P. & Zhao Z. scRNASeqDB: A database for RNA-seq based gene expression profiles in human single cells. Genes 8(12), 368 (2017). 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC The copyright holder for this preprintthis version posted February 12, 2021. 
; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint https://doi.org/10.1101/2020.01.28.923532 45 61. Yuan J, et al. Single-cell transcriptome analysis of lineage diversity in high-grade glioma. Genome Med. 10(1), 57 (2018). 62. Karaayvaz, M., et al. Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq. Nat. Comm. 9(1), 3588 (2018). 63. Jerby-Arnon, L., et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell 175(4):984-997.e24 (2018). 64. Tirosh, I., et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352(6282), 189-196 (2016). 65. Chung, W., et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat. Comm. 8,15081 (2017). 66. Venteicher, A.S., et al. Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science 355(6332): pii:eaai8478 (2017). 67. Tirosh, I., et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539(7628), 309-313 (2016a). 68. Patel, A.P., et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190), 1396-1401 (2014). 69. Filbin, M.G., et al. Developmental and oncogenic programs in H3K27M gliomas dissected by single-cell RNA-seq. Science 360(6386), 331-335 (2018). 70. Li, H., et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49(5), 708-718 (2017). 71. Puram, S.V., et al. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell 171(7), 1611-1624.e24 (2017). 72. Shih, A.J., et al. Identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell RNA-seq. PLoS ONE 13(11), e0208778 (2018). 73. Miyamoto, D.T., et al. 
RNA-Seq of single prostate CTCs implicates noncanonical Wnt signaling in antiandrogen resistance. Science 349, 1351-1356 (2015). 74. Lambrechts, D., et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat. Med. 24, 1277-1289 (2018). 75. Uhlén, M., et al. Tissue-based map of the human proteome. Science 347(6220), 1260419 (2015). 76. The GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 45(6), 580-585 (2013). 77. Bausch-Fluck, D., et al. The in silico human surfaceome. Proc. Natl. Acad. Sci USA 115, E10988-E10997 (2018). 78. Gainer-Dewar, A. & Vera-Lincona, P. The minimal hitting set generation problem: Algorithms and computation. SIAM J. Discr. Math. 31, 63-100 (2017). 79. Haedlicke, O. & Klamt, S. Computing complex metabolic intervention strategies using constrained minimal cut sets. Metabolic Eng. 13, 204-213 (2011). 80. Haus, U.-U., Klamt, S. & Stephen, T. Computing knock-out strategies in metabolic networks. J. Comput. Biol. 15(3), 259-268 (2008). 81. Jarrah, A.S., Laubenbacher, R., Stigler, B. & Stillman, M. Reverse-engineering of polynomial dynamical systems. Adv. Appl. Math. 39, 477-489 (2007). 82. Klamt, S. & Gilles, E.D. Minimal cut sets in biochemical reaction networks. Bioinformatics 20, 226-234 (2004). 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint https://doi.org/10.1101/2020.01.28.923532 46 83. Trinh, C.T., Wlaschin, A. & Srienc F. Elementary mode analysis: a useful metabolic pathway analysis tool for characterizing cellular metabolism. Appl. Microbiol. Biotech. 81, 813-826 (2009). 84. Ideker, T. Discovery of regulatory interactions through perturbation: inference and experimental design. Pac. Symp. 
Biocomput. 5, 302-313 (2000). 85. Wang, R.S. & Albert, R. Elementary signaling modes predict the essentiality of signal transduction network components. BMC Syst. Biol. 5, 44 (2011). 86. Zvedei-Oancea, I. & Schuster, S. A theoretical framework for detecting signal transfer routes in signaling networks. Comput. Chem. Engineer. 29, 597-617 (2005). 87. Vazquez A. Optimal drug combinations and minimal hitting sets. BMC Syst. Biol., 3, 81 (2009). 88. Mellor, D., Prieto, E., Mathieson, L. & Moscato, P. A kernelisation approach for multiple d- hitting set and its application in optimal multi-drug therapeutic combinations. PLoS ONE 5(10), e13055 (2010). 89. Vera-Licona, P., Bonnet, E., Brillot, E & Zinovyev, A. OCSANA: optimal combinations of interventions from network analysis. Bioinformatics 29, 1571-1573 (2013). 90. Pang, K., et al. Combinatorial therapy discovery using mixed integer linear programming. Bioinformatics 30, 1456-1463 (2014). 91. Achterberg, T. SCIP: Solving constraint integer programs. Math. Program. Comput. 1(1), 1- 41 (2009). 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint https://doi.org/10.1101/2020.01.28.923532 10_1101-2020_09_02_279521 ---- Simulating the outcome of amyloid treatments in Alzheimer’s disease from imaging and clinical data Simulating the outcome of amyloid treatments in Alzheimer's disease from imaging and clinical data Clément Abi Nader1, Nicholas Ayache1, Giovanni B. Frisoni2, Philippe Robert3, Marco Lorenzi1, for the Alzheimer’s Disease Neuroimaging Initiative* In this study we investigate a novel quantitative instrument for the development of intervention strategies for disease modifying drugs in Alzheimer's disease. 
Our framework is based on the modeling of the spatio-temporal dynamics governing the joint evolution of imaging and clinical biomarkers along the history of the disease, and allows the simulation of the effect of intervention time and drug dosage on the biomarkers' progression. When applied to multi-modal imaging and clinical data from the Alzheimer's Disease Neuroimaging Initiative, our method makes it possible to generate hypothetical scenarios of amyloid lowering interventions. The results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage. Our experimental simulations are compatible with the outcomes observed in past clinical trials, and suggest that anti-amyloid treatments should be administered at least 7 years earlier than what is currently being done in order to obtain statistically powered improvement of clinical endpoints.
1 Université Côte d'Azur, INRIA Sophia Antipolis, EPIONE Research Project, France. 2 Memory Clinic and LANVIE-Laboratory of Neuroimaging of Aging, Hospitals and University of Geneva, Geneva, Switzerland. 3 Université Côte d'Azur, CoBTeK lab, MNC3 program, France.
* Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/). This version posted February 10, 2021. bioRxiv preprint doi: https://doi.org/10.1101/2020.09.02.279521
Correspondence to: Clément Abi Nader, EPIONE Research Project, INRIA Sophia-Antipolis, 2004, route des Lucioles, 06902 Sophia-Antipolis, France. E-mail: clement.abi-nader@inria.fr
Keywords: Alzheimer’s Disease; Clinical trials; Disease progression; Amyloid hypothesis; Biomarkers
Abbreviations: DPM = Disease Progression Model; ODE = Ordinary Differential Equations; ADNI = Alzheimer's Disease Neuroimaging Initiative; NL = Healthy; MCI = Mild Cognitive Impairment; AD = Alzheimer's dementia; AV45 = (18)F-florbetapir Amyloid; FDG = (18)F-fluorodeoxyglucose; ADAS11 = Alzheimer's Disease Assessment Scale; MMSE = Mini-Mental State Examination; FAQ = Functional Assessment Questionnaire; RAVLT = Rey Auditory Verbal Learning Test; CDRSB = Clinical Dementia Rating Scale Sum of Boxes
Introduction
The number of people affected by Alzheimer's disease has recently exceeded 46 million and is expected to double every 20 years (Prince et al., 2015), thus posing significant healthcare challenges. Yet, while the disease mechanisms remain in large part unknown, there are still no effective pharmacological treatments leading to tangible improvements of patients' clinical progression. One of the main challenges in understanding Alzheimer's disease is that its progression goes through a silent asymptomatic phase that can stretch over decades before a clinical diagnosis can be established based on cognitive and behavioral symptoms. To help design appropriate intervention strategies, hypothetical models of the disease history have been proposed, characterizing the progression by a cascade of morphological and molecular changes affecting the brain, ultimately leading to cognitive impairment (Jack et al., 2013; Jack & Holtzman, 2013).
The dominant hypothesis is that disease dynamics along the asymptomatic period are driven by the deposition in the brain of the amyloid β peptide, triggering the so-called “amyloid cascade” (Bateman et al., 2012; Braak & Braak, 1991; Delacourte et al., 1999; Murphy & LeVine, 2010; Villemagne et al., 2013). Based on this rationale, clinical trials have been focusing on the development and testing of disease modifiers targeting amyloid β aggregates (Cummings, Lee, et al., 2019), for example by increasing its clearance or blocking its accumulation. Although the amyloid hypothesis has been recently invigorated by a post-hoc analysis of the aducanumab trial (Howard & Liu, 2020), clinical trials have so far failed to show efficacy of this kind of treatment (Schwarz et al., 2019), as the clinical primary endpoints were not met (Egan et al., 2019; Honig et al., 2018; Wessels et al., 2019), or because of unacceptable adverse effects (Henley et al., 2019). In the past years, a growing consensus has emerged about the critical importance of intervention time, and about the need to start anti-amyloid treatments during the pre-symptomatic stages of the disease (Aisen et al., 2018). Nevertheless, the design of optimal intervention strategies is currently not supported by quantitative analysis methods that can model and assess the effect of intervention time and dosing (Klein et al., 2019).
The availability of models of the pathophysiology of Alzheimer’s disease would offer great potential to test and analyze clinical hypotheses characterizing Alzheimer’s disease mechanisms, progression, and intervention scenarios. Within this context, quantitative models of disease progression, referred to as Disease Progression Models (DPMs), have been proposed (Fonteijn et al., 2012; Jedynak et al., 2012; Nader et al., 2020; Oxtoby et al., 2017; Schiratti et al., 2015) to quantify the dynamics of the changes affecting the brain during the whole disease span. These models rely on the statistical analysis of large datasets of different data modalities, such as clinical scores, or brain imaging measures derived from MRI, Amyloid- and Fluorodeoxyglucose-PET (Bilgel et al., 2015; Burnham et al., 2020; Donohue et al., 2014; Y Iturria-Medina et al., 2016; Koval et al., 2018). In general, DPMs estimate a long-term disease evolution from the joint analysis of multivariate time-series acquired on a short-term time-scale. Due to the temporal delay between the disease onset and the appearance of the first symptoms, DPMs rely on the identification of an appropriate temporal reference to describe the long-term disease evolution (Lorenzi et al., 2017; Marinescu et al., 2019). These tools are promising approaches for the analysis of clinical trials data, as they represent the longitudinal evolution of multiple biomarkers through a global model of disease progression. Such a model can be subsequently used as a reference in order to stage subjects and quantify their relative progression speed (Insel et al., 2020; Li et al., 2019; Oxtoby et al., 2018; Young et al., 2014). However, these approaches remain purely descriptive as they do not account for causal relationships among biomarkers. Therefore, they generally cannot simulate progression scenarios based on hypothetical intervention strategies, and thus provide only a limited interpretation of the pathological dynamics.
This latter capability is of utmost importance for planning and assessment of disease modifying treatments.
To fill this gap, recent works such as (Hao & Friedman, 2016; Petrella et al., 2019) proposed to model Alzheimer’s disease progression based on specific assumptions on the biochemical processes of pathological protein propagation. These approaches explicitly define biomarker interactions through the specification of sets of Ordinary Differential Equations (ODEs), and are ideally suited to simulate the effect of drug interventions (Yasser Iturria-Medina et al., 2017). However, these methods are mostly based on the arbitrary choice of pre-defined evolution models, which are not inferred from data. This issue was recently addressed by (Garbarino & Lorenzi, 2019), where the authors proposed a hybrid modeling method combining traditional DPMs with dynamical models of Alzheimer’s disease progression. Still, since this approach requires the design of suitable models of protein propagation across brain regions, extending this method to jointly account for spatio-temporal interactions between several processes, such as amyloid propagation, glucose metabolism, and brain atrophy, is considerably more complex. Finally, these methods are usually designed to account for imaging data only, which prevents the joint simulation of heterogeneous measures (Antelmi et al., 2019), such as image-based biomarkers and clinical outcomes, the latter remaining the reference markers for patients and clinicians.
In this work we present a novel computational model of Alzheimer’s disease progression that allows the simulation of intervention strategies across the history of the disease. The model is here used to quantify the potential effect of amyloid modifiers on the progression of brain atrophy, glucose metabolism, and ultimately on the clinical outcomes for different scenarios of intervention. To this end, we model the joint spatio-temporal variation of different modalities along the history of Alzheimer’s disease by identifying a system of ODEs governing the pathological progression. This latent ODE system is specified within an interpretable low-dimensional space relating multi-modal information, and combines clinically-inspired constraints with unknown interactions that we wish to estimate. The interpretability of the relationships in the latent space is ensured by mapping each data modality to a specific latent coordinate. The model is formulated within a Bayesian framework, where the latent representation and dynamics are efficiently estimated through stochastic variational inference. To generate hypothetical scenarios of amyloid lowering interventions, we apply our approach to multi-modal imaging and clinical data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our results provide a meaningful quantification of different intervention strategies, compatible with findings previously reported in clinical studies. For example, we estimate that in a study with 100 individuals per arm, statistically powered improvement of clinical endpoints can be obtained by completely arresting amyloid accumulation at least 11 years before Alzheimer's dementia. The minimum intervention time decreases to 7 years for studies based on 1000 individuals per arm.
Materials and methods
In the following sections, healthy individuals will be denoted as NL stable, subjects with mild cognitive impairment as MCI stable, and subjects diagnosed with Alzheimer's dementia as AD. We define conversion as the change of diagnosis towards a more pathological state. Therefore, NL converters are subjects who were diagnosed as cognitively normal at baseline and whose diagnosis changed to either MCI or AD during their follow-up visits. MCI converters are subjects who were diagnosed as MCI at baseline and subsequently progressed to AD. Diagnosis was established using the DX column from the ADNIMERGE file (https://adni.bitbucket.io/index.html), which reflects the standard ADNI clinical assessment based on Wechsler Memory Scale, Mini-Mental State Examination, and Clinical Dementia Rating. Amyloid concentration and glucose metabolism are respectively measured by (18)F-florbetapir Amyloid (AV45)-PET and (18)F-fluorodeoxyglucose (FDG)-PET imaging. Cognitive and functional abilities are assessed by the following neuro-psychological tests: Alzheimer's Disease Assessment Scale (ADAS11), Mini-Mental State Examination (MMSE), Functional Assessment Questionnaire (FAQ), Rey Auditory Verbal Learning Test (RAVLT) immediate, RAVLT learning, RAVLT forgetting, and Clinical Dementia Rating Scale Sum of Boxes (CDRSB).
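The group definitions above can be illustrated with a minimal sketch. It assumes each subject's diagnoses are available as a chronologically ordered list; the labels "NL", "MCI" and "AD" and the function name clinical_group are placeholders chosen for this example and do not reflect the actual coding of the ADNIMERGE DX column. Using the last available visit to decide conversion to AD is also an assumption of this sketch.

```python
from typing import List

# Hypothetical diagnosis labels, ordered from least to most pathological.
SEVERITY = {"NL": 0, "MCI": 1, "AD": 2}


def clinical_group(diagnoses: List[str]) -> str:
    """Assign a clinical group from a chronologically ordered list of diagnoses.

    The first entry is the baseline diagnosis; the remaining entries are
    follow-up visits.
    """
    if not diagnoses:
        raise ValueError("at least a baseline diagnosis is required")
    for d in diagnoses:
        if d not in SEVERITY:
            raise ValueError(f"unknown diagnosis label: {d}")

    baseline, follow_up = diagnoses[0], diagnoses[1:]
    if baseline == "AD":
        return "AD"
    if baseline == "NL":
        # NL converter: diagnosis changed to MCI or AD during follow-up.
        worsened = any(SEVERITY[d] > SEVERITY["NL"] for d in follow_up)
        return "NL converter" if worsened else "NL stable"
    # Baseline MCI: converter if the last available diagnosis is AD (assumption).
    converted = bool(follow_up) and follow_up[-1] == "AD"
    return "MCI converter" if converted else "MCI stable"
```

For instance, a subject with visits ["MCI", "MCI", "AD"] would be labelled an MCI converter under these assumptions, while ["NL", "NL"] would be labelled NL stable.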
Study cohort and biomarkers' changes across clinical groups
Our study is based on a cohort of 442 amyloid positive individuals composed of 71 NL stable subjects, 33 NL converters, 131 MCI stable subjects, 105 MCI converters, and 102 AD patients. Among the 131 MCI stable subjects, 78 were early MCI and 53 were late MCI. Concerning the group of MCI converters, 80 subjects were late MCI at baseline and 25 were early MCI. The term “amyloid positive” refers to subjects whose amyloid level in the CSF was below the nominal cutoff of 192 pg/ml (Gamberger et al., 2017) either at baseline or during any follow-up visit. Conversion to AD was determined using the last available follow-up information. This preliminary selection of patients aims at constituting a cohort of subjects for whom it is more likely to observe “Alzheimer’s pathological changes” (Jack et al., 2018). The length of follow-up varies between 0 and 16 years. Further information about the data is available at https://adni.bitbucket.io/reference/, while details on data acquisition and processing are provided in Section Data acquisition and preprocessing. We show in Table 1A socio-demographic information for the training cohort across the different clinical groups.
Table 1B shows baseline values and annual rates of change across clinical groups for amyloid burden (average normalized AV45 uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), glucose metabolism (average normalized FDG uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), hippocampal and medial temporal lobe volumes, and cognitive ability as measured by ADAS11. Consistent with previously reported results (Cash et al., 2015; Schuff et al., 2009), we observe that while regional atrophy, glucose metabolism and cognition show an increasing rate of change when moving from healthy to pathological conditions, the change of AV45 is maximum in NL stable, NL converters and MCI stable subjects. We also notice the increased magnitude of ADAS11 in AD as compared to the other clinical groups. Finally, we note that glucose metabolism and regional atrophy show comparable magnitudes of change. The observations presented in Table 1 provide us with a coarse representation of the biomarkers' trajectories characterizing Alzheimer’s disease. The complexity of the dynamical changes we may infer is limited, as the clinical stages only roughly approximate a temporal scale describing the disease history, while very little insight can be obtained about the biomarkers' interactions. Within this context, our model allows the quantification of the fine-grained dynamical relationships across biomarkers at stake during the history of the disease. Intervention scenarios can subsequently be investigated by suitably modulating the estimated dynamics parameters according to specific intervention hypotheses (e.g. amyloid lowering at a certain time).
Model overview
We provide in Figure 1 an overview of the presented method.
Baseline multi-modal imaging and clinical information for a given subject is transformed into a latent variable composed of four z-scores quantifying respectively the overall severity of atrophy, glucose metabolism, amyloid burden, and cognitive and functional assessment. The model estimates the dynamical relationships across these z-scores to optimally describe the temporal transitions between follow-up observations. These transition rules are here mathematically defined by the parameters of a system of ODEs, which is estimated from the data. This dynamical system makes it possible to compute the evolution of the z-scores over time from any baseline observation, and to predict the associated multi-modal imaging and clinical measures. It is important to note that this modelling choice requires at least one visit per patient for which all the measures are available, in order to compute the temporal evolution of the z-scores.
Table 1 A: Baseline socio-demographic information for training cohort (442 subjects for 2781 data points, follow-up from 0 to 16 years depending on subjects). Average values, standard deviation in parentheses. B: Baseline values (bl) and annual rates of change (% change / year) of amyloid burden (average normalized AV45 uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), glucose metabolism (average normalized FDG uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), hippocampus volume, medial temporal lobe volume, and ADAS11 score for the different clinical groups.
Median values, interquartile range in brackets. The volumes of the hippocampus and the medial temporal lobe are averaged across left and right hemispheres. NL: healthy individuals, MCI: individuals with mild cognitive impairment, AD: patients with Alzheimer's dementia. APOE4: apolipoprotein E ε4. FDG: (18)F-fluorodeoxyglucose Positron Emission Tomography (PET) imaging. AV45: (18)F-florbetapir Amyloid PET imaging. SUVR: Standardized Uptake Value Ratio. MTL: Medial Temporal Lobe. ADAS11: Alzheimer's Disease Assessment Scale-cognitive subscale, 11 items.

A: Socio-demographics
                                    NL stable             NL converters         MCI stable            MCI converters        AD
N                                   71                    33                    131                   105                   102
Age (yrs)                           74 (6)                76 (4)                72 (8)                73 (7)                74 (8)
Education (yrs)                     16 (2)                17 (2)                16 (3)                16 (3)                16 (2)
APOE4-carrier (%)                   41                    51                    61                    75                    71

B: Biomarkers and rates of change
                                    NL stable             NL converters         MCI stable            MCI converters        AD
Global AV45 (SUVR), bl              1.21 [1.06 ; 1.37]    1.36 [1.28 ; 1.55]    1.27 [1.12 ; 1.44]    1.41 [1.29 ; 1.53]    1.45 [1.34 ; 1.57]
Global AV45 (SUVR), % change/year   0.80 [0.1 ; 2.2]      1.24 [0.43 ; 2.2]     1.21 [0.1 ; 2.5]      0.03 [-1.5 ; 1.4]     0.06 [-1.9 ; 3.3]
Global FDG (SUVR), bl               1.27 [1.19 ; 1.34]    1.22 [1.16 ; 1.33]    1.28 [1.19 ; 1.36]    1.16 [1.05 ; 1.25]    1.04 [0.97 ; 1.14]
Global FDG (SUVR), % change/year    -0.47 [-1.8 ; 0.9]    -1.6 [-2.2 ; -1.0]    -0.92 [-2.9 ; 0.0]    -3.1 [-4.8 ; -1.4]    -5.0 [-7.9 ; -2.0]
Hippocampus (ml), bl                3.7 [3.4 ; 4.0]       3.5 [3.1 ; 3.8]       3.5 [3.1 ; 3.8]       3.1 [2.7 ; 3.4]       2.8 [2.5 ; 3.2]
Hippocampus (ml), % change/year     -1.6 [-2.2 ; -0.4]    -1.8 [-3.3 ; -2.2]    -1.5 [-3.3 ; -0.7]    -3.8 [-5.1 ; -2.3]    -4.5 [-6.8 ; -2.0]
MTL (ml), bl                        10.0 [9.3 ; 10.5]     9.8 [8.5 ; 10.5]      10.4 [9.8 ; 11.2]     9.1 [8.2 ; 10.1]      8.5 [7.6 ; 9.3]
MTL (ml), % change/year             -0.8 [-2.0 ; 0.1]     -1.3 [-2.3 ; -0.7]    -1.0 [-2.2 ; 0.4]     -2.9 [-4.7 ; -1.5]    -5.0 [-7.9 ; -1.9]
ADAS11, bl                          5.0 [3.1 ; 7.0]       8.0 [5.0 ; 12.2]      9.0 [6.0 ; 11.6]      14.3 [11.0 ; 20.0]    22.0 [17.0 ; 28.0]
ADAS11, % change/year               0.1 [-0.2 ; 0.8]      1.7 [-0.6 ; 2.8]      1.2 [0.3 ; 2.8]       5.0 [2.3 ; 8.4]       10.3 [4.1 ; 21.0]

The model thus makes it possible to simulate the pathological progression of
biomarkers across the entire history of the disease. Once the model is estimated, we can modify the parameters of the ODEs to simulate different evolution scenarios according to specific hypotheses. For example, by reducing the parameters associated with the progression rate of amyloid, we can investigate the relative change in the evolution of the other biomarkers. This setup thus provides us with a data-driven system enabling the exploration of hypothetical intervention strategies, and their effect on the pathological cascade.

Figure 1 Overview of the method. a) High-dimensional multi-modal measures are projected into a 4-dimensional latent space. Each data modality is transformed into a corresponding z-score zamy, zmet, zatr, zcli. b) The dynamical system describing the relationships between the z-scores makes it possible to compute their transition across the evolution of the disease. c) Given the latent space and the estimated dynamics, the follow-up measurements can be reconstructed to match the observed data.

Data modelling

We consider observations $\mathbf{X}_i(t) = [\mathbf{x}_i^1(t), \mathbf{x}_i^2(t), \ldots, \mathbf{x}_i^M(t)]^T$, which correspond to multivariate measures derived from M different modalities (e.g. clinical scores, MRI, AV45, or FDG measures) at time t for subject i. Each vector $\mathbf{x}_i^m(t)$ has dimension $D_m$. We postulate the following generative model, in which the modalities are assumed to be independently generated by a common latent representation of the data $\mathbf{z}_i(t)$:

$$
\begin{aligned}
p(\mathbf{X}_i(t) \mid \mathbf{z}_i(t), \boldsymbol{\sigma}^2, \boldsymbol{\psi}) &= \prod_m p(\mathbf{x}_i^m(t) \mid \mathbf{z}_i(t), \sigma_m^2, \psi_m) = \prod_m \mathcal{N}\big(\mu_m(\mathbf{z}_i(t), \psi_m), \sigma_m^2\big), \\
\mathbf{z}_i(t) &= \Sigma(\mathbf{z}_i(t_0), t), \\
\mathbf{z}_i(t_0) &\sim p(\mathbf{z}_i(t_0)),
\end{aligned}
\tag{1}
$$

where $\sigma_m^2$ is measurement noise, while $\psi_m$ are the parameters of the function $\mu_m$ which maps the latent state to the data space for the modality m. For simplicity of notation we denote $\mathbf{z}_i(t)$ by $\mathbf{z}(t)$. We assume that each coordinate of $\mathbf{z}$ is associated to a specific modality m, leading to an M-dimensional latent space. The $\Sigma$ operator, which gives the value of the latent representation at a given time t, is defined by the solution of the following system of ODEs:

$$
\frac{dz^m(t)}{dt} = k_m z^m(t)\big(1 - z^m(t)\big) + \sum_{j \neq m} \alpha_{m,j}\, z^j(t), \quad m = 1, \ldots, M.
\tag{2}
$$

For each coordinate, the first term of the equation enforces a sigmoidal evolution with a progression rate $k_m$, while the second term accounts for the relationship between modalities m and j through the parameters $\alpha_{m,j}$. This system can be rewritten as:

$$
\frac{d\mathbf{z}(t)}{dt} = \mathbf{W}\mathbf{z}(t) - \mathbf{V}\mathbf{z}^2(t) = g(\mathbf{z}(t), \theta_{ODE}),
\quad \text{where} \quad
\mathbf{W}_{i,j} = \begin{cases} k_i & \text{if } i = j, \\ \alpha_{i,j} & \text{otherwise,} \end{cases}
\quad \text{and} \quad
\mathbf{V}_{i,j} = \begin{cases} k_i & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}
\tag{3}
$$

with $\mathbf{z}^2(t)$ denoting the element-wise square of $\mathbf{z}(t)$. $\theta_{ODE}$ denotes the parameters of the system of ODEs, which correspond to the entries of the matrices $\mathbf{W}$ and $\mathbf{V}$.
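As a sanity check on the rewriting of Equation (2) as Equation (3), the following minimal sketch builds W and V from a set of progression rates and coupling parameters and verifies that both formulations give the same derivative. The parameter values, the coordinate ordering and the function names are hypothetical and chosen only for illustration; they are not the values estimated in this study, and the sketch uses plain Python rather than the PyTorch implementation.

```python
def build_matrices(k, alpha):
    """Build W and V of Equation (3) from rates k[m] and couplings alpha[m][j]."""
    M = len(k)
    W = [[k[i] if i == j else alpha[i][j] for j in range(M)] for i in range(M)]
    V = [[k[i] if i == j else 0.0 for j in range(M)] for i in range(M)]
    return W, V


def g(z, W, V):
    """Right-hand side of Equation (3): W z - V z^2, with z^2 taken element-wise."""
    M = len(z)
    return [
        sum(W[i][j] * z[j] - V[i][j] * z[j] ** 2 for j in range(M))
        for i in range(M)
    ]


def rhs_eq2(z, k, alpha):
    """Right-hand side of Equation (2), written one coordinate at a time."""
    M = len(z)
    return [
        k[m] * z[m] * (1.0 - z[m])
        + sum(alpha[m][j] * z[j] for j in range(M) if j != m)
        for m in range(M)
    ]


# Hypothetical parameters (illustration only, not estimated from data).
# Assumed coordinate order: amyloid, metabolism, atrophy, clinical.
k = [0.30, 0.20, 0.15, 0.10]
alpha = [
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.02, 0.00],
    [0.04, 0.03, 0.00, 0.00],
    [0.02, 0.04, 0.06, 0.00],
]
z = [0.8, 0.4, 0.3, 0.1]

W, V = build_matrices(k, alpha)
matrix_form = g(z, W, V)
coordinate_form = rhs_eq2(z, k, alpha)
# The two formulations agree up to floating-point rounding.
assert all(abs(a - b) < 1e-12 for a, b in zip(matrix_form, coordinate_form))
```

The diagonal entries of alpha are never read, since Equation (2) sums over j different from m only.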
According to Equation (3), for each initial condition $\mathbf{z}(0)$, the latent state at time t can be computed through integration, $\mathbf{z}(t) = \mathbf{z}(0) + \int_0^t g(\mathbf{z}(x), \theta_{ODE})\,dx$. We resort to variational inference and stochastic gradient descent in order to optimize the parameters of the model. The procedure is detailed in Sections Variational inference and Model optimization of the Supplementary Material.

Simulating the long-term progression of Alzheimer’s disease

To simulate the long-term progression of Alzheimer’s disease we first project the AD subjects into the latent space via the encoding functions. We can subsequently follow the trajectories of these subjects backward and forward in time, in order to estimate the associated trajectory from the healthy to their respective pathological condition. In practice, a Gaussian Mixture Model is used to fit the empirical distribution of the AD subjects' latent projections. The number of components and the covariance type of the Gaussian Mixture Model are selected by relying on the Akaike information criterion (Akaike, 1998). The fitted Gaussian Mixture Model allows us to sample pathological latent representations $\mathbf{z}_i(t_0)$ that can be integrated forward and backward in time thanks to the estimated set of latent ODEs, to finally obtain a collection of latent trajectories $\mathbf{Z}(t) = [\mathbf{z}_1(t), \ldots, \mathbf{z}_N(t)]$ summarizing the distribution of the long-term Alzheimer’s disease evolution.

Simulating intervention

In this section we assume that we computed the average latent progression of the disease $\mathbf{z}(t)$.
Thanks to the modality-wise encoding (cf. Supplementary section Variational inference) each coordinate of the latent representation can be interpreted as representing a single data modality. Therefore, we propose to simulate the effect of a hypothetical intervention on the disease progression, by modulating the vector $\frac{d\mathbf{z}(t)}{dt}$ after each integration step such that:

$$
\left(\frac{d\mathbf{z}(t)}{dt}\right)^{*} = \boldsymbol{\Gamma}\,\frac{d\mathbf{z}(t)}{dt},
\quad \text{where} \quad
\boldsymbol{\Gamma} = \mathrm{diag}(\gamma_1, \ldots, \gamma_M).
\tag{4}
$$

The values $\gamma_m$ are fixed between 0 and 1, which makes it possible to control the influence of the corresponding modalities on the system evolution, and to create hypothetical scenarios of evolution. For example, for a 100% (resp. 50%) amyloid lowering intervention we set $\gamma_{amy} = 0$ (resp. $\gamma_{amy} = 0.5$).

Evaluating disease severity

Given an evolution $\mathbf{z}(t)$ describing the disease progression in the latent space, we propose to consider this trajectory as a reference and to use it in order to quantify the individual disease severity of a subject $\mathbf{X}$. This is done by estimating a time-shift $\tau$ defined as:

$$
\tau = \operatorname*{argmin}_t \, \lVert f(\mathbf{X}, \phi_1) - \mathbf{z}(t) \rVert_1,
\quad \text{with} \quad
\lVert f(\mathbf{X}, \phi_1) - \mathbf{z}(t) \rVert_1 = \sum_m \lvert f(\mathbf{x}^m, \phi_1) - z^m(t) \rvert.
\tag{5}
$$

This time-shift quantifies the pathological stage of a subject with respect to the disease progression along the reference trajectory $\mathbf{z}(t)$. Moreover, the time-shift can still be estimated even in the case of missing data modalities, by only encoding the available measures of the observed subject.

Statistical analysis

The model was implemented using the PyTorch library (Paszke et al., 2019).
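The integration of the latent ODEs, the intervention of Equation (4) and the time-shift of Equation (5) can be illustrated with a minimal sketch. It assumes an explicit Euler scheme with a fixed step, hypothetical parameter values, and a reference trajectory stored on a regular time grid; the integration scheme actually used in the study is not specified here, and all function names are illustrative.

```python
def g(z, k, alpha):
    """Right-hand side of Equation (2) for every coordinate m."""
    M = len(z)
    return [
        k[m] * z[m] * (1.0 - z[m])
        + sum(alpha[m][j] * z[j] for j in range(M) if j != m)
        for m in range(M)
    ]


def integrate(z0, k, alpha, gamma, t_end, dt=0.01):
    """Explicit Euler integration from t=0 to t_end (negative t_end goes backward).

    After each evaluation of dz/dt, coordinate m is scaled by gamma[m],
    which is the diagonal matrix Gamma of Equation (4).
    """
    h = dt if t_end >= 0 else -dt
    z = list(z0)
    trajectory = [list(z)]
    for _ in range(int(round(abs(t_end) / dt))):
        dz = g(z, k, alpha)
        z = [z_m + h * gamma_m * dz_m for z_m, gamma_m, dz_m in zip(z, gamma, dz)]
        trajectory.append(list(z))
    return trajectory


def time_shift(z_subject, trajectory, dt, observed=None):
    """Equation (5): time on the reference grid minimizing the L1 distance.

    `observed` lists the indices of the available modalities, so that
    missing coordinates are simply left out of the sum.
    """
    coords = list(range(len(z_subject))) if observed is None else observed

    def l1(n):
        return sum(abs(z_subject[m] - trajectory[n][m]) for m in coords)

    return min(range(len(trajectory)), key=l1) * dt


# Hypothetical parameters; assumed order: amyloid, metabolism, atrophy, clinical.
k = [0.30, 0.20, 0.15, 0.10]
alpha = [
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.02, 0.00],
    [0.04, 0.03, 0.00, 0.00],
    [0.02, 0.04, 0.06, 0.00],
]
z0 = [0.10, 0.05, 0.05, 0.02]

natural = integrate(z0, k, alpha, gamma=[1.0, 1.0, 1.0, 1.0], t_end=20.0)
treated = integrate(z0, k, alpha, gamma=[0.0, 1.0, 1.0, 1.0], t_end=20.0)  # gamma_amy = 0

# A subject whose latent state coincides with the reference after 5 years.
tau = time_shift(natural[500], natural, dt=0.01)
```

In this sketch the returned time is measured from the start of the reference trajectory; relating it to years from conversion would require the additional alignment described in the Results.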
The estimated disease severity was compared group-wise via a two-sided Wilcoxon-Mann-Whitney test (P < 0.01). Differences between the distributions of clinical outcomes after simulation of intervention were assessed via a two-sided Student’s t-test (P < 0.01). Shaded areas in the figures show ± standard deviation of the mean.

Data availability

The data used in this study are available from the ADNI database (adni.loni.usc.edu).

Results

In the following, MRI, FDG-PET, and AV45-PET images are processed in order to extract, respectively, regional gray matter density, glucose metabolism and amyloid load from a brain parcellation. The z-scores of gray matter atrophy (zatr), glucose metabolism (zmet), and amyloid burden (zamy) are computed using the measures obtained by this pre-processing step. The clinical z-score zcli is derived from neuro-psychological scores: ADAS11, MMSE, FAQ, RAVLT immediate, RAVLT learning, RAVLT forgetting and CDRSB. This panel of scores was chosen to provide a comprehensive representation of cognitive, memory and functional abilities.

Data acquisition and preprocessing

Data used in the preparation of this article were obtained from the ADNI database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. For up-to-date information, see www.adni-info.org. We considered four types of biomarkers, related to clinical scores, gray matter atrophy, amyloid load and glucose metabolism, and respectively denoted by cli, atr, amy and met.
MRI images were processed following the longitudinal pipeline of Freesurfer (Reuter et al., 2012), to obtain gray matter volumes in a standard anatomical space. AV45-PET and FDG-PET images were aligned to the closest MRI in time and normalized to the cerebellum uptake. Regional gray matter density, amyloid load and glucose metabolism were extracted from the Desikan-Killiany parcellation (Desikan et al., 2006). We discarded white-matter, ventricular, and cerebellar regions, thus obtaining 82 regions that were averaged across hemispheres. Therefore, for a given subject, xatr, xamy and xmet are each 41-dimensional vectors. The variable xcli is composed of the neuro-psychological scores ADAS11, MMSE, RAVLT immediate, RAVLT learning, RAVLT forgetting, FAQ, and CDRSB. The dataset comprises a total of 2781 longitudinal data points. We recall that the model estimation requires a visit for which all the measures are available in order to obtain the z-score evolution of a given subject, but can handle missing data in the follow-up by finding the parameters that best match the available measures.

Progression model and latent relationships

We show in Figure 2 panel I) the dynamical relationships across the different z-scores estimated by the model, where the direction and intensity of the arrows quantify the estimated increase of one variable with respect to the other. Since the scores are dimensionless, they have been rescaled to the range [0,1], with higher values indicating increasing pathological levels. These relationships extend the summary statistics reported in Table 1 to a much finer temporal scale and a wider range of possible biomarker values. We observe in Figure 2A, 2B and 2C that large values of the amyloid score zamy trigger the increase of the remaining ones: zmet, zatr, and zcli. Figure 2D shows that a large increase of the atrophy score zatr is associated with pathological glucose metabolism, indicated by large values of zmet. Moreover, we note that high zmet values also contribute to an increase of zcli (Figure 2E). Finally, Figure 2F shows that high atrophy values lead to an increase mostly along the clinical dimension zcli. This chain of relationships is in agreement with the cascade hypothesis of AD (Jack et al., 2013; Jack & Holtzman, 2013).

Relying on the dynamical relationships shown in Figure 2 panel I), starting from any initial set of biomarker values we can estimate the relative trajectories over time. Figure 2 panel II) (left) shows the evolution obtained by extrapolating backward and forward in time the trajectory associated with the z-scores of the AD group. The x-axis represents the years from conversion to AD, where the instant t=0 corresponds to the average time of diagnosis estimated for the group of MCI progressing to dementia. As observed in Figure 2 panel I) and Table 1, the amyloid score zamy increases and saturates first, followed by the zmet and zatr scores, whose progression slows down when reaching clinical conversion, while the clinical score exhibits strong acceleration in the latest progression stages. Figure 2 panel II) (right) shows the group-wise distribution of the disease severity estimated for each subject relative to the modelled long-term latent trajectories.
The group-wise difference of disease severity across groups is statistically significant and increases when going from healthy to pathological stages (Wilcoxon-Mann-Whitney test p < 0.01 for each comparison). The reliability of the estimation of disease severity was further assessed through testing on an independent cohort, and by comparison with a previously proposed state-of-the-art disease progression modeling method (Lorenzi et al., 2017). The results are provided in section Time-shift comparison and validation of the Supplementary Material and show positive generalization results as well as a favorable comparison with the benchmark method.

Figure 2 Dynamical relationships, z-scores evolution and disease staging. Panel I: Estimated dynamical relationships across the different z-scores (A to F). Given the values of two z-scores, the arrow at the corresponding coordinates indicates how one score evolves with respect to the other. The intensity of the arrow gives the strength of the relationship between the two scores. Panel II, left: Estimated long-term latent dynamics (time is relative to conversion to Alzheimer's dementia). Shaded areas represent the standard deviation of the average trajectory. Panel II, right: Distribution of the estimated disease severity across clinical stages, relative to the long-term dynamics on the left. NL: normal individuals, MCI: mild cognitive impairment, AD: Alzheimer's dementia.

From the z-score trajectories of Figure 2 panel II) (left) we predict the progression of imaging and clinical measures shown in Figure 3. We observe that amyloid load globally increases and saturates early, consistent with the amyloid-positive status of the study cohort. Abnormal glucose metabolism and gray matter atrophy are delayed with respect to amyloid, and tend to map predominantly to temporal and parietal regions. Finally, the clinical measures exhibit a non-linear pattern of change, accelerating during the latest progression stages. These dynamics are compatible with the summary measures on the raw data reported in Table 1.

Figure 3 Model-based progression of Alzheimer’s disease. Estimated long-term evolution of cortical measurements for the different types of imaging markers, and clinical scores. Shaded areas represent the standard deviation of the average trajectory. Brain images were generated using the software provided in (Marinescu et al., 2019).

Simulating clinical intervention

This experimental section is based on two intervention scenarios: a first one in which amyloid is lowered by 100%, and a second one in which it is reduced by 50% with respect to the estimated natural progression. In Figure 4 we show the latent z-scores evolution resulting from either 100% or 50% amyloid lowering performed at the time t=-20 years. According to these scenarios, intervention results in a noticeable reduction of the pathological progression for atrophy, glucose metabolism and clinical scores, albeit with a stronger effect in the case of total blockage.

We further estimated the resulting clinical endpoints associated with the two amyloid lowering scenarios, at increasing time points and for different sample sizes. Clinical endpoints consisted of the simulated ADAS11, MMSE, FAQ, RAVLT immediate, RAVLT learning, RAVLT forgetting and CDRSB scores at the reference conversion time (t=0). The placebo case denotes the scenario where clinical values were computed at conversion time from the estimated natural progression shown in Figure 2 panel II) (left). Figure 5 shows the change in statistical power depending on intervention time and sample size. For large sample sizes (1000 subjects per arm) a power greater than 0.8 can be obtained around 7 years before conversion, depending on the outcome score, where in general we observe that RAVLT forgetting exhibits a higher power than the other scores. When the sample size is lower than 100 subjects per arm, a power greater than 0.8 is reached if intervention is performed at the latest 11 years before conversion, with a mild variability depending on the considered clinical score. We notice that in the case of 50% amyloid lowering, in order to reach the same power, intervention needs to be consistently performed earlier compared to the scenario of 100% amyloid lowering for the same sample size and clinical score.
For instance, if we consider ADAS11 with a sample size of 100 subjects per arm, a power of 0.8 is obtained for a 100% amyloid lowering intervention performed 11.5 years before conversion, while in the case of a 50% amyloid lowering the equivalent effect would be obtained by intervening 15 years before conversion. We provide in Table 2 the estimated improvement for each clinical score at conversion with a sample size of 100 subjects per arm for both 100% and 50% amyloid lowering depending on the intervention time. We observe that for the same intervention time, 100% amyloid lowering always results in a larger improvement of clinical endpoints compared to 50% amyloid lowering. We also note that in the case of 100% lowering, clinical endpoints obtained for intervention at t=-15 years correspond to typical cutoff values for inclusion into Alzheimer’s disease trials (ADAS11 = 13.7 ± 5.8, MMSE = 25.7 ± 2.5, see Supplementary Table 2) (Gamberger et al., 2017; Kochhann et al., 2010).

Figure 4 Simulation of amyloid lowering intervention on the z-scores evolution. Hypothetical scenarios of irreversible amyloid lowering interventions at t=-20 years from Alzheimer's dementia diagnosis, with a rate of 100% (left) or 50% (right). Shaded areas represent the standard deviation of the average trajectory.

Discussion

We presented a framework to jointly model the progression of multi-modal imaging and clinical data, based on the estimation of latent biomarkers' relationships governing Alzheimer’s disease progression.
The model is designed to simulate intervention scenarios in clinical trials, and in this study we focused on assessing the effect of anti-amyloid drugs on biomarkers' evolution, by quantifying the effect of intervention time and drug efficacy on clinical outcomes. Our results underline the critical importance of intervention time: intervention should be performed sufficiently early during the pathological history for the effectiveness of disease modifiers to be appreciated. The results obtained with our model are compatible with findings reported in recent clinical studies (Egan et al., 2019; Honig et al., 2018; Wessels et al., 2019). For example, if we consider 500 patients per arm and perform a 100% amyloid lowering intervention for 2 years to reproduce the conditions of the recent trial of Verubecestat (Egan et al., 2019), the average improvement of MMSE predicted by our model is 0.02, falling within the 95% confidence interval measured during that study ([-0.5 ; 0.8]). While recent anti-amyloid trials such as (Egan et al., 2019; Honig et al., 2018; Wessels et al., 2019) included between 500 and 1000 mild AD subjects per arm and were conducted over a period of two years at most, our analysis suggests that clinical trials performed with fewer than 1000 subjects with mild AD may be consistently under-powered. Indeed, we see in Figure 5 that with a sample size of 1000 subjects per arm and a total blockage of amyloid production, a power of 0.8 can be obtained only if intervention is performed at least 7 years before conversion.
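To make the dependence of power on sample size and effect size concrete, the sketch below uses the standard normal approximation to the power of a two-sided two-sample test at the 0.01 level. This is an illustrative approximation and not the procedure used to produce Figure 5, which is based on the simulated clinical outcomes; the effect size, the standard deviation and the function name are hypothetical.

```python
from statistics import NormalDist


def power_two_arm(delta, sd, n_per_arm, alpha=0.01):
    """Approximate power of a two-sided two-sample test (normal approximation).

    delta: true difference in mean outcome between placebo and treated arms
    sd: standard deviation of the outcome, assumed equal in both arms
    n_per_arm: number of subjects in each arm
    """
    std_normal = NormalDist()
    standard_error = sd * (2.0 / n_per_arm) ** 0.5
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / standard_error
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical effect: a 0.5-point difference on a score with SD 5.
powers = {n: power_two_arm(delta=0.5, sd=5.0, n_per_arm=n) for n in (100, 500, 1000)}
```

For a fixed effect, power grows with the number of subjects per arm; conversely, the small improvements obtained for late interventions require very large samples to be detected.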
Figure 5 Evolution of the statistical power in different intervention scenarios. Statistical power of the Student t-test comparing the estimated clinical outcomes at conversion time between placebo and treated scenarios, according to the year of simulated intervention (100% and 50% amyloid lowering) and sample size.

Table 2: Estimated mean (standard deviation) improvement of clinical outcomes at predicted conversion time for the normal progression case by year of simulated intervention (100% and 50% amyloid lowering interventions). Results in bold indicate a statistically significant difference between placebo and treated scenarios (p<0.01, two-sided t-test, 100 cases per arm). AD: Alzheimer's dementia, ADAS11: Alzheimer's Disease Assessment Scale, MMSE: Mini-Mental State Examination, FAQ: Functional Assessment Questionnaire, RAVLT: Rey Auditory Verbal Learning Test, CDRSB: Clinical Dementia Rating Scale Sum of Boxes.

Amyloid lowering intervention 100%: point improvement per intervention time (years)
                   -20           -15           -12.5         -10           -5            -3            -2            -1
ADAS11             11.1 (6.4)    5.2 (2.9)     3.0 (1.7)     1.6 (1.0)     0.3 (0.2)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)
MMSE               4.9 (2.8)     2.3 (1.3)     1.3 (0.8)     0.7 (0.4)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
FAQ                9.6 (5.6)     4.5 (2.5)     2.6 (1.5)     1.4 (0.8)     0.2 (0.2)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)
RAVLT immediate    15.3 (8.9)    7.2 (4.1)     4.2 (2.4)     2.3 (1.4)     0.5 (0.3)     0.2 (0.1)     0.1 (0.1)     0.0 (0.0)
RAVLT learning     2.7 (1.6)     1.3 (0.7)     0.7 (0.4)     0.4 (0.2)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
RAVLT forgetting   37.2 (21.5)   17.7 (9.9)    10.5 (6.0)    5.8 (3.5)     1.3 (0.9)     0.5 (0.4)     0.2 (0.2)     0.1 (0.1)
CDRSB              3.5 (2.0)     1.6 (0.9)     0.9 (0.5)     0.5 (0.3)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)

Amyloid lowering intervention 50%: point improvement per intervention time (years)
                   -20           -15           -12.5         -10           -5            -3            -2            -1
ADAS11             5.0 (2.5)     2.4 (1.2)     1.4 (0.7)     0.8 (0.4)     0.2 (0.1)     0.1 (0.0)     0.0 (0.0)     0.0 (0.0)
MMSE               2.2 (1.1)     1.0 (0.5)     0.6 (0.3)     0.4 (0.2)     0.1 (0.0)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
FAQ                4.3 (2.1)     2.0 (1.0)     1.2 (0.6)     0.7 (0.4)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
RAVLT immediate    6.9 (3.4)     3.3 (1.6)     1.9 (1.0)     1.2 (0.6)     0.2 (0.1)     0.1 (0.1)     0.0 (0.0)     0.0 (0.0)
RAVLT learning     1.2 (0.6)     0.6 (0.3)     0.3 (0.2)     0.2 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
RAVLT forgetting   16.7 (8.2)    8.1 (4.0)     4.8 (2.5)     2.9 (1.6)     0.6 (0.4)     0.2 (0.2)     0.1 (0.1)     0.0 (0.0)
CDRSB              1.6 (0.8)     0.7 (0.4)     0.4 (0.2)     0.2 (0.1)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)

These results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage (Aisen et al., 2018; Sperling et al., 2011). This is for example illustrated in Table 2, in which we notice that clinical endpoints are close to placebo even when the simulated intervention takes place 10 years before conversion, while stronger cognitive and functional changes happen when amyloid is lowered by 100% or 50% earlier. These findings may be explained by considering that amyloid accumulates over more than a decade, and that when amyloid clearance occurs the pathological cascade is already entrenched (Rowe et al., 2010). Our results thus support the need to identify subjects at the pre-clinical stage, that is to say while they are still cognitively normal, which is a challenging task. Currently, one of the main criteria to enroll subjects into clinical trials is the presence of amyloid in the brain, and blood-based markers are considered as potential candidates for identifying patients at risk for Alzheimer’s disease (Zetterberg & Burnham, 2019). Moreover, recent works such as (Blennow et al., 2010; Westwood et al., 2016) have proposed more complex entry criteria to constitute cohorts based on multi-modal measurements. Within this context, our model could also be used as an enrichment tool by quantifying the disease severity based on multi-modal data as shown in Figure 2 panel II) (right). Similarly, the method could be applied to predict the evolution of a single patient given their currently available measurements.

An additional critical aspect of anti-amyloid trials is the effect of dose exposure on the production of amyloid (Klein et al., 2019). Currently, β-site amyloid precursor protein cleaving enzyme (BACE) inhibitors can suppress amyloid production by 50% to 90%. In this study we showed that lowering amyloid by 50% consistently decreases the treatment effect compared to a 100% lowering at the same time.
For instance, if we consider a sample size of 1000 subjects per arm in the case of a 50% amyloid lowering intervention, 80% power can be reached only 10 years before conversion instead of 7 years for a 100% amyloid lowering intervention. This ability of our model to control the rate of amyloid progression is fundamental in order to provide realistic simulations of anti-amyloid trials.

In Figure 2 panel I) we showed that amyloid triggers the pathological cascade affecting the other markers, thus confirming its dominating role in disease progression. Assuming that the data used to estimate the model are sufficient to completely depict the history of the pathology, our model can be interpreted from a causal perspective. However, we cannot exclude the existence of other mechanisms driving amyloid accumulation, which our model cannot infer from the existing data. Therefore, our findings should be considered with care, while the integration of additional biomarkers of interest will be necessary to account for multiple drivers of the disease. It is worth noting that recent works have put forward the idea of combining drugs targeting multiple mechanisms at the same time (Gauthier et al., 2019).
For instance, pathologists have shown tau deposition in brainstem nuclei in adolescents and children (Kaufman et al., 2018), and clinicians are currently investigating the pathological effect of early tau spreading on Alzheimer’s disease progression (Pontecorvo et al., 2019), raising crucial questions about its relationship with amyloid accumulation, and the impact on cognitive impairment (Cummings, Blennow, et al., 2019). In this study, 190 subjects underwent at least one Tau-PET scan. However, when considering the subjects for whom there exists one visit in which all the data modalities were available, the number of patients in the study cohort decreased to 33. This low sample size prevented us from estimating reliable trajectories for this biomarker. It is also important to note that among the 190 subjects with at least one Tau-PET scan, only 19 had one follow-up visit. This means that the dynamics of tau markers cannot be reliably estimated. Including tau data will require studies on larger cohorts with complete sets of PET imaging acquisitions. This could be part of future extensions of this work, where the inclusion of tau markers would make it possible to simulate scenarios of production blockage of both amyloid and tau at different rates or intervention times.

Lately, disappointing results of clinical studies have led researchers to hypothesize specific treatments targeting AD sub-populations based on their genotype (Safieh et al., 2019). While in our work we describe a global progression of Alzheimer’s disease, in the future we will account for sub-trajectories due to genetic factors, such as the presence of the ε4 allele of apolipoprotein E (APOE4), which is a major risk factor for developing Alzheimer’s disease, influencing both disease onset and progression (Kim et al., 2009). This could be done by estimating dynamical systems specific to the genetic condition of each patient.
This was not possible in this study due to a strong imbalance between the number of carriers and non-carriers across the different clinical groups (cf. Table 1). Indeed, we observe that the number of ADNI non-carriers is much lower than the number of carriers, especially in the latest stages of the disease (MCI converters and AD). On the contrary, the majority of NL stable subjects are non-carriers. Therefore, applying the model in such conditions would lead to a bias towards the more represented groups during the different stages of the disease progression (APOE4- at early stages and APOE4+ at late ones), thus preventing us from differentiating the biomarkers dynamics based on the genetic status. Yet, simulating dynamical relationships specific to genetic factors is a crucial avenue of improvement of our approach, as it would make it possible to evaluate the effect of APOE4 on intervention time or drug dosage. In addition to this example, there exist numerous non-genetic aggravating factors that may also affect disease evolution, such as diabetes, obesity or smoking. Extending our model to account for panels of risk factors would ultimately make it possible to test personalized intervention strategies in silico. Moreover, a key aspect of clinical trials is their economic cost. Our model could be extended to help design clinical trials by optimizing intervention with respect to the available funding. Given a budget, we could simulate scenarios based on different sample sizes and trial durations, while estimating the expected cognitive outcome.
Results presented in this work are based on a model estimated by relying solely on a subset of subjects and measures from the ADNI cohort, and therefore they may not be fully representative of the general Alzheimer’s disease progression. Indeed, subjects included in this cohort were either amyloid-positive at baseline, or became amyloid-positive during their follow-up visits. This was motivated by the consideration that evidence of pathological amyloid levels is a necessary condition for diagnosing AD as it puts subjects within the “Alzheimer’s disease continuum” (Jack et al., 2018). By narrowing the list of subjects to a subgroup of amyloid-positive individuals, we increase the chances of selecting a set of patients likely to develop the disease. Moreover, the inclusion of subjects at various clinical stages allows us to span the entire spectrum of morphological and physiological changes affecting the brain. Through the joint analysis of markers of amyloid, neurodegeneration and cognition, our model estimates the average trajectory that best describes the progression of the observed measures when going from NL individuals towards AD patients. The selection of amyloid-positive patients aims at increasing the signal of Alzheimer’s pathological changes within this cohort, in order to estimate long-term dynamics for the biomarkers that can be associated with the disease. We believe that this modeling choice is based on a clinically plausible rationale, and allows us to perform our study on a sufficiently large cohort enabling the estimation of our model. Bearing this in mind, we acknowledge the potential presence of bias towards the specific inclusion criterion adopted in this work. Indeed, the present results may provide a limited representation of the pathological temporal window captured by the model. For example, applying the model to a cohort containing amyloid-negative subjects may provide additional insights on the overall disease history.
However, this is a challenging task as it would require identifying sub-trajectories dissociated from normal ageing (Lorenzi et al., 2015; Sivera et al., 2020). Another potential bias affecting the results may come from the choice of the clinical scores used to estimate our model. In this study, we relied on a panel of 7 neuro-psychological assessments providing a comprehensive representation of cognitive, memory and functional abilities: ADAS11, MMSE, RAVLT immediate, RAVLT learning, RAVLT forgetting, FAQ, and CDRSB. The choice of these particular scores is consistent with previous literature on DPM (Donohue et al., 2014; Lorenzi et al., 2017). However, it is important to note that our model can handle any type of clinical assessment. Therefore, investigating the effect of adding supplementary clinical scores on the model’s findings would be an interesting future application of our approach, and could be done without any modification of its current formulation. Finally, in addition to these specific characteristics of the cohort, there exist additional biases impacting the model estimation. For instance, the fact that gray matter atrophy and glucose metabolism become abnormal approximately at the same time in Figure 3 can be explained by the high atrophy rate of change in some key regions in normal elders, such as in the hippocampus, compared to the rate of change of FDG (see Table 1). We note that this stronger change of atrophy with respect to glucose metabolism can already be appreciated in the clinically healthy group.
Conclusion

In this study we investigated a novel quantitative instrument for the development of intervention strategies for disease modifying drugs in AD. Our framework enables the simulation of the effect of intervention time and drug dosage on the evolution of imaging and clinical biomarkers in clinical trials. The proposed data-driven approach is based on the modeling of the spatio-temporal dynamics governing the joint evolution of imaging and clinical measurements throughout the disease. The model is formulated within a Bayesian framework, where the latent representation and dynamics are efficiently estimated through stochastic variational inference. To generate hypothetical scenarios of amyloid lowering interventions, we applied our approach to multi-modal imaging and clinical data from ADNI. The results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage. Our experimental simulations are compatible with the outcomes observed in past clinical trials and suggest that anti-amyloid treatments should be administered at least 7 years earlier than what is currently being done in order to obtain statistically powered improvement of clinical endpoints.

Funding

This work has been supported by the French government, through the UCAJEDI and 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ref. no. ANR-15-IDEX-01 and ANR-19-P3IA-0002), the grant AAP Santé 06 2017-260 DGA-DSH, and by the INRIA Sophia-Antipolis-Méditerranée, "NEF" computation cluster.

Acknowledgements

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) and DOD ADNI. ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California.
ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Competing interests

The authors declare no competing interests.

References

Aisen, P. S., Siemers, E., Michelson, D., Salloway, S., Sampaio, C., Carrillo, M. C., Sperling, R., Doody, R., Scheltens, P., Bateman, R., Weiner, M., & Vellas, B. (2018). What Have We Learned from Expedition III and EPOCH Trials? Perspective of the CTAD Task Force. J Prev Alzheimers Dis, 5(3), 171–174.
Akaike, H. (1998). Information Theory and an Extension of the Maximum Likelihood Principle. In Selected Papers of Hirotugu Akaike (pp. 199–213). Springer New York. https://doi.org/10.1007/978-1-4612-1694-0_15
Antelmi, L., Ayache, N., Robert, P., & Lorenzi, M. (2019, June). Sparse Multi-Channel Variational Autoencoder for the Joint Analysis of Heterogeneous Data. ICML 2019 - 36th International Conference on Machine Learning.
Bateman, R. J., Xiong, C., Benzinger, T. L. S., Fagan, A. M., Goate, A., Fox, N. C., Marcus, D. S., Cairns, N. J., Xie, X., Blazey, T. M., Holtzman, D. M., Santacruz, A., Buckles, V., Oliver, A., Moulder, K., Aisen, P. S., Ghetti, B., Klunk, W. E., McDade, E., … Morris, J. C. (2012). Clinical and Biomarker Changes in Dominantly Inherited Alzheimer’s Disease. New England Journal of Medicine, 367(9), 795–804.
Bilgel, M., Jedynak, B., Wong, D. F., Resnick, S. M., & Prince, J. L. (2015). Temporal Trajectory and Progression Score Estimation from Voxelwise Longitudinal Imaging Measures: Application to Amyloid Imaging. Inf Process Med Imaging, 24, 424–436.
Blennow, K., Hampel, H., Weiner, M., & Zetterberg, H. (2010). Cerebrospinal fluid and plasma biomarkers in Alzheimer disease. Nat Rev Neurol, 6(3), 131–144.
Braak, H., & Braak, E. (1991). Neuropathological stageing of Alzheimer-related changes. Acta Neuropathol., 82(4), 239–259.
Burnham, S. C., Fandos, N., Fowler, C., Pérez-Grijalba, V., Dore, V., Doecke, J. D., Shishegar, R., Cox, T., Fripp, J., Rowe, C., Sarasa, M., Masters, C. L., Pesini, P., & Villemagne, V. L. (2020). Longitudinal evaluation of the natural history of amyloid-β in plasma and brain. Brain Communications, 2(1). https://doi.org/10.1093/braincomms/fcaa041
Cash, D. M., Frost, C., Iheme, L. O., Ünay, D., Kandemir, M., Fripp, J., Salvado, O., Bourgeat, P., Reuter, M., Fischl, B., Lorenzi, M., Frisoni, G. B., Pennec, X., Pierson, R. K., Gunter, J. L., Senjem, M. L., Jack, C. R., Guizard, N., Fonov, V. S., … Ourselin, S. (2015). Assessing atrophy measurement techniques in dementia: Results from the MIRIAD atrophy challenge. Neuroimage, 123, 149–164.
Cummings, J., Blennow, K., Johnson, K., Keeley, M., Bateman, R. J., Molinuevo, J. L., Touchon, J., Aisen, P., & Vellas, B. (2019). Anti-Tau Trials for Alzheimer’s Disease: A Report from the EU/US/CTAD Task Force. J Prev Alzheimers Dis, 6(3), 157–163.
Cummings, J., Lee, G., Ritter, A., Sabbagh, M., & Zhong, K. (2019). Alzheimer’s disease drug development pipeline: 2019. Alzheimers Dement (N Y), 5, 272–293.
Delacourte, A., David, J. P., Sergeant, N., Buée, L., Wattez, A., Vermersch, P., Ghozali, F., Fallet-Bianco, C., Pasquier, F., Lebert, F., Petit, H., & Di Menza, C. (1999). The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer’s disease. Neurology, 52(6), 1158–1165.
Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., Albert, M. S., & Killiany, R. J. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3), 968–980.
Donohue, M. C., Jacqmin-Gadda, H., Goff, M. Le, Thomas, R. G., Raman, R., Gamst, A. C., Beckett, L. A., Jack, C. R., Weiner, M. W., Dartigues, J.-F., & Aisen, P. S. (2014). Estimating long-term multivariate progression from short-term data. Alzheimer’s & Dementia, 10(5, Supplement), S400–S410. https://doi.org/10.1016/j.jalz.2013.10.003
Egan, M. F., Kost, J., Voss, T., Mukai, Y., Aisen, P. S., Cummings, J. L., Tariot, P. N., Vellas, B., van Dyck, C. H., Boada, M., Zhang, Y., Li, W., Furtek, C., Mahoney, E., Harper Mozley, L., Mo, Y., Sur, C., & Michelson, D. (2019). Randomized Trial of Verubecestat for Prodromal Alzheimer’s Disease. N. Engl. J. Med., 380(15), 1408–1420.
Fonteijn, H. M., Modat, M., Clarkson, M. J., Barnes, J., Lehmann, M., Hobbs, N. Z., Scahill, R. I., Tabrizi, S. J., Ourselin, S., Fox, N. C., & Alexander, D. C. (2012). An event-based model for disease progression and its application in familial Alzheimer’s disease and Huntington’s disease. NeuroImage, 60(3), 1880–1889.
Gamberger, D., Lavrač, N., Srivatsa, S., Tanzi, R. E., & Doraiswamy, P. M. (2017). Identification of clusters of rapid and slow decliners among subjects at risk for Alzheimer’s disease. Sci Rep, 7(1), 6763.
Garbarino, S., & Lorenzi, M. (2019). Modeling and Inference of Spatio-Temporal Protein Dynamics Across Brain Networks. IPMI 2019 - 26th International Conference on Information Processing in Medical Imaging, 11492, 57–69. https://hal.inria.fr/hal-02165021
Gauthier, S., Alam, J., Fillit, H., Iwatsubo, T., Liu-Seifert, H., Sabbagh, M., Salloway, S., Sampaio, C., Sims, J. R., Sperling, B., Sperling, R., Welsh-Bohmer, K. A., Touchon, J., Vellas, B., & Aisen, P. (2019). Combination Therapy for Alzheimer’s Disease: Perspectives of the EU/US CTAD Task Force. J Prev Alzheimers Dis, 6(3), 164–168.
Hao, W., & Friedman, A. (2016). Mathematical model on Alzheimer’s disease. BMC Syst Biol, 10(1), 108.
Henley, D., Raghavan, N., Sperling, R., Aisen, P., Raman, R., & Romano, G. (2019). Preliminary Results of a Trial of Atabecestat in Preclinical Alzheimer’s Disease. N. Engl. J. Med., 380(15), 1483–1485.
Honig, L. S., Vellas, B., Woodward, M., Boada, M., Bullock, R., Borrie, M., Hager, K., Andreasen, N., Scarpini, E., Liu-Seifert, H., Case, M., Dean, R. A., Hake, A., Sundell, K., Poole Hoffmann, V., Carlson, C., Khanna, R., Mintun, M., DeMattos, R., … Siemers, E. (2018). Trial of Solanezumab for Mild Dementia Due to Alzheimer’s Disease. N. Engl. J. Med., 378(4), 321–330.
Howard, R., & Liu, K. Y. (2020). Questions EMERGE as Biogen claims aducanumab turnaround. Nat Rev Neurol, 16(2), 63–64.
Insel, P. S., Mormino, E. C., Aisen, P. S., Thompson, W. K., & Donohue, M. C. (2020). Neuroanatomical spread of amyloid β and tau in Alzheimer’s disease: implications for primary prevention. Brain Communications, 2(1). https://doi.org/10.1093/braincomms/fcaa007
Iturria-Medina, Y., Sotero, R. C., Toussaint, P. J., Mateos-Pérez, J. M., Evans, A. C., & Initiative, A. D. N. (2016). Early role of vascular dysregulation on late-onset Alzheimer’s disease based on multifactorial data-driven analysis. Nat Commun, 7, 11934.
Iturria-Medina, Yasser, Carbonell, F. M., Sotero, R. C., Chouinard-Decorte, F., & Evans, A. C. (2017). Multifactorial causal model of brain (dis)organization and therapeutic intervention: Application to Alzheimer’s disease. NeuroImage, 152, 60–77. https://doi.org/10.1016/j.neuroimage.2017.02.058
Jack, C. R., Bennett, D. A., Blennow, K., Carrillo, M. C., Dunn, B., Haeberlein, S. B., Holtzman, D. M., Jagust, W., Jessen, F., Karlawish, J., Liu, E., Molinuevo, J. L., Montine, T., Phelps, C., Rankin, K. P., Rowe, C. C., Scheltens, P., Siemers, E., Snyder, H. M., … Silverberg, N. (2018). NIA-AA Research Framework: Toward a biological definition of Alzheimer’s disease. Alzheimers Dement, 14(4), 535–562.
Jack, C. R., & Holtzman, D. M. (2013). Biomarker modeling of Alzheimer’s disease. Neuron, 80(6), 1347–1358.
Jack, C. R., Knopman, D. S., Jagust, W. J., Petersen, R. C., Weiner, M. W., Aisen, P. S., Shaw, L. M., Vemuri, P., Wiste, H. J., Weigand, S. D., Lesnick, T. G., Pankratz, V. S., Donohue, M. C., & Trojanowski, J. Q. (2013). Tracking pathophysiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic biomarkers. Lancet Neurol, 12(2), 207–216.
Jedynak, B. M., Lang, A., Liu, B., Katz, E., Zhang, Y., Wyman, B. T., Raunig, D., Jedynak, C. P., Caffo, B., & Prince, J. L. (2012). A computational neurodegenerative disease progression score: method and results with the Alzheimer’s disease Neuroimaging Initiative cohort. NeuroImage, 63(3), 1478–1486.
Kaufman, S. K., Del Tredici, K., Thomas, T. L., Braak, H., & Diamond, M. I. (2018). Tau seeding activity begins in the transentorhinal/entorhinal regions and anticipates phospho-tau pathology in Alzheimer’s disease and PART. Acta Neuropathologica, 136(1), 57–67. https://doi.org/10.1007/s00401-018-1855-6
Kim, J., Basak, J. M., & Holtzman, D. M. (2009). The role of apolipoprotein E in Alzheimer’s disease. Neuron, 63(3), 287–303.
Klein, G., Delmar, P., Voyle, N., Rehal, S., Hofmann, C., Abi-Saab, D., Andjelkovic, M., Ristic, S., Wang, G., Bateman, R., Kerchner, G. A., Baudler, M., Fontoura, P., & Doody, R. (2019). Gantenerumab reduces amyloid-β plaques in patients with prodromal to moderate Alzheimer’s disease: a PET substudy interim analysis. Alzheimer’s Research & Therapy, 11(1), 101. https://doi.org/10.1186/s13195-019-0559-z
Kochhann, R., Varela, J. S., Lisboa, C. S. M., & Chaves, M. L. F. (2010). The Mini Mental State Examination: Review of cutoff points adjusted for schooling in a large Southern Brazilian sample. Dement Neuropsychol, 4(1), 35–41.
Koval, I., Schiratti, J.-B., Routier, A., Bacci, M., Colliot, O., Allassonnière, S., & Durrleman, S. (2018). Spatiotemporal Propagation of the Cortical Atrophy: Population and Individual Patterns. Frontiers in Neurology, 9, 235. https://doi.org/10.3389/fneur.2018.00235
Li, D., Iddi, S., Thompson, W. K., & Donohue, M. C. (2019). Bayesian latent time joint mixed effect models for multicohort longitudinal data. Stat Methods Med Res, 28(3), 835–845.
Lorenzi, M., Filippone, M., Frisoni, G. B., Alexander, D. C., & Ourselin, S. (2017). Probabilistic disease progression modeling to characterize diagnostic uncertainty: Application to staging and prediction in Alzheimer’s disease. NeuroImage. https://doi.org/10.1016/j.Neuroimage.2017.08.059
Lorenzi, M., Pennec, X., Frisoni, G. B., & Ayache, N. (2015). Disentangling normal aging from Alzheimer’s disease in structural magnetic resonance images. Neurobiology of Aging, 36, S42–S52. https://doi.org/10.1016/j.neurobiolaging.2014.07.046
Marinescu, R. V., Eshaghi, A., Lorenzi, M., Young, A. L., Oxtoby, N. P., Garbarino, S., Crutch, S. J., & Alexander, D. C. (2019). DIVE: A spatiotemporal progression model of brain pathology in neurodegenerative disorders. NeuroImage, 192, 166–177.
Murphy, M. P., & LeVine, H. (2010). Alzheimer’s disease and the amyloid-beta peptide. J. Alzheimers Dis., 19(1), 311–323.
Nader, C. A., Ayache, N., Robert, P., Lorenzi, M., & Initiative, A. D. N. (2020). Monotonic Gaussian Process for spatio-temporal disease progression modeling in brain imaging data. NeuroImage, 205. https://doi.org/10.1016/j.neuroimage.2019.116266
Oxtoby, N. P., Garbarino, S., Firth, N. C., Warren, J. D., Schott, J. M., & Alexander, D. C. (2017). Data-Driven Sequence of Changes to Anatomical Brain Connectivity in Sporadic Alzheimer’s Disease. Front Neurol, 8, 580.
Oxtoby, N. P., Young, A. L., Cash, D. M., Benzinger, T. L. S., Fagan, A. M., Morris, J. C., Bateman, R. J., Fox, N. C., Schott, J. M., & Alexander, D. C. (2018). Data-driven models of dominantly-inherited Alzheimer’s disease progression. Brain, 141(5), 1529–1544.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Petrella, J. R., Hao, W., Rao, A., & Doraiswamy, P. M. (2019). Computational Causal Modeling of the Dynamic Biomarker Cascade in Alzheimer’s Disease. Comput Math Methods Med, 2019, 6216530.
Pontecorvo, M. J., Devous, M. D., Kennedy, I., Navitsky, M., Lu, M., Galante, N., Salloway, S., Doraiswamy, P. M., Southekal, S., Arora, A. K., McGeehan, A., Lim, N. C., Xiong, H., Truocchio, S. P., Joshi, A. D., Shcherbinin, S., Teske, B., Fleisher, A. S., & Mintun, M. A. (2019). A multicentre longitudinal study of flortaucipir (18F) in normal ageing, mild cognitive impairment and Alzheimer’s disease dementia. Brain, 142(6), 1723–1735.
Prince, M. J., Wimo, A., Guerchet, M. M., Ali, G. C., Wu, Y.-T., & Prina, M. (2015). World Alzheimer Report 2015 - The Global Impact of Dementia: An analysis of prevalence, incidence, cost and trends. Alzheimer’s Disease International.
Reuter, M., Schmansky, N. J., Rosas, H. D., & Fischl, B. (2012). Within-subject template estimation for unbiased longitudinal image analysis. NeuroImage, 61(4), 1402–1418.
Rowe, C. C., Ellis, K. A., Rimajova, M., Bourgeat, P., Pike, K. E., Jones, G., Fripp, J., Tochon-Danguy, H., Morandeau, L., O’Keefe, G., Price, R., Raniga, P., Robins, P., Acosta, O., Lenzo, N., Szoeke, C., Salvado, O., Head, R., Martins, R., … Villemagne, V. L. (2010). Amyloid imaging results from the Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging. Neurobiol. Aging, 31(8), 1275–1283.
Safieh, M., Korczyn, A. D., & Michaelson, D. M. (2019). ApoE4: an emerging therapeutic target for Alzheimer’s disease. BMC Medicine, 17.
Schiratti, J.-B., Allassonnière, S., Colliot, O., & Durrleman, S. (2015). Learning spatiotemporal trajectories from manifold-valued longitudinal data. NIPS, 2404–2412.
Schuff, N., Woerner, N., Boreta, L., Kornfield, T., Shaw, L. M., Trojanowski, J. Q., Thompson, P. M., Jack, C. R., & Weiner, M. W. (2009). MRI of hippocampal volume loss in early Alzheimer’s disease in relation to ApoE genotype and biomarkers. Brain, 132(Pt 4), 1067–1077.
Schwarz, A. J., Sundell, K. L., Charil, A., Case, M. G., Jaeger, R. K., Scott, D., Bracoud, L., Oh, J., Suhy, J., Pontecorvo, M. J., Dickerson, B. C., & Siemers, E. R. (2019). Magnetic resonance imaging measures of brain atrophy from the EXPEDITION3 trial in mild Alzheimer’s disease. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 5(1), 328–337. https://doi.org/10.1016/j.trci.2019.05.007
Sivera, R., Capet, N., Manera, V., Fabre, R., Lorenzi, M., Delingette, H., Pennec, X., Ayache, N., & Robert, P. (2020). Voxel-based assessments of treatment effects on longitudinal brain changes in the Multidomain Alzheimer Preventive Trial cohort. Neurobiology of Aging, 94, 50–59. https://doi.org/10.1016/j.neurobiolaging.2019.11.020
Sperling, R. A., Jack, C. R., & Aisen, P. S. (2011). Testing the right target and right drug at the right stage. Sci Transl Med, 3(111), 111cm33.
Villemagne, V. L., Burnham, S., Bourgeat, P., Brown, B., Ellis, K. A., Salvado, O., Szoeke, C., Macaulay, S. L., Martins, R., Maruff, P., Ames, D., Rowe, C. C., & Masters, C. L. (2013). Amyloid β deposition, neurodegeneration, and cognitive decline in sporadic Alzheimer’s disease: a prospective cohort study. Lancet Neurol, 12(4), 357–367.
Wessels, A. M., Tariot, P. N., Zimmer, J. A., Selzler, K. J., Bragg, S. M., Andersen, S. W., Landry, J., Krull, J. H., Downing, A. M., Willis, B. A., Shcherbinin, S., Mullen, J., Barker, P., Schumi, J., Shering, C., Matthews, B. R., Stern, R. A., Vellas, B., Cohen, S., … Sims, J. R. (2019). Efficacy and Safety of Lanabecestat for Treatment of Early and Mild Alzheimer Disease: The AMARANTH and DAYBREAK-ALZ Randomized Clinical Trials. JAMA Neurol.
Westwood, S., Leoni, E., Hye, A., Lynham, S., Khondoker, M. R., Ashton, N. J., Kiddle, S. J., Baird, A. L., Sainz-Fuertes, R., Leung, R., Graf, J., Hehir, C. T., Baker, D., Cereda, C., Bazenet, C., Ward, M., Thambisetty, M., & Lovestone, S. (2016). Blood-Based Biomarker Candidates of Cerebral Amyloid Using PiB PET in Non-Demented Elderly. J. Alzheimers Dis., 52(2), 561–572.
Young, A. L., Oxtoby, N. P., Daga, P., Cash, D. M., Fox, N. C., Ourselin, S., Schott, J. M., & Alexander, D. C. (2014). A data-driven model of biomarker changes in sporadic Alzheimer’s disease. Brain, 137(Pt 9), 2564–2577.
Zetterberg, H., & Burnham, S. C. (2019). Blood-based molecular biomarkers for Alzheimer’s disease. Molecular Brain, 12(1), 26. https://doi.org/10.1186/s13041-019-0448-1

10_1101-2020_09_21_305516 ---- Copy-scAT: Deconvoluting single-cell chromatin accessibility of genetic subclones in cancer

Ana Nikolic1,2,3, Divya Singhal1,2,3, Katrina Ellestad1,2,3, Michael Johnston1,2,3, Aaron Gillmor1,2,3, Sorana Morrissy1,2,3, Jennifer A Chan1,2,4, Paola Neri1,4, Nizar Bahlis1,4, Marco Gallo1,2,3*

1Arnie Charbonneau Cancer Institute
2Alberta Children’s Hospital Research Institute
3Department of Biochemistry and Molecular Biology
4Department of Oncology
Cumming School of Medicine, University of Calgary, Calgary, AB, Canada

*Corresponding author: Marco Gallo, marco.gallo@ucalgary.ca

ABSTRACT

Single-cell epigenomic assays have tremendous potential to illuminate mechanisms of transcriptional control in functionally diverse cancer cell populations.
However, application of these techniques to clinical tumor specimens has been hampered by the current inability to distinguish malignant from non-malignant cells, which potently confounds data analysis and interpretation. Here we describe Copy-scAT, an R package that uses single-cell epigenomic data to infer copy number variants (CNVs) that define cancer cells. Copy-scAT enables studies of subclonal chromatin dynamics in complex tumors like glioblastoma. By deploying Copy-scAT, we uncovered potent influences of genetics on chromatin accessibility profiles in individual subclones. Consequently, some genetic subclones were predisposed to acquire stem-like or more differentiated molecular phenotypes, reminiscent of developmental paradigms. Copy-scAT is ideal for studies of the relationships between genetics and epigenetics in malignancies with high levels of intratumoral heterogeneity and to investigate how cancer cells interface with their microenvironment.

INTRODUCTION

Single-cell genomic technologies have made enormous contributions to the deconvolution of complex cellular systems, including cancer (1). Single-cell RNA sequencing (scRNA-seq) in particular has been widely employed to understand the implications of intratumoral transcriptional heterogeneity for tumor growth, response to therapy and patient prognosis (2–6). This field has hugely benefited from an emerging ecosystem of computational tools that have enabled complex analyses of scRNA data. Since copy number variants (CNVs) mostly accrue in malignant cells and are rare in non-malignant tissues, computational platforms that use scRNA data to call CNVs have resulted in improved understanding of the behavior of genetic subclones in tumors (7–9).
On the other hand, the application of single-cell epigenomic techniques, including the assay for transposase accessible chromatin (scATAC) (10, 11), to study cancer has been slowed by computational bottlenecks. For instance, unlike scRNA-seq, currently no dedicated tool exists to call CNVs using scATAC data. This technical gap has prevented scATAC studies of clinical tumor specimens, which often are surgical resections that include both malignant and non-malignant cells. Inability to deconvolute these cell populations after the generation of scATAC datasets would confound downstream analyses and interpretation of this data type.

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.21.305516; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

In this report, we describe Copy-scAT (Copy number inference using scATAC-seq data), a new computational tool that uses scATAC datasets to call CNVs at the single-cell level. Using scATAC datasets from adult glioblastoma (aGBM), pediatric GBM (pGBM) and multiple myeloma (MM), we demonstrate the effectiveness of Copy-scAT in calling (A) focal amplifications, (B) segmental gains and losses and (C) chromosome arm-level gains and losses. At the most basic level, Copy-scAT can therefore discriminate between malignant and non-malignant cells in scATAC datasets based on the presence or absence, respectively, of CNVs. This distinction is fundamental to ensure that downstream analyses include only the appropriate tumor or microenvironment cell populations.
At a more sophisticated level, we show that implementation of Copy-scAT allows investigations of the relationship between genetic and epigenetic principles governing the behavior of individual subclones. In this regard, we show that each genetic subclone has characteristic accessible chromatin profiles, indicating that genetics imparts information that determines key epigenetic features. Strong influence of genetics on chromatin states is demonstrated by the predisposition of genetic subclones to have stem-like or more differentiated molecular profiles in GBM.

RESULTS

Design and implementation of Copy-scAT

We designed Copy-scAT, an R package that uses scATAC-seq information to infer copy number alterations. Copy-scAT uses fragment files generated by cellranger-atac (10xGenomics) as input to generate chromatin accessibility pileups, keeping only barcodes with a minimum number of fragments (defaulting to 5,000 fragments). It then generates a pileup of total coverage (number of reads × read lengths) over bins of determined length (1 million bp as default) (Fig. 1a). Binned read counts then undergo linear normalization over the total signal in each cell to account for differences in read depth, and chromosomal bins which consist predominantly of zeros (at least 80% zero values) are discarded from further analysis. All parameters, including reference genome, bin size, and minimum length cut-off, are user-customizable. Copy-scAT then implements different algorithms to detect focal amplifications and larger-scale copy number variation.

To call focal amplifications (Fig. 1b), Copy-scAT generates a linear scaled profile of density over normalized 1 Mbp bins along each chromosome on a single-cell basis, centering on the median and scaling using the range.
Copy-scAT then uses changepoint analysis (12) (see Methods) to identify segments of abnormally high signal (Z score > 5) along each chromosome in each single cell. These calls are then pooled together to generate consensus regions of amplification, in order to identify putative double minutes and extrachromosomal amplifications. Each cell is scored as positive or negative for each amplified genomic region.

Segmental losses are called in a similar fashion, by calculating a quantile for each bin on a chromosome, running changepoint analysis to identify regions with abnormally low average signal, and then using Gaussian decomposition of total signal in that region to identify distinct clusters of cells.

For larger copy number alterations, Copy-scAT pools the bins further at the chromosome arm level using a trimmed mean, while normalizing the data on the basis of length of CpG islands contained in each bin (Fig. 1c). Data are then scaled for each chromosome arm, compared to a pseudodiploid control (expected signal distribution for a diploid genotype) that is modeled for each sample, and cluster assignments are generated using Gaussian decomposition. Cluster assignments are then normalized to get an estimate of copy number for each cell (Fig. 1d). These assignments can be optionally combined with clustering information to generate consensus genotypes for each cluster of cells and further filter false positives (Fig. 1e). For full details regarding the execution of Copy-scAT, see Methods.
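The focal amplification step described above (scale the binned signal of one cell along one chromosome, then flag segments with Z score > 5) can be illustrated with a minimal Python sketch. This is not the Copy-scAT implementation: the package is written in R and uses changepoint analysis, whereas the sketch swaps in a simple scan for runs of consecutive high-scoring bins. The robust Z score (median centering, scaling by 1.4826 × the median absolute deviation), the `min_bins` parameter and the function names are assumptions made for illustration only.

```python
import statistics


def robust_z(values):
    """Center on the median and scale by 1.4826 * MAD (assumed scaling)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # Flat profile: no bin can be called abnormal.
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def high_signal_segments(values, z_cutoff=5.0, min_bins=2):
    """Return (first_bin, last_bin) runs in which every bin exceeds z_cutoff.

    A simple stand-in for changepoint analysis: consecutive bins above the
    cutoff are merged into one segment, and short runs are discarded.
    """
    scores = robust_z(values)
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > z_cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_bins:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start >= min_bins:
        segments.append((start, len(scores) - 1))
    return segments


# Hypothetical normalized signal for 12 bins of one chromosome in one cell.
signal = [1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 9.5, 8.8, 1.0, 0.8, 1.1, 1.2]
segments = high_signal_segments(signal)
print(segments)
```

In this toy profile, bins 5 to 7 form the only run above the cutoff. Pooling such per-cell segments across cells into consensus regions, as the text describes, would be a separate step.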
A step-by-step tutorial for Copy-scAT is available on GitHub (see Methods).

Fig. 1. Copy-scAT workflow.
(a) Copy-scAT accepts barcode fragment matrices generated by cellranger (10xGenomics) as input.
(b) Large peaks in normalized coverage matrices can be used to infer focal CNVs.
(c) Normalized matrices can be used to infer segmental and chromosome-arm level CNVs.
(d) Example of chromosome-arm level CNV (chromosome 10p loss) called by Copy-scAT.
(e) Consensus clustering is used to finalize cell assignment.

Copy-scAT effectively calls CNVs in diverse malignancies

We have tested the ability of Copy-scAT to use scATAC data to call CNVs with three different approaches and with different tumor types. First, we benchmarked Copy-scAT against CNV calls made with whole-genome sequencing (WGS) data for adult GBM (aGBM) surgical resections (n = 4 samples, 3,647 cells). This approach consisted of isolating nuclei from flash-frozen aGBM samples, mixing nuclei in suspension, and then using these nuclei for either scATAC or WGS library construction (Fig. 2a). This was meant to ensure similar representation of genetic subclones, which are usually regionally contiguous in this solid tumor, in both scATAC and WGS libraries. Second, we benchmarked Copy-scAT against CNV calls made using pediatric GBM (pGBM) surgical resections (n = 6 patient-matched diagnostic-relapse samples, 33,695 cells).
In this case, scATAC and WGS libraries were generated from separate geographical regions of the same tumor (Fig. 2b). Third, we benchmarked Copy-scAT against CNV calls made with the single-cell CNV (scCNV) assay (10xGenomics) using multiple myeloma (MM) clinical samples (n = 10 samples, 31,266 cells). Overall, we observed that Copy-scAT correctly inferred all or most of the CNVs that were called with WGS (Figs. 2a,b; Figs. S1, S2) or scCNV data (Fig. 2c; Fig. S3). In total, we profiled 51,571 cells from 20 malignancies from 17 patients, and were able to infer CNV status for a total of 39,486 cells (Table S1). On average, we were able to call CNVs for 78.09% of cells in each sample (range: 29.16 – 91.22%) (Table S1). For chromosome-arm level CNV gains, sensitivity ranged from 0.51 for MM to 1.0 for aGBM and specificity ranged from 0.93 to 0.94 (Table S2). For chromosome arm-level losses, sensitivity ranged from 0.67 to 0.79 and specificity from 0.89 to 0.95. The sensitivity and specificity of focal amplifications were very high (>0.975, Table S2). The variation observed may reflect technical differences between the strategies used for benchmarking. As expected, the calls of Copy-scAT for aGBM were the most accurate, likely because scATAC and WGS datasets were generated from relatively homogeneous starting material, as described above. Because of its design, it is also possible that Copy-scAT is more sensitive at inferring CNVs that occur in relatively rare subclones compared to WGS, potentially explaining (in addition to genuine false positives) why the number of CNVs inferred by our new tool is sometimes higher than inferences made with WGS.
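For readers less familiar with the benchmarking metrics quoted above, the following Python sketch shows how sensitivity and specificity can be computed for chromosome-arm level gains when a reference method (for example WGS) is treated as ground truth. The arm labels and calls below are hypothetical and do not come from Table S2; the function name and input format are likewise assumptions for illustration.

```python
def sensitivity_specificity(truth, calls):
    """Compare per-arm calls with a reference.

    truth, calls: dicts mapping a chromosome arm to True when a gain is
    present. Arms missing from `calls` are treated as not called.
    Returns (sensitivity, specificity).
    """
    tp = fp = tn = fn = 0
    for arm, is_gain in truth.items():
        called = calls.get(arm, False)
        if is_gain and called:
            tp += 1
        elif is_gain and not called:
            fn += 1
        elif not is_gain and called:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example: reference calls versus inferred calls for seven arms.
reference = {"1q": True, "7p": True, "7q": True,
             "10p": False, "10q": False, "19p": False, "20q": False}
inferred = {"1q": False, "7p": True, "7q": True,
            "10p": False, "10q": False, "19p": True, "20q": False}

sens, spec = sensitivity_specificity(reference, inferred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Here two of three true gains are recovered (sensitivity 2/3) and three of four unaffected arms are correctly left uncalled (specificity 3/4).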

scATAC data can be used to distinguish malignant from non-malignant cells

Tumor cells often harbor CNVs, and we reasoned that the use of Copy-scAT should enable the use of scATAC data to infer CNVs and therefore to distinguish between malignant and non-malignant cells. To test this hypothesis, we overlaid CNVs called by Copy-scAT onto scATAC datasets displayed in uniform manifold approximation and projection (UMAP) plots. This exercise led to the identification of cells that were clearly positive for multiple CNVs and others that appeared to have a normal genome. As an illustrative example, we found that the aGBM sample CGY4349 was composed of discrete cell populations that harbored focal amplifications at the MDM4 (Fig. 2d), PDGFRA (Fig. 2e) and EGFR (Fig. 2f) loci, as well as chromosome 10p deletion (Fig. 2g) and chromosome 7 gain (Fig. 2h,i). Copy-scAT results suggest specific lineage relationships between subclones. For instance, chromosome 7 amplifications are clonal in this sample (Fig. 2h,i), whereas the chromosome 10 deletion is subclonal (Fig. 2g). In addition, our computational tool predicts that PDGFRA (Fig. 2e) and EGFR (Fig. 2f) focal amplifications are mutually exclusive, a phenomenon that has been reported in aGBM (13).

Altogether, these results identify one specific population of cells (shaded green in Fig. 2i) that harbor several CNVs and are therefore putative cancer cells. At the same time, we also identified cells (labeled in dark blue in Fig. 2i) that did not appear to have any CNVs and are therefore likely to be cells from the tumor microenvironment. Equivalent results were obtained for pGBM (Fig. S4) and MM samples (Fig. S5). Since the latter appear as multiple scATAC clusters, it is possible that our strategy detects multiple distinct non-neoplastic cell clusters.
Differential motif analysis with ChromVAR confirmed high scores for neural progenitor cell-associated motifs like NFIC and ASCL1 in CNV+ cells (Fig. 2j,k), while the putative non-neoplastic clusters showed increased occupancy at transcription factor motifs associated with hematopoietic lineages, such as IKZF1 (Fig. 2l). Another CNV- cluster showed enrichment of FOXG1 binding motifs in accessible chromatin, in keeping with a non-neoplastic neural cell identity (Fig. 2m). Using this approach, it was possible to discriminate between malignant cells and cells from the tumor microenvironment in all tumor samples analyzed (Extended Figs. S6-S8). Copy-scAT therefore effectively uses scATAC data to infer CNVs, which can then be used to distinguish malignant from non-malignant cells and to infer lineage relationships between genetic subclones that coexist in a tumor.

Fig. 2. Benchmarking of Copy-scAT with three methods involving clinical samples from three distinct malignancies.
(a) Banked frozen aGBM samples were used for both scATAC and WGS. Nuclei were isolated from the samples, mixed, and used for both scATAC and WGS. Number of chromosome-arm level gains detected in adult GBM samples identified using both methods, versus total numbers of gains detected by scATAC or WGS.
(b) Surgical pGBM resections were split, and one section was used for scATAC and the other for WGS. Number of chromosome-arm level gains detected in pGBM samples identified using both methods, versus total numbers of gains detected by scATAC or WGS.
(c) Multiple myeloma samples were profiled by both scATAC and the single-cell CNV assay. Number of chromosome-arm level gains detected in MM samples identified using both methods, versus total numbers of gains detected by scATAC or scCNV assay.
(d) MDM4 amplification in an adult GBM sample (CGY4349). Amplified cells are coloured dark blue, and normal cells in pale blue.
(e) PDGFRA amplification in an adult GBM sample (CGY4349). Amplified cells are coloured dark blue, and normal cells in pale blue.
(f) EGFR amplification in an adult GBM sample (CGY4349). Amplified cells are coloured dark blue, and normal cells in pale blue.
(g) Chromosome 10p loss in an adult GBM sample.
(h) Chromosome 7p gain in an adult GBM sample.
(i) Chromosome 7q gain in an adult GBM sample.
(j) ChromVAR activity score for ASCL1.
(k) ChromVAR activity score for NFIC.
(l) ChromVAR activity score for IKZF1.
(m) ChromVAR activity score for FOXG1.

Subclonal genetics shapes chromatin accessibility profiles in aGBM

We noticed that in most tumors we analyzed, cells harboring a given CNV had a tendency to cluster together (Fig. 2d-i). Individual clusters were in fact defined by the presence of specific CNVs (Fig. 3a-c). This was an unexpected observation, because it is widely assumed that clustering of scATAC data reflects the global patterns of chromatin accessibility.
One possible explanation for this observation could be that chromosomal regions affected by a CNV display imbalances in the fragment depth distribution of scATAC datasets, and that these patterns have a dominant effect on cluster assignment. Most scATAC-seq workflows rely on some variant of term-frequency inverse document frequency (TF-IDF) normalization rather than feature scaling, and this may amplify the effects of CNV-driven DNA content imbalances. For instance, it is possible that focal amplifications of the PDGFRA locus result in increased frequency of transposition events that are mapped to this site. A dominant effect of chromatin accessibility at this amplified locus could result in PDGFRA-amplified cells clustering together in UMAP representations of scATAC data (Fig. 3d,e). Indeed, we found that compared to a random selection of peaks, the chromosomes which carried CNVs had significantly different numbers of peaks ranked as highly variant than chromosomes that did not have CNVs, leading to a markedly uneven distribution of top peaks (p < 2.2E-16; Chi-squared test; Fig. S9a). This was not seen in non-neoplastic cells, which had relatively even top fragment distribution patterns (p = 0.05472, Chi-squared test; Fig. S9b). To test this hypothesis, we used Copy-scAT to call CNVs in our tumor samples, then removed all peaks mapping to chromosomes predicted to harbor CNVs, and finally re-clustered all cells in each sample (Fig. 3f). We found that although removing chromosomes with CNVs from our analyses changed the overall cluster structure of a sample (Fig. 3g), PDGFRA-amplified cells still clustered close to each other (Fig. 3h). In fact, our results indicate that clustering after CNV removal is more granular but overall very stable (Fig. 3i). In this case, PDGFRA-amplified cells localized to a single cluster before removing chromosomes affected by CNVs.
Following removal of CNV+ chromosomes and re-clustering, most PDGFRA-amplified cells still clustered together, with only a few cells merging into a cluster that included both amplified and non-amplified cells. Comparing the most variable peaks after chromosome CNV removal showed a distribution closer to normal, supporting the marked effect of the CNVs on the identification of variant peaks (p = 2.418E-8; Fig. S9c). Contrary to current views of cancer epigenomics, these data indicate that genetic subclones may have characteristic patterns of chromatin accessibility, and that a cell’s genetic background has significant influence on its likelihood of attaining specific epigenetic states.

Fig. 3. Subclonal genetics influences clustering of scATAC-seq data.
(a-c) CNVs in adult GBM CGY4218 segregate within specific scATAC clusters.
(d, e) PDGFRA-amplified cells cluster together in adult GBM CGY4349.
(f) Diagram summarizing our strategy to remove CNVs from clustering of scATAC data. All chromosomes or regions with putative CNVs were removed from downstream analyses, and cells were re-clustered.
(g) Reclustering of (d) following removal of chromosomes and regions affected by CNVs in CGY4349.
(h) Distribution of PDGFRA-amplified cells following re-clustering.
(i) Cluster assignments of cells in CGY4349 (aGBM specimen) before and after removal of CNV-containing regions (purple: PDGFRA-amplified cells).

Genetic events predispose subclones to the acquisition of developmental chromatin states

We further explored the notion that CNVs may shape chromatin accessibility profiles and its possible implications for cell fate determination. As an illustrative example, we focused on an aGBM sample (CGY4218) where CNVs at chromosome 1p characterized three genetic subclones, as determined with Copy-scAT: (i) a subclone with two copies of chromosome 1p; (ii) a subclone with loss of 1p; (iii) a subclone with gain of 1p (Fig. 4a).

We were interested in determining whether the major genetic subclones in this tumor had similar cycling properties. Unlike with scRNA-seq, we found it is not possible to use scATAC profiles at cell cycle genes to determine whether a cell is proliferating. We reasoned that cells that are actively going through cell division have to replicate their DNA. Given that cancer cells have numerous CNVs on autosomes, which could lead to noisy data, we decided to use Copy-scAT to identify cells that have doubled the number of their X chromosomes and defined them as actively cycling cells. To validate this approach, we determined the number of cells with double the number of expected X chromosomes (i.e. putative cycling cells) in previously published scATAC datasets for mouse brain and peripheral blood mononuclear cells (PBMCs). We hypothesized that we should be able to identify cycling cells in fetal mouse brain, but not in PBMCs.
In fact, we detected numerous cycling cells (with twice the expected number of X chromosomes) in brain tissue but not in PBMCs (Fig. S10). This method detected putative cycling cells in our datasets (Fig. 4b). We used scATAC data to arrange cells from this tumor along pseudotime with the package STREAM (14) (Fig. 4c) and then superimposed cell cycle status determined with our X chromosome doubling method (Fig. 4d). The results show that cells along branch 2, which is strongly enriched for cells with chromosome 1p gains, are also the most proliferative (Fig. 4e), with over 25% of the cells actively going through replication (P = 7.776 × 10⁻¹⁴; Chi-square test). On the other hand, ~5% of cells along branch 1 and ~15% of cells along branch 3 were cycling. These data therefore indicate functional differences between cells with gain or loss of chromosome 1p.

We then used ChromVAR (15) and STREAM-ATAC to calculate scores for transcription factor (TF) binding motifs that are associated with neurodevelopmental processes. This analysis revealed that motifs bound by TFs that are associated with stem-like phenotypes, including OLIG2 and HOXA2, are enriched in accessible chromatin regions in cells that have one copy of chromosome 1p (Fig. 4f). Motifs bound by TFs associated with progenitor (Fig. 4g) and differentiated states (Fig. 4h) were enriched in the branch with more cells showing gain of chromosome 1p. This was associated with a significant shift in the overall distribution of enrichment of these motifs in cells along the different branches of the trajectory (Fig. 4i-k). A distribution of genetic subclones along developmental chromatin accessibility states was observed in other tumor samples we studied (Fig. S11-S13). Overall, the data support the notion that tumor cells sample a discrete number of chromatin states, but their transition probabilities differ based on genotype.
Consequently, chromatin states associated with each genetic subclone manifest as different functional properties, here demonstrated at the level of cell proliferation and stemness profiles.

Fig. 4. Subclonal genetic alterations predispose cells to adopt developmental chromatin states.
(a) Cells were clustered based on scATAC ChromVAR motif scores, then shaded based on the presence of 1, 2 or 3 copies of chromosome 1p.
(b) Cells were shaded based on their predicted cycling properties.
(c) Data shown in (a) projected onto pseudotime. The resulting three branches are populated preferentially by cells with gain or loss of chromosome 1p respectively.
(d) Proliferation status as shown in (b), overlaid onto pseudotime.
(e) Branches enriched for 1p gain show greater proportions of proliferative cells (statistics: Chi-squared test).
(f) Scaled chromatin accessibility at binding motifs for OLIG2 and HOXA2, two TFs associated with stemness.
(g) Scaled chromatin accessibility at binding motifs for RFX2 and NFIX, two TFs associated with progenitor-like phenotypes.
(h) Scaled chromatin accessibility at binding motifs for RARA::RXRA and STAT3, two TFs associated with differentiated phenotypes.
(i) Enrichment plot for motif Z scores for OLIG2 and HOXA2.
(j) Enrichment plot for motif Z scores for RFX2 and NFIX.
(k) Enrichment plot for motif Z scores for RARA::RXRA and STAT3. P values calculated by Kruskal-Wallis test.

DISCUSSION

Here we describe Copy-scAT, the first computational tool dedicated to inferring CNVs using scATAC data. Copy-scAT resolves a computational bottleneck that has restricted the application of single-cell epigenomic techniques to the study of clinical tumor samples, which are often mixtures of malignant and non-malignant cells. The presence of non-malignant cells can severely confound the analyses of these samples and downstream data interpretation. Cell admixture is a particular problem for scATAC data because of the inherent sparsity of these datasets and because they do not provide direct information on the expression status of cell lineage markers that could be used to resolve cellular identities. Because most tumor types harbor CNVs, Copy-scAT provides a simple way of solving this problem.
It is important to note that Copy-scAT enables users to perform analyses on both malignant and non-malignant cells from a tumor sample, because cell barcodes associated with either presence or absence of CNVs can be selected for downstream analyses. Implementation of Copy-scAT will therefore be beneficial to groups interested in defining the epigenomes of both tumor cells and their microenvironment. Because chromatin accessibility datasets provide information on mechanisms of transcriptional regulation by distal and proximal enhancer and super enhancer elements, Copy-scAT could be useful in clarifying epigenetic mechanisms involved in immune suppression and T cell exhaustion, for instance. Copy-scAT also allows scATAC studies of frozen banked cancer specimens (see Methods), because it requires no prior knowledge of cell composition.

We show that the underlying CNV architecture plays a significant role in clustering of scATAC data, a problem that is amplified by the use of TF-IDF algorithms for normalization. These effects are less pronounced when clustering is based on motif activity scores (e.g. ChromVAR), likely because this incorporates data from multiple chromosomes, thus dampening the effect of variation at any one specific locus. Further studies are needed to identify the optimal way to address the effects of CNVs in downstream analyses, as they may present a significant confounder and potentially mask significant biological relationships.

In this report, we provide evidence that Copy-scAT can be used to shed new light on how genetics and epigenetics interface in cancer. We show that genetic subclones tend to have unique chromatin accessibility landscapes that can promote or antagonize stem-like phenotypes. Consequently, we report that some genetic subclones have greater proportions of stem-like cells, and others appear more differentiated.
These results offer a radically different view of functional hierarchies in GBM, where stem-like properties were thought to be programmed by epigenetic factors, independently of genotype. These findings provide a simple explanation for the observed intra-tumoral transcriptional heterogeneity in GBM (5, 16), by suggesting that each genetic subclone achieves specific chromatin accessibility profiles, which in turn result in subclone-specific transcriptional outcomes.

Copy-scAT will enable future studies of subclonal chromatin dynamics in complex tumor types and may be an important tool in better understanding the functional relationships between subclones, their microenvironment and therapy response.

MATERIALS AND METHODS

Ethics and consent statement

All samples were collected and used for research with appropriate informed consent and with approval by the Health Research Ethics Board of Alberta.

scATAC-seq sample processing

GBM samples were either frozen surgical resections (pediatric GBM) or cells dissociated from fresh surgical specimens and cryopreserved (adult GBM). Samples were dissociated in a 1.5 mL microcentrifuge tube, using a wide-bore P1000 pipette followed by a narrow-bore P1000 pipette in nuclear resuspension buffer (10 mM Tris-HCl; 10 mM NaCl; 3 mM MgCl2; 0.1% IGEPAL, 0.1% Tween-20, 0.01% Digitonin, 1% BSA in PBS), then vortexed briefly, chilled on ice for 10 minutes, then pipetted again, and spun at 4°C, 500 g for 5 minutes.
This step was repeated, and the sample was then resuspended in Tween wash buffer (10 mM Tris-HCl; 10 mM NaCl; 3 mM MgCl2; 0.1% IGEPAL, 0.1% Tween-20; 1% BSA in PBS), then strained through a 35 μm cell strainer FACS tube (Fisher Scientific 08-771-23) to remove debris. Nuclei were then quantified by trypan blue on the Countess II (Invitrogen), spun down at 500 g at 4°C for 5 minutes, resuspended in the nuclear isolation buffer (10X Genomics), and the rest of the scATAC procedure was performed as per the 10X Genomics protocol. MM samples were from bone marrow aspirates collected from patients; tumor cells were isolated from mononuclear cell fractions through Ficoll gradients coupled with magnetic bead sorting of CD138+ cells. scATAC libraries were prepared from GBM and MM samples using a Chromium controller (10xGenomics). Libraries were sequenced on NextSeq 500 or NovaSeq 6000 instruments (Illumina) at the Centre for Health Genomics and Informatics (CHGI; University of Calgary) using the recommended settings.

scATAC-seq initial data analysis

The raw sequencing data was demultiplexed using cellranger-atac mkfastq (Cell Ranger ATAC, version 1.1.0, 10x Genomics). Single cell ATAC-seq reads were aligned to the hg38 reference genome (GRCh38, version 1.1.0, 10x Genomics) and quantified using the cellranger-atac count function with default parameters (Cell Ranger ATAC, version 1.1.0, 10x Genomics).

Single-cell CNV analysis

Fragment pileup and normalization

The fragment file was processed and signal was binned into bins of a preset size (default 1 Mb) across the hg38 chromosomes to generate a genome-wide read-depth map. Only barcodes with a minimum of 5000 reads were retained, in order to remove spurious barcodes.
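The binning and barcode-filtering step just described can be sketched in Python as follows. This is an illustrative approximation rather than the Copy-scAT code (which is an R package): the tuple-based input format, the function name `bin_fragments`, and the choice to split a fragment's covered bases across bin boundaries are assumptions; only the 1 Mb bin size and the 5,000-fragment minimum are taken from the text.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000   # default bin width stated in the text (1 Mb)
MIN_FRAGMENTS = 5000   # default per-barcode minimum stated in the text


def bin_fragments(fragments, bin_size=BIN_SIZE, min_fragments=MIN_FRAGMENTS):
    """Pile fragment coverage into fixed-width bins for each barcode.

    fragments: iterable of (chrom, start, end, barcode) tuples with
    0-based, half-open coordinates (hypothetical input format).
    Returns {barcode: {(chrom, bin_index): covered_bases}} restricted to
    barcodes with at least min_fragments fragments.
    """
    fragment_counts = defaultdict(int)
    pileup = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, barcode in fragments:
        fragment_counts[barcode] += 1
        pos = start
        while pos < end:
            bin_index = pos // bin_size
            bin_end = (bin_index + 1) * bin_size
            covered = min(end, bin_end) - pos  # bases falling in this bin
            pileup[barcode][(chrom, bin_index)] += covered
            pos += covered
    return {
        barcode: dict(bins)
        for barcode, bins in pileup.items()
        if fragment_counts[barcode] >= min_fragments
    }


# Toy example with a lowered threshold so that one barcode passes the filter.
toy_fragments = [
    ("chr1", 999_900, 1_000_100, "AAAC"),  # spans the bin 0 / bin 1 boundary
    ("chr1", 500, 700, "AAAC"),
    ("chr1", 100, 300, "TTTG"),            # only one fragment: filtered out
]
toy_pileup = bin_fragments(toy_fragments, min_fragments=2)
print(toy_pileup)
```

With real data the default threshold would apply, and the resulting per-barcode bins would feed into the cleaning and normalization steps that follow.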
This flattened barcode-fragment matrix pileup 359 was cleaned by removal of genomic intervals which were uninformative (greater than 80% zeros) and 360 barcodes with greater than a certain number of zero intervals. Cells passing this first filter were normalized 361 with counts-per-million normalization using cpm in the edgeR package (17). 362 363 Chromosome arm CNV analysis 364 The normalized barcode-fragment matrix was collapsed to the chromosome arm level, using chromosome 365 arm information from the UCSC (UCSC table: cytoBand), centromeres were removed, and signal in each 366 bin was normalized using the number of basepairs in CpG islands in the interval using the UCSC CpG islands 367 table (UCSC table: cpgIslandExtUnmasked). The signal was then summarized using a quantile-trimmed-368 mean (between the 50th and 80th quantiles). Only chromosome arms with a minimum trimmed mean signal 369 were kept for analysis. 370 The chromosome arm signal matrix is mixed with a generated set proportion of pseudodiploid control cells, 371 defined using the mean of chromosome segment medians with a defined standard deviation. This cell-signal 372 matrix is then scaled across each chromosome arm and centered on the median signal of all chromosomes. 373 Each chromosome arm segment is then analyzed using Gaussian decomposition with Mclust (18). The 374 subsequent clusters are filtered based on Z scores and mixing proportions, and redundant clusters are 375 combined. These Z scores are then translated into estimated copy numbers for each segment for each 376 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. 
barcode. The barcode CNV assignments can optionally be used to assign consensus CNVs to clusters generated in other software packages such as Loupe or Seurat/Signac.

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.21.305516; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Detection of amplifications
The normalized barcode-fragment matrix was scaled, and mean-variance changepoint analysis using the Changepoint package (19) was performed for each cell and each chromosome to identify areas of abnormally high signal (Z score greater than 5). The consensus coordinates of each amplification region were generated across all cells, and only abnormalities affecting a minimum number of cells were kept for analysis.

Detection of loss of heterozygosity
The normalized barcode-fragment matrix was scaled as above. As overall coverage levels in these samples are quite sparse, a chromosome-wide coverage profile was generated for the entire sample in bulk, using the 30% quantile as a cut-off, and then changepoint analysis was used to find inflection points. This was followed by Gaussian decomposition of the values using Mclust to identify putative areas of loss or gain, thresholded by a minimum difference in signal between the clusters identified by Mclust.

scATAC trajectory analysis
STREAM-ATAC and STREAM (20) were used to generate pseudotime trajectories based on motif occupancy profiles generated using ChromVAR (21) with the JASPAR 2018 motif database as reference (22). Dimensionality reduction was performed using the top 20 components and 50 neighbours, and an initial elastic graph was generated on the 2D UMAP projection using 10 clusters, using the kmeans method with n_neighbours = 30.
An elastic principal graph was constructed using the parameters epg_alpha = 0.02, epg_mu = 0.05, epg_lambda = 0.02 and epg_trimmingradius = 1.2, with branch extension using ‘QuantDists’. Trees were rooted using the branch with the highest motif activities for OLIG2 and ETV motifs.

Whole genome sequencing
DNA was extracted from residual nuclei from the same samples and tissue fragments used for scATAC-seq of adult GBM samples, using the Qiagen DNeasy Blood and Tissue DNA extraction kit (Qiagen #69504). Libraries were prepared using the NEBNext Ultra II DNA Library Prep Kit (#E7645) and sequenced on the NovaSeq 6000 (Illumina) at the CHGI (University of Calgary) in paired-end mode.

Whole genome data processing
Genome data were aligned to the hg38 assembly using bwa mem (bwa 0.7.17) (23). SAMtools was used to extract high-quality reads (Q > 30) and Picard tools (Broad Institute) was used to remove duplicates (24).

Whole genome SNV and CNV detection
GATK Mutect2 (Broad Institute) was run on the filtered data to detect SNVs with low stringency using the following settings: --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. CNVkit was subsequently used to call copy number variants using the following parameters: --filter cn -m clonal --purity 0.7 (25). Adjacent segments were further combined and averaged using bedtools (26).
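The last step above, combining and averaging adjacent segments, was done with bedtools, and the exact command is not given in the text. The sketch below only illustrates the general idea in plain Python. Every specific in it is an assumption made for illustration: segments are merged when they lie on the same chromosome, carry the same integer copy number, and are separated by at most `max_gap` bases, and their log2 ratios are averaged weighted by segment length. The names `Segment` and `merge_adjacent` are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    log2: float  # log2 copy ratio of the segment
    cn: int      # integer copy number call


def merge_adjacent(segments: List[Segment], max_gap: int = 0) -> List[Segment]:
    """Merge neighbouring segments with equal copy number (assumed criterion)."""
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start)):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.chrom == seg.chrom
            and prev.cn == seg.cn
            and seg.start - prev.end <= max_gap
        ):
            # Length-weighted mean of the two log2 ratios (an assumption;
            # exact only when the merged segments do not overlap or leave gaps).
            len_prev = prev.end - prev.start
            len_seg = seg.end - seg.start
            mean_log2 = (prev.log2 * len_prev + seg.log2 * len_seg) / (len_prev + len_seg)
            merged[-1] = Segment(prev.chrom, prev.start, max(prev.end, seg.end), mean_log2, prev.cn)
        else:
            merged.append(seg)
    return merged


if __name__ == "__main__":
    # Hypothetical segments: the first two share copy number 4 and touch.
    calls = [
        Segment("chr7", 0, 100, 1.0, 4),
        Segment("chr7", 100, 300, 0.4, 4),
        Segment("chr7", 300, 400, 0.0, 2),
    ]
    for seg in merge_adjacent(calls):
        print(seg)
```

Weighting by length keeps a short, noisy segment from dominating the averaged value; an unweighted mean would be an equally defensible reading of "averaged".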
Data visualization and clustering
Data were visualized and UMAP plots were generated using Seurat 3.0.0 and Signac 1.0.0 (27) and Cell Loupe version 4.0.0 (28).

Statistical analysis
Between-group differences in discrete values (e.g. chromosome peaks, branch assignments) were calculated using the Chi-squared test. Differences in non-parametric distributions (motif accessibility in clusters) were quantified using the Kruskal-Wallis test.

References
1. B. Lim, Y. Lin, N. Navin, Advancing Cancer Research and Medicine with Single-Cell Genomics. Cancer Cell. 37, 456–470 (2020).
2. A. P. Patel, I. Tirosh, J. J. Trombetta, A. K. Shalek, S. M. Gillespie, H. Wakimoto, D. P. Cahill, B. V. Nahed, W. T. Curry, R. L. Martuza, D. N. Louis, O. Rozenblatt-Rosen, M. L. Suvà, A. Regev, B. E. Bernstein, Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science (2014), doi:10.1126/science.1254257.
3. S. Darmanis, S. A. Sloan, D. Croote, M. Mignardi, S. Chernikova, P. Samghababi, Y. Zhang, N. Neff, M. Kowarsky, C. Caneda, G. Li, S. D. Chang, I. D. Connolly, Y. Li, B. A. Barres, M. H. Gephart, S. R. Quake, Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Rep., 1399–1410 (2017).
4. J. Gojo, B. Englinger, L. Jiang, J. M. Hübner, M. L. Shaw, O. A. Hack, S. Madlener, D. Kirchhofer, I. Liu, J. Pyrdol, V. Hovestadt, E. Mazzola, N. D. Mathewson, M. Trissal, D. Lötsch, C. Dorfer, C. Haberler, A. Halfmann, L. Mayr, A. Peyrl, R. Geyeregger, B. Schwalm, M. Mauermann, K. W. Pajtler, T. Milde, M. E. Shore, J. E. Geduldig, K. Pelton, T. Czech, O. Ashenberg, K. W. Wucherpfennig, O. Rozenblatt-Rosen, S. Alexandrescu, K. L. Ligon, S. M.
Pfister, A. Regev, I. Slavc, W. Berger, M. L. Suvà, M. Kool, M. G. Filbin, Single-Cell RNA-Seq Reveals Cellular Hierarchies and Impaired Developmental Trajectories in Pediatric Ependymoma. Cancer Cell. 38, 44–59 (2020).
5. C. Neftel, J. Laffy, M. G. Filbin, T. Hara, M. E. Shore, G. J. Rahme, A. R. Richman, D. Silverbush, M. L. Shaw, C. M. Hebert, J. Dewitt, S. Gritsch, E. M. Perez, L. N. Gonzalez Castro, X. Lan, N. Druck, C. Rodman, D. Dionne, A. Kaplan, M. S. Bertalan, J. Small, K. Pelton, S. Becker, D. Bonal, Q.-D. Nguyen, R. L. Servis, J. M. Fung, R. Mylvaganam, L. Mayr, J. Gojo, C. Haberler, R. Geyeregger, T. Czech, I. Slavc, B. V. Nahed, W. T. Curry, B. S. Carter, H. Wakimoto, P. K. Brastianos, T. T. Batchelor, A. Stemmer-Rachamimov, M. Martinez-Lage, M. P. Frosch, I. Stamenkovic, N. Riggi, E. Rheinbay, M. Monje, O. Rozenblatt-Rosen, D. P. Cahill, A. P. Patel, T. Hunter, I. M. Verma, K. L. Ligon, D. N. Louis, A. Regev, B. E. Bernstein, I. Tirosh, M. L. Suvà, An Integrative Model of Cellular States, Plasticity, and Genetics for Glioblastoma. Cell. 178, 835–849.e21 (2019).
6. M. C. Vladoiu, I. El-Hamamy, L. K. Donovan, H. Farooq, B. L. Holgado, Y. Sundaravadanam, V. Ramaswamy, L. D. Hendrikse, S. Kumar, S. C. Mack, J. J. Y. Lee, V. Fong, K. Juraschka, D. Przelicki, A. Michealraj, P. Skowron, B. Luu, H. Suzuki, A. S. Morrissy, F. M. G. Cavalli, L. Garzia, C. Daniels, X. Wu, M. A. Qazi, S. K. Singh, J. A. Chan, M. A. Marra, D. Malkin, P. Dirks, L. Heisler, T. Pugh, K. Ng, F. Notta, E. M. Thompson, C. L. Kleinman, A. L. Joyner, N. Jabado, L. Stein, M. D. Taylor, Childhood cerebellar tumours mirror conserved fetal transcriptional programs. Nature.
572, 67–73 (2019).
7. I. Tirosh, A. S. Venteicher, C. Hebert, L. E. Escalante, A. P. Patel, K. Yizhak, J. M. Fisher, C. Rodman, C. Mount, M. G. Filbin, C. Neftel, N. Desai, J. Nyman, B. Izar, C. C. Luo, J. M. Francis, A. A. Patel, M. L. Onozato, N. Riggi, K. J. Livak, D. Gennert, R. Satija, B. V. Nahed, W. T. Curry, R. L. Martuza, R. Mylvaganam, A. J. Iafrate, M. P. Frosch, T. R. Golub, M. N. Rivera, G. Getz, O. Rozenblatt-Rosen, D. P. Cahill, M. Monje, B. E. Bernstein, D. N. Louis, A. Regev, M. L. Suvà, Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature. 539, 309–313 (2016).
8. A. S. Venteicher, I. Tirosh, C. Hebert, K. Yizhak, C. Neftel, M. G. Filbin, V. Hovestadt, L. E. Escalante, M. L. Shaw, C. Rodman, S. M. Gillespie, D. Dionne, C. C. Luo, H. Ravichandran, R. Mylvaganam, C. Mount, M. L. Onozato, B. V. Nahed, H. Wakimoto, W. T. Curry, A. J. Iafrate, M. N. Rivera, M. P. Frosch, T. R. Golub, P. K. Brastianos, G. Getz, A. P. Patel, M. Monje, D. P. Cahill, O. Rozenblatt-Rosen, D. N. Louis, B. E. Bernstein, A. Regev, M. L. Suvà, Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science. 355 (2017), doi:10.1126/science.aai8478.
9. S. Müller, A. Cho, S. J. Liu, D. A. Lim, A. Diaz, CONICS integrates scRNA-seq with DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics (2018), doi:10.1093/bioinformatics/bty316.
10. J. D. Buenrostro, P. G. Giresi, L. C. Zaba, H. Y. Chang, W. J. Greenleaf, Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat.
Methods (2013), doi:10.1038/nmeth.2688.
11. J. D. Buenrostro, B. Wu, U. M. Litzenburger, D. Ruff, M. L. Gonzales, M. P. Snyder, H. Y. Chang, W. J. Greenleaf, Single-cell chromatin accessibility reveals principles of regulatory variation. Nature (2015), doi:10.1038/nature14590.
12. R. Killick, I. A. Eckley, Changepoint: An R package for changepoint analysis. J. Stat. Softw. (2014), doi:10.18637/jss.v058.i03.
13. M. Snuderl, L. Fazlollahi, L. P. Le, M. Nitta, B. H. Zhelyazkova, C. J. Davidson, S. Akhavanfard, D. P. Cahill, K. D. Aldape, R. A. Betensky, D. N. Louis, A. J. Iafrate, Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell (2011), doi:10.1016/j.ccr.2011.11.005.
14. H. Chen, L. Albergante, J. Y. Hsu, C. A. Lareau, G. Lo Bosco, J. Guan, S. Zhou, A. N. Gorban, D. E. Bauer, M. J. Aryee, D. M. Langenau, A. Zinovyev, J. D. Buenrostro, G. C. Yuan, L. Pinello, Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. 10, 1903 (2019).
15. A. N. Schep, B. Wu, J. D. Buenrostro, W. J. Greenleaf, ChromVAR: Inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods. 14, 975–978 (2017).
16. A. P. Patel, I. Tirosh, J. J. Trombetta, A. K. Shalek, S. M. Gillespie, H. Wakimoto, D. P. Cahill, B. V. Nahed, W. T. Curry, R. L. Martuza, D. N. Louis, O. Rozenblatt-Rosen, M. L. Suvà, A. Regev, B. E. Bernstein, Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 344, 1396–1401 (2014).
17. M. D. Robinson, D. J. McCarthy, G. K. Smyth, edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26, 139–140 (2010).
18. L. Scrucca, M. Fop, T. B. Murphy, A. E. Raftery, Mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J.
8, 289–317 (2016).
19. R. Killick, I. A. Eckley, Changepoint: An R package for changepoint analysis. J. Stat. Softw. 58 (2014), doi:10.18637/jss.v058.i03.
20. H. Chen, L. Albergante, J. Y. Hsu, C. A. Lareau, G. Lo Bosco, J. Guan, S. Zhou, A. N. Gorban, D. E. Bauer, M. J. Aryee, D. M. Langenau, A. Zinovyev, J. D. Buenrostro, G. C. Yuan, L. Pinello, Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. (2019), doi:10.1038/s41467-019-09670-4.
21. A. N. Schep, B. Wu, J. D. Buenrostro, W. J. Greenleaf, ChromVAR: Inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods (2017), doi:10.1038/nmeth.4401.
22. A. Khan, O. Fornes, A. Stigliani, M. Gheorghe, J. A. Castro-Mondragon, R. Van Der Lee, A. Bessy, J. Chèneby, S. R. Kulkarni, G. Tan, D. Baranasic, D. J. Arenillas, A. Sandelin, K. Vandepoele, B. Lenhard, B. Ballester, W. W. Wasserman, F. Parcy, A. Mathelier, JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. (2018), doi:10.1093/nar/gkx1126.
23. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25, 1754–60 (2009).
24. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25, 2078–2079 (2009).
25. E. Talevich, A. H. Shain, T.
Botton, B. C. Bastian, CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput. Biol. 12 (2016), doi:10.1371/journal.pcbi.1004873.
26. A. R. Quinlan, I. M. Hall, BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 841–2 (2010).
27. T. Stuart, A. Srivastava, C. Lareau, R. Satija, bioRxiv, in press, doi:10.1101/2020.11.09.373613.
28. A. Butler, P. Hoffman, P. Smibert, E. Papalexi, R. Satija, Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

ACKNOWLEDGMENTS
Funding: A Canada Research Chair in Brain Cancer Epigenomics (tier 2) from the Government of Canada, Project grants from the Canadian Institutes of Health Research (CIHR; PJT-156278, PJT-173475), a Discovery grant from the Natural Sciences and Engineering Research Council (NSERC) and an Azrieli Future Leader in Canadian Brain Research grant to MG; a Clinician Investigator Program fellowship from Alberta Health Services and a fellowship from Alberta Innovates to AN; an Eyes High Scholarship from the University of Calgary to DS; a Clark Smith postdoctoral fellowship and a CIHR postdoctoral fellowship to MJ; a Canada Research Chair in Precision Oncology (tier 2) and a CIHR grant (PJT-438802) to SM; an Alberta Graduate Excellence Scholarship and Alberta Innovates scholarship to AG. This project has been made possible by the Brain Canada Foundation through the Canada Brain Research Fund, with the financial support of Health Canada and the Azrieli Foundation.

Author contributions: Conception and experimental design: AN, MG. Generation of datasets: AN, KE, JC, PN, NB. Data acquisition and analysis: AN, DS, MJ, AG, SM, NB, MG. Data interpretation and creation of new software: AN, MG. Manuscript preparation: All co-authors.
Competing interests: The authors declare no competing interests.

Data and materials availability: The Copy-scAT package and a sample tutorial are available on GitHub at http://github.com/spcdot/CopyscAT. All datasets will be made available upon publication in a peer-reviewed journal.

SUPPLEMENTAL MATERIAL

Table S1.
Summary of samples and cells profiled by Copy-scAT

Sample                Unique barcodes after pileup   Unique barcodes after filtering   Percent passing filters
CGY4218               1542                           1335                              86.58%
CGY4250               1371                           947                               69.07%
CGY4275               1004                           609                               60.66%
CGY4349               961                            756                               78.67%
pCGY2932              1203                           802                               66.67%
pCGY2937              1445                           1200                              83.04%
pCGY3103              2189                           956                               43.67%
pCGY3402              3162                           2318                              73.31%
pCGY3749              2774                           2503                              90.23%
pCGY4021              1963                           1382                              70.40%
MM1217                890                            792                               88.99%
MM1388                2538                           2219                              87.43%
MM1389                7438                           6607                              88.83%
MM1438                1774                           1578                              88.95%
MM1460                2048                           1683                              82.18%
MM1479                5135                           4564                              88.88%
MM1498                7408                           2160                              29.16%
MM1555                7220                           6586                              91.22%
MM1643                3141                           2794                              88.95%
MM1698                2844                           2283                              80.27%
Total cells profiled  58050                          44074                             76.86%

Table S2. Sensitivity and Specificity of Copy-scAT in aGBM, pGBM and MM samples

                 Gains                       Losses                      Amplifications
Samples          Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
aGBM (n = 3)     1.0           0.94          0.79          0.89          1.0           1.0
pGBM (n = 6)     0.73          0.93          0.73          0.95          N/A           0.975
MM (n = 10)      0.51          0.94          0.67          0.89          N/A           N/A
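Table S2 reports sensitivity and specificity without restating how they are computed. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), with the WGS or scCNV calls taken as the reference; that choice of reference is an assumption based on the comparisons described in the supplementary figures. The counts in the usage example are made up for illustration and are not taken from the study.

```python
from typing import Optional


def sensitivity(tp: int, fn: int) -> Optional[float]:
    """True positive rate, TP / (TP + FN); None if there are no reference positives."""
    total = tp + fn
    return tp / total if total else None


def specificity(tn: int, fp: int) -> Optional[float]:
    """True negative rate, TN / (TN + FP); None if there are no reference negatives."""
    total = tn + fp
    return tn / total if total else None


if __name__ == "__main__":
    # Hypothetical counts: 8 of 10 reference gains recovered,
    # 45 of 50 unaltered chromosome arms correctly called as unaltered.
    print(sensitivity(tp=8, fn=2))   # 0.8
    print(specificity(tn=45, fp=5))  # 0.9
```

Returning `None` when the denominator is zero avoids a division error for a category in which the reference contains no events of that type.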
Fig. S1. Comparison of CNVs inferred by Copy-scAT and by WGS for adult GBM samples.
(A) Comparison of chromosome arm level losses detected in three adult GBM samples by single cell ATAC, WGS, or both methods.
(B) Comparison of focal amplifications detected in three adult GBM samples by scATAC, WGS, or both methods.

Fig. S2. Comparison of CNVs inferred by Copy-scAT or WGS in pediatric GBM samples.
(a) Gains detected in three pediatric GBM samples compared to linked-reads WGS.
(b) Losses detected in three pediatric GBM samples compared to linked-reads WGS.

Fig. S3. Comparison of CNVs inferred by Copy-scAT or with the scCNV assay in multiple myeloma samples.
(a) Comparison of gains seen in additional myeloma samples versus 10x single-cell CNV sequencing. (b) Comparison of chromosome losses seen in additional myeloma samples versus 10x single-cell CNV sequencing. (c,d) Number of gains and losses detected by both methods compared to number of cells in scATAC-seq sample. (e-f) Number of shared gains or losses detected between the two methods, plotted versus the number of cells in the scATAC-seq experiment. (g-h) Number of shared gains or losses detected between the two methods, plotted versus the number of reads per cell in the scATAC-seq experiment.

Fig. S4. CNVs are detected in scATAC clusters with Copy-scAT in pediatric GBM samples.
(a) Overview of cell assignments in two paired patient libraries.
(b-d) Representative WGS-confirmed alterations detected in pCGY2932 and pCGY2937.

Fig. S5. CNVs are identified by Copy-scAT in specific scATAC clusters in multiple myeloma samples.
(a) Gain of chromosome 11p restricted to neoplastic cell populations. (b) Similar pattern with gain of chromosome 11q. (c) Similar pattern with loss of chr13q.

Fig. S6. Additional chromosome copy number analyses for CGY4218.
(a) Initial neighbourhood clustering results from Signac.
(b-f) Representative chromosome-level copy number alteration profiles for tumour and normal cells. (g-n) Representative motif scores from ChromVAR for different motifs, including (g) ELF5, (h) SPIB, (i) ASCL1, (j) IKZF1, (k) NEUROD1, (l) NFIC, (m) NFYA, (n) ELK3.

Fig. S7. Representative copy number information and distribution for aGBM sample CGY4250.
(a) Neighbourhood clustering results from Signac.
(b-c) Distribution of amplifications in EGFR and MDM2.
(d-i) Representative chromosome-level copy number alteration profiles for tumour and normal cells.
(j-l) Representative motif scores from ChromVAR for different motifs, including (j) NFIC, (k) SPIB and (l) FOXG1.

Fig. S8. Representative copy number information and distribution for aGBM sample CGY4275.
(a) Neighbourhood clustering results from Signac.
(b) Distribution of amplifications in EGFR.
(c-j) Representative chromosome-level copy number alteration profiles for tumour and normal cells. (g-l) Representative motif scores from ChromVAR for different motifs, including (g) NFIC, (h) FOS::JUN, (i) NEUROD1, (j) ELF5, (k) SPIB, and (l) IKZF1.

Fig. S9. Effects of removing CNVs on variance in aGBM sample CGY4349.
(a) Distribution of the top 2000 most variable peaks in the tumour cells after filtering out non-neoplastic cells; p value from Chi-squared test.
(b) Distribution of top 2000 most variable peaks in non-neoplastic cells after filtering (p value from Chi-squared test). Chromosomes with CNVs or amplification regions are highlighted in pink.
(c) Distribution of top 2000 most variable peaks in tumour cells after filtering of non-neoplastic cells and removal of regions containing CNVs (p value from Chi-squared test).

Fig. S10. Validation of Copy-scAT and identification of putative proliferative cells in non-neoplastic datasets. (a) Chromosome copy number distribution in a 10X dataset of 5000 human PBMCs. (b) Seurat clusters for the 10X dataset of 5000 human PBMCs. (c) Estimate of cycle status for the 10X dataset of 5000 human PBMCs. (d) Chromosome copy number distribution in a 10X dataset of mouse embryonic brain at E18. (e,f) Predicted cycle status and cluster assignments in E18 mouse brain. (g,h) Predicted cell cycle status and cluster profile in P50 mouse brain dataset from 10X.

Fig. S11. Pseudotime trajectory analysis of aGBM sample CGY4250. Distribution of EGFR amplification (a) and cell cycle status (b) amongst branches.
Distribution of ChromVAR motif scores in branches for proneural motifs ASCL1 and OLIG2 (c,d), ETV1 (e), NFIX (f), and mesenchymal motifs JUN::JUNB (g) and STAT3 (h).

Fig. S12. Pseudotime trajectory analysis of aGBM sample CGY4349. Distribution of PDGFRA amplification (a) and cycling status (b) amongst branches. Distribution of ChromVAR motif scores in branches for proneural motifs ASCL1 and OLIG2 (c,d), ETV1 (e), NFIX (f), and mesenchymal motifs JUN::JUNB (g) and STAT3 (h).

Fig. S13. Pseudotime trajectory analysis of aGBM sample CGY4275. Distribution of ChromVAR motif scores in branches for proneural motifs ASCL1 and OLIG2 (a,b), ETV1 (c), NFIX (d), and mesenchymal motifs JUN::JUNB (e) and STAT3 (f).
10_1101-2020_09_23_308239 ---- 2599906
The COVID-19 PHARMACOME: A method for the rational selection of drug repurposing candidates from multimodal knowledge harmonization

Bruce Schultz1, Andrea Zaliani2,3, Christian Ebeling1, Jeanette Reinshagen2,3, Denisa Bojkova11, Vanessa Lage-Rupprecht1, Reagon Karki1, Sören Lukassen7, Yojana Gadiya1, Neal G. Ravindra8, Sayoni Das4, Shounak Baksi6, Daniel Domingo-Fernández1, Manuel Lentzen1, Mark Strivens4, Tamara Raschka1, Jindrich Cinatl11, Lauren Nicole DeLong1, Phil Gribbon2,3, Gerd Geisslinger3,9,10, Sandra Ciesek10,11,12, David van Dijk8, Steve Gardner4, Alpha Tom Kodamullil1, Holger Fröhlich1, Manuel Peitsch5, Marc Jacobs1, Julia Hoeng5, Roland Eils7, Carsten Claussen2,3 and Martin Hofmann-Apitius*1

1 Fraunhofer Institute for Algorithms and Scientific Computing SCAI, Department of Bioinformatics, Institutszentrum Birlinghoven, 53754 Sankt Augustin, Germany
2 Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, ScreeningPort, 22525 Hamburg, Germany
3 Fraunhofer Cluster of Excellence for Immune Mediated Diseases, CIMD, External partner site, 22525 Hamburg, Germany
4 PrecisionLife Ltd.
Unit 8b Bankside, Hanborough Business Park, Long Hanborough, Oxfordshire, OX29 8LJ, United Kingdom
5 Philip Morris International R&D, Biological Systems Research, R&D Innovation Cube T1517.07, Quai Jeanrenaud 5, CH-2000 Neuchatel, Switzerland
6 Causality BioModels Pvt Ltd., Kinfra Hi-Tech Park, Kerala Technology Innovation Zone (KTIZ), Kalamassery, Cochin, 683503, India
7 Center for Digital Health, Charité Universitätsmedizin Berlin & Berlin Institute of Health (BIH)
8 Center for Biomedical Data Science, Yale School of Medicine, Yale University, 333 Cedar Street, New Haven, CT 06510, USA
9 Pharmazentrum Frankfurt/ZAFES, Institut für Klinische Pharmakologie, Klinikum der Goethe-Universität Frankfurt, 60590 Frankfurt am Main, Germany
10 Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, 60596 Frankfurt am Main, Germany
11 Institute for Medical Virology, University Hospital Frankfurt, 60590 Frankfurt am Main, Germany
12 DZIF, German Centre for Infection Research, External partner site, 60596 Frankfurt am Main, Germany
* martin.hofmann-apitius@scai.fraunhofer.de

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.308239; this version posted February 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract
The SARS-CoV-2 pandemic has challenged researchers at a global scale. The scientific community’s massive response has resulted in a flood of experiments, analyses, hypotheses, and publications, especially in the field of drug repurposing.
However, many of the proposed therapeutic compounds obtained from SARS-CoV-2 specific assays are not in agreement and thus demonstrate the need for a single source of COVID-19 related information from which a rational selection of drug repurposing candidates can be made. In this paper, we present the COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph generated from a compilation of 10 separate disease maps and sources of experimental data focused on SARS-CoV-2 / COVID-19 pathophysiology. By applying our systematic approach, we were able to predict the synergistic effect of specific drug pairs, such as Remdesivir and Thioguanosine or Nelfinavir and Raloxifene, on SARS-CoV-2 infection. Experimental validation of our results demonstrates that our graph can be used not only to explore the involved mechanistic pathways, but also to identify novel combinations of drug repurposing candidates.

Introduction and Motivation
COVID-19 is the term coined for the pandemic caused by SARS-CoV-2. Unprecedented in the history of science, this pandemic has elicited a worldwide, collaborative response from the scientific community. In addition to the strong focus on the epidemiology of the virus1,2,3, experiments aimed at understanding mechanisms underlying the pathophysiology of the virus have led to new insights in a comparatively short amount of time4,5,6,7. In the field of computational biology, several initiatives have started generating disease maps that represent the current knowledge pertaining to COVID-19 mechanisms8,9,10,11.
Such disease maps have proven valuable before in diverse areas of research12,13,14,15. When taken together with related work including cause-and-effect modeling8, entity relationship graphs16, and pathways17, these disease maps represent a considerable collection of highly curated “knowledge graphs” which focus primarily on COVID-19 biology. Here, we use the term “mechanism” to describe one or more cause-and-effect relationships (i.e. a subgraph), “pathways” to refer to a well-established series of interactions resulting in cellular change or a defined product, and “models” to describe a collection of experimental data or known interactions defined in the context of a particular biological process or pathology. As of July 2020, a collection consisting of 10 models representing core knowledge about the pathophysiology of SARS-CoV-2 and its primary target, the lung epithelium, was shared with the public. With the rapidly increasing generation of data (e.g. transcriptome18, interactome19, and proteome20 data), we are now in the position to challenge and validate these COVID-19 pathophysiology knowledge graphs with experimental data. This is of particular interest as validation of these knowledge graphs bears the potential to identify those disease mechanisms highly relevant for targeting in drug repurposing approaches.

The concept of drug repurposing (the secondary use of already developed drugs for therapeutic uses other than those they were designed for) is not new.
The major advantage of drug repurposing over conventional drug development is the massive decrease in time required for development, as important steps in the drug discovery workflow have already been successfully passed for these compounds21,22. Our group and many others have already begun performing assays to screen for experimental compounds and approved drugs to serve as new therapeutics for COVID-19. Dedicated drug repurposing collections, such as the Broad Institute library23 and the even more comprehensive ReFRAME library24, were used to experimentally screen either against viral proteins as targets for functional inhibition25 or against virally infected cells in phenotypic assays26. In our own work, compounds were assessed for their inhibition of virus-induced cytotoxicity using the human cell line Caco-2 and a SARS-CoV-2 isolate27. A total of 63 compounds with IC50 < 20 µM were identified, of which 90% had not previously been reported as being active against SARS-CoV-2. Of the active compounds, 31 are approved drugs, 23 are in phases 1-3 and 9 are preclinical candidate molecules. The described mechanisms of action for the inhibitors included kinase signaling, PDE activity modulation, and long chain acyl transferase inhibition (e.g. “azole class antifungals”). The approach presented here integrates experimental results and the output from other informatic pipelines, and combines proprietary and public data to provide a comprehensive overview on the therapeutic efficacy of candidate compounds, the mechanisms targeted by
these candidate compounds, and a rational approach to test the drug-mechanism associations for their potential in combination therapy.

Methodology

Generation of the COVID-19 PHARMACOME
Disparate COVID-19 disease maps focus on different aspects of COVID-19 pathophysiology. Based on comparisons of the COVID-19 knowledge graphs, we found that no single disease map covers all aspects relevant for the understanding of the virus, host interaction, and the resulting pathophysiology. Thus, we optimized the representation of essential COVID-19 pathophysiology mechanisms by integrating several public and proprietary COVID-19 knowledge graphs, disease maps, and experimental data (Supplementary Table 1) into one unified knowledge graph, the COVID-19 supergraph. To this end, we converted all knowledge graphs and interactomes into OpenBEL28, a language that is both ideally suited to capture and represent “cause-and-effect” relationships in biomedicine and fully interoperable with major pathway databases29,30. In order to ensure that molecular interactions were correctly normalized, individual pipelines were constructed for each model to convert the raw data to the OpenBEL format. For example, the COVID-19 Disease Map contained 16 separate files, each of which represented a specific biological focus of the virus.
Each file was parsed individually and the entities and relationships that did not adhere to the OpenBEL grammar were mapped accordingly. Whilst most of the entities and relationships in the source disease maps could be readily translated into OpenBEL, a small number of triples from different source disease maps required a more in-depth transformation. When classic methods of naming objects in triples failed, the recently generated COVID-19 ontology31 as well as other available standard ontologies and vocabularies were used to normalize and reference these entities.

In addition to combining the listed models, we also performed a dedicated curation of the COVID-19 supergraph in order to annotate the mechanisms pertaining to selected targets and the biology around prioritized repurposing candidates. The resulting BEL graphs were quality controlled and subsequently loaded into a dedicated graph database system underlying the Biomedical Knowledge Miner (BiKMi), which allows for comparison and extension of biomedical knowledge graphs (see http://bikmi.covid19-knowledgespace.de). Once the models were converted to OpenBEL and imported into the database, the resulting nodes from each mechanism-based model were compared (Figure 1).
Even when separated by data origin type, the COVID-19 knowledge graphs had very little overlap (3 shared nodes between all manually curated models and no shared nodes between all models derived from interaction databases), but by unifying the models, our COVID-19 supergraph substantially improves the coverage of essential virus- and host-physiology mechanisms.

Figure 1: Venn diagrams comparing major mechanistic models in the COVID-19 supergraph. Mechanism-based models were divided, and their entities compared within their resulting subgroups. Model abbreviations are defined in Supplementary Table 1. a) Manual node comparison shows the overlap of entities in the models that are knowledge-based, manually curated relationships that have been directly encoded in OpenBEL. b) Automated node comparison shows the overlap of entities in models re-encoded into OpenBEL from other formats (e.g. SBML models).

Additionally, by enriching the COVID-19 supergraph with drug-target information linked from highly curated drug-target databases (DrugBank, ChEMBL, PubChem), we created an initial version of the COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph representing COVID-19 pathophysiology mechanisms that includes both drug targets and their ligands (Figure 2). In order to maximize its utility, this network includes both experimentally validated drug-target relationships and a wide distribution of biological entities and concepts (Supplementary Figure 1).
The entire COVID-19 PHARMACOME was manually inspected and re-curated; this graph database is openly accessible to the scientific community at http://graphstore.scai.fraunhofer.de.

Figure 2: The COVID-19 supergraph integrates drug-target information to form the COVID-19 PHARMACOME. a) An aggregate of 10 constituent COVID-19 computable models covering a wide spectrum of pathophysiological mechanisms associated with SARS-CoV-2 infection is harmonized to generate the mechanism-based COVID-19 supergraph. b) The COVID-19 supergraph is annotated with drug-target information from a variety of curated sources to generate the COVID-19 PHARMACOME composed of 150662 nodes (representing proteins, pathologies, and other biological entities/concepts) and 573929 edges (indicating relationships or interactions between the pair of nodes they connect).

Systematic review and integration of information from phenotypic screening
At the time of the writing of this paper, six phenotypic cellular screening experiments had been shared via archive servers and journal publications (Supplementary Table 2). Although only a limited number of these manuscripts have been officially accepted and published, we were able to extract their primary findings from the pre-publication archive servers. The significant number of reports on drug repurposing screenings in the COVID-19 context demonstrates how appealing the concept of drug repurposing is as a quick answer to the challenge of a global pandemic.
Drug repurposing screenings were all performed with compounds for which a significant amount of information on safety in humans and primary mechanism of action is available. We generated a list of “hits” from cellular screening experiments, while results derived from publications that reported on in-silico screening were ignored. We thereby keep a strict focus on well-characterized, well-understood candidate molecules, so that one of the pivotal advantages of this knowledge base is its use for drug repurposing.

Subgraph annotation
The COVID-19 PHARMACOME contains several subgraphs, three of which correspond to major views on the biology of SARS-CoV-2 as well as the clinical impact of COVID-19:
- the viral life cycle subgraph focuses on the stages of viral infection, replication, and spreading.
- the host response subgraph represents essential mechanisms active in host cells infected by the virus.
- the clinical pathophysiology subgraph illustrates major pathophysiological processes of clinical relevance.
These subgraphs were annotated by identifying nodes within the COVID-19 PHARMACOME that represent specific biological processes or pathologies associated with each subgraph category and traversing out to their first-degree neighbors. For example, a biological process node representing “viral translation” would be classified as a starting node for the viral life cycle subgraph while a node defined as “defense response to virus” would be categorized as belonging to the host response subgraph.
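
The annotation step described above (seed nodes plus their first-degree neighbors) can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy edge list, the gene symbols, the `SEEDS` mapping, and the function names are all hypothetical, and the graph is assumed to be undirected for the purpose of neighbor lookup.

```python
from collections import defaultdict

# Toy edge list; node names and edges are invented for illustration only.
EDGES = [
    ("viral translation", "RPS6"),
    ("viral translation", "EIF4E"),
    ("defense response to virus", "IFNB1"),
    ("defense response to virus", "RPS6"),
    ("IFNB1", "STAT1"),  # STAT1 is two steps from any seed node
]

# Hypothetical mapping from subgraph category to its starting (seed) nodes.
SEEDS = {
    "viral life cycle": {"viral translation"},
    "host response": {"defense response to virus"},
}


def build_adjacency(edges):
    """Build an undirected adjacency map (assumption: direction is ignored)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def annotate_subgraphs(edges, seeds):
    """Return, per category, the seed nodes plus their first-degree neighbors."""
    adjacency = build_adjacency(edges)
    subgraphs = {}
    for category, seed_nodes in seeds.items():
        members = set(seed_nodes)
        for seed in seed_nodes:
            members |= adjacency.get(seed, set())
        subgraphs[category] = members
    return subgraphs


subgraphs = annotate_subgraphs(EDGES, SEEDS)
```

In this sketch a node can belong to more than one subgraph (here the invented node `RPS6`), and nodes two or more steps away from a seed are left out.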
Though the viral life cycle and host response subgraphs contain a wide variety of node types, the pathophysiology subgraph is restricted to pathology nodes associated with either the SARS-CoV-2 virus or the COVID-19 pathology.

Mapping of gene expression data onto the COVID-19 PHARMACOME
Two single cell sequencing data sets representing infected and non-infected cells directly derived from human samples32 and cultured human bronchial epithelial cells33 (HBECs) were used to identify the areas of the COVID-19 PHARMACOME responding at the gene expression level to SARS-CoV-2 infection. Details of the gene expression data processing and mapping are available in the supplementary material (section gene expression data analysis).

Pathway enrichment
Associated pathways for subgraphs and significant targets were identified using the Enrichr34 feature of the gseapy Python package35. Briefly, gene symbol lists were assembled from their respective subgraph or dataset and compared against multiple pathway gene set libraries including Reactome, KEGG, and WikiPathways. To account for multiple comparisons, p-values were corrected using the Benjamini-Hochberg36 method and results with p-values < 0.01 were considered significantly enriched.

Drug repurposing screening
We performed phenotypic assays to screen for repurposing drugs that inhibit the replication and the cytopathic effects of virus infection.
Caco-2 cells were incubated with a derivative of the Broad repurposing library before being infected with an isolate of SARS-CoV-2 (FFM-1 isolate, see 37). Survival of cells was assessed using a cell viability assay and measured by high-content imaging using the Operetta CLS platform (PerkinElmer). Details of the drug repurposing screening are described in the supplemental material.

Drug combinations assessment with anti-cytopathic effect measured in Caco-2 cells
As described in Ellinger et al.38, we challenged four combinations of five different compounds with the SARS-CoV-2 virus in four 96-well plates containing two drugs each. Eight drug concentrations were chosen ranging from 20 µM to 0.01 µM, diluted by a factor of 3 and positioned orthogonally to each other in rows and columns. No pharmacological control was used, only cells with and without exposure to SARS-CoV-2 virus at 0.01 MOI. In addition, recently published data from the work of Bobrowski et al.39 were mapped to the COVID-19 PHARMACOME and compared to the results of the combinatorial treatment experiments performed here.

Results

Comparative analysis of the hits from different repurposing screenings
Data from six published drug repurposing screenings were downloaded, and extensive mapping and curation was performed in order to harmonize chemical identifiers.
The curated list of drug repurposing “hits”, together with an annotation of the assay conditions, is available at http://chembl.blogspot.com/2020/05/chembl27-sars-cov-2-release.html. Initially, we analyzed the overlap between compounds identified in the reported drug repurposing screening experiments. Figure 3A shows no overlap between experiments, which is not surprising, as we are comparing highly specific candidate drug experiments with screenings based on large drug repositioning libraries. However, the overlap is still quite marginal for those screenings where large compound collections (Broad library, ReFRAME library) have been used.

Figure 3: Overlap of compound hits between different drug repurposing screening experiments. a) Direct comparison of overlapping hits in drug repurposing screenings revealed no overlap between the experiments. These experiments were performed using different cell types (Vero E6 cells and Caco-2 cells). b) Protein target space overlap between different COVID-19 drug repurposing screenings. Drug targets were identified by confidence level >= 8 and single protein targets according to the ChEMBL database. Comparison of experiments indicates over one hundred common protein targets.
Mapping of repurposing hits to target proteins
In order to identify which proteins are targeted by the repurposing hits, and to investigate the extent to which there are overlaps between repurposing experiments at the target/protein level, we mapped all the identified compounds from the drug repurposing experiments to their respective targets. As most drugs bind to more than one target, we increase the likelihood of overlaps between the drug repurposing experiments when we compare them in the protein/target space. Indeed, Figure 3B shows an overlap of 112 targets between all the drug repurposing experiments, thereby creating a list of potential proteins for therapeutic intervention when the compound targets are considered rather than the compounds themselves.

The COVID-19 PHARMACOME associates pathways derived from drug repurposing targets with pathophysiology mechanisms
A non-redundant list of drug repurposing candidate molecules that display activity in phenotypic (cellular) assays was generated and mapped to the COVID-19 PHARMACOME. Figure 4 shows the distribution of repurposing drugs in the COVID-19 cause-and-effect graph, the “responsive part” of the graph that is characterized by changes in gene expression associated with SARS-CoV-2 infection, and the overlap between the two subgraphs. This overlap analysis allows for the identification of repurposing drugs targeting mechanisms that are modulated by viral infection. A total of 870 mechanisms were identified as being targeted by most of the drug repurposing candidates (see section “Associated pathway identification” in supplementary materials).
When compared to the annotated subgraphs in the COVID-19 PHARMACOME, 201 of the 227 associated pathways determined for the viral life cycle subgraph overlapped with those for the drug repurposing targets, while the host response subgraph shared 90 of its 105 pathways.

Mapping of drug repurposing signals to hypervariable regions of the COVID-19 PHARMACOME
One of the key questions arising from the network analysis is whether the mechanisms targeted by the repurposing drugs are specifically activated during viral infection. In order to establish this link, we mapped differential gene expression analyses from two single-cell sequencing studies to our COVID-19 PHARMACOME (see section “Differential Gene Expression” in supplementary material). An overlay of differential gene expression data (adjusted p-value ≤ 0.1 and abs(log fold-change) > 0.25) on the COVID-19 PHARMACOME reveals a distinct pattern characterized by high responsiveness (expressed by variation of regulation of gene expression) to the viral infection (Figure 4A).
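
As a rough illustration of the overlay thresholds quoted above (adjusted p-value ≤ 0.1 and abs(log fold-change) > 0.25), the following sketch filters a list of differential expression results and keeps only the genes that are also nodes in a graph. The record type, the placeholder gene names, and the numeric values are invented for this example; the actual processing is described in the supplementary material and may differ.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    symbol: str
    log_fold_change: float
    adjusted_p: float


def is_responsive(result, max_adjusted_p=0.1, min_abs_log_fc=0.25):
    """Apply the two thresholds quoted in the text: p <= 0.1 and |logFC| > 0.25."""
    return (result.adjusted_p <= max_adjusted_p
            and abs(result.log_fold_change) > min_abs_log_fc)


def responsive_nodes(results, graph_nodes):
    """Keep genes that pass the filter and are present as nodes in the graph."""
    return {r.symbol for r in results if is_responsive(r) and r.symbol in graph_nodes}


# Invented example values; GENE_A..GENE_E are placeholders, not real findings.
results = [
    GeneResult("GENE_A", 1.20, 0.001),  # passes both thresholds
    GeneResult("GENE_B", -0.80, 0.10),  # passes: p on the boundary, down-regulated
    GeneResult("GENE_C", 0.25, 0.01),   # fails: |logFC| is not strictly above 0.25
    GeneResult("GENE_D", 2.00, 0.50),   # fails: adjusted p too high
    GeneResult("GENE_E", 0.90, 0.02),   # passes the filter but is not in the graph
]
graph_nodes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

selected = responsive_nodes(results, graph_nodes)
```

Using the absolute value of the log fold-change keeps both up- and down-regulated genes, which matches the text's description of responsiveness as variation in regulation.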
Figure 4: Identification of suitable targets for combination therapy by comparing subgraphs within the COVID-19 PHARMACOME. Incorporation of gene expression data into the COVID-19 PHARMACOME resulted in a subgraph characterized by the entities (genes/proteins) that respond to viral infection (a). Mapping of the filtered results obtained from drug repurposing screenings (IC50 < 10 µM) to the PHARMACOME resulted in a subgraph enriched for drug repurposing targets (b). The intersection between subgraphs presented in (a) and (b) is highly enriched for drug repurposing targets directly linked to the viral infection response (c).

Virus-response mechanisms are targets for repurposing drugs
In the next step, we analyzed which areas of the COVID-19 graph respond to SARS-CoV-2 infection (indicated by significant variance in gene expression) and are targets for repurposing drugs. To this end, we mapped signals from the drug repurposing screenings to the subgraph that showed responsiveness to SARS-CoV-2 infection (Figure 4B). Figure 4C depicts the resulting subgraph that is characterized by the transcriptional response to SARS-CoV-2 infection and the presence of target proteins of compounds that have been identified in drug repurposing screening experiments.
The COVID-19 PHARMACOME supports rational targeting strategies for COVID-19 combination therapy
We mapped existing combinatorial therapy data to the COVID-19 PHARMACOME in order to evaluate its potential in guiding rational approaches towards combination therapy using repurposing drug candidates. Combinatorial treatment data obtained from the results published by Bobrowski et al.40 and Ellinger et al.41 were mapped to the COVID-19 PHARMACOME. Figure 5 provides an overview of the mapped compounds, their protein targets, and the interaction mechanisms. Analysis of the overlaps between the drug repurposing screening data showed that four of the ten compounds reported in the synergistic treatment approaches were represented in our initial non-redundant set of candidate repurposing drugs.

Figure 5: Visualization of drug repurposing candidates (and their targets) used in combination treatment experiments. The subgraph depicts the drug repurposing candidate molecules in relation to each other and their targets. Shortest path lengths between drug combinations were calculated from this subgraph and are available in the supplementary material (Supplementary Table 5).
Based on the association between repurposing drug candidates and the areas of the COVID-19 PHARMACOME that respond to SARS-CoV-2 infection (Figure 4), we hypothesized that the number of edges between a pair of drug nodes may be linked to the effectiveness of the drug combination (Supplementary Figure 2). In order to evaluate whether the determined outcome of a combination of drugs correlated with the distance between said drug nodes, we compared distances for combinations of drugs within the COVID-19 PHARMACOME for which the effect was known (Supplementary Tables 3 & 5). Of the 47 drug combinations we were able to check within the COVID-19 PHARMACOME, we found that the pairs of drugs known to have a synergistic effect in the treatment of SARS-CoV-2 had an average shortest path length of 2.43, while antagonistic combinations were found to be farther apart with an average shortest path length of 4.0 (Supplementary Table 7). Based on our calculations, we formulated three categories for predicting the outcome of new drug combinations on infection using the shortest path lengths between them within the COVID-19 PHARMACOME. A shortest path length of 2 indicates a synergistic relationship between the compounds, a length of 3 was deemed inconclusive as our calculations did not justify a specific outcome, and combinations with a shortest path length of 4 or more were predicted to have an antagonistic relationship.
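
The three-category rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the graph is treated as undirected and unweighted, the shortest path is found with a plain breadth-first search, path lengths below 2 are grouped with the synergistic category (the text only names a length of 2), and all node names in the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency map (assumption: edge direction is ignored)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


def predict_combination(length):
    """Map a shortest path length to the categories described in the text."""
    if length is None:
        return "no prediction"   # drugs not connected in the graph
    if length <= 2:              # assumption: lengths below 2 are treated like 2
        return "synergistic"
    if length == 3:
        return "inconclusive"
    return "antagonistic"        # 4 or more edges apart


# Toy graph with invented drug, target, and process nodes.
EDGES = [
    ("drug_A", "target_1"),
    ("drug_B", "target_1"),
    ("target_1", "process_X"),
    ("process_X", "drug_C"),
    ("process_X", "target_3"),
    ("target_3", "drug_D"),
]
adjacency = build_adjacency(EDGES)

predictions = {
    partner: predict_combination(shortest_path_length(adjacency, "drug_A", partner))
    for partner in ("drug_B", "drug_C", "drug_D", "drug_E")
}
```

In this toy graph, two drugs that share a target sit two edges apart, which is the shortest distance the rule associates with synergy.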
In order to test our ability to predict the outcome of novel drug combinations, we selected five compounds: Remdesivir (a virus replicase inhibitor), Nelfinavir (a virus protease inhibitor), Raloxifene (a selective estrogen receptor modulator), Thioguanosine (a chemotherapy compound interfering with cell growth), and Anisomycin (a pleiotropic compound with several pharmacological activities, including inhibition of protein synthesis and nucleotide synthesis). These compounds were used in four different combinations (Remdesivir/Thioguanosine, Remdesivir/Raloxifene, Remdesivir/Anisomycin and Nelfinavir/Raloxifene) to test the potency of these drug pairings in phenotypic, cellular assays. Figure 6 shows the results of these combinatorial treatments on the virus-induced cytopathic effect in Caco-2 cells.

Figure 6: Dose-response curves (DRC) depicting viral inhibition of SARS-CoV-2 by select drug combinations. a) A threshold effect can be seen with the Remdesivir/Anisomycin combination when Anisomycin reaches 20 µM, well beyond Anisomycin’s IC50 alone. Remdesivir activity does not appear to be affected by Anisomycin, while Remdesivir seems to be equally affected (de-potentiated) by low to high concentrations of Raloxifene. b) Viral inhibition for Remdesivir/Thioguanosine can be seen only at lower Thioguanosine concentrations; at higher concentrations, the clear curve shift of Remdesivir at lower concentration (effect beyond Loewe’s additivity formula) could not be appreciated.
c) Raloxifene had an antagonistic effect on Remdesivir’s viral replication inhibition activity. d) A clear shift in Nelfinavir’s DRC can be observed when combined with Raloxifene, although the data also suggest a threshold effect when Raloxifene concentrations are higher than 2.2 µM.

Our results indicate that compound combinations acting on different viral mechanisms, such as Remdesivir and Thioguanosine (Figure 6b) or Nelfinavir and Raloxifene (Figure 6d), showed synergy, while compounds acting on host mechanisms, for instance Anisomycin or Raloxifene, when combined with Remdesivir (Figure 6a and Figure 6c, respectively), resulted in neither synergistic nor additive effects. Interestingly, our experiments revealed that the HIV-Protease inhibitor Nelfinavir, which already appeared to be active against viral post-entry fusion steps of both SARS-CoV42 and SARS-CoV-243, displayed synergistic effects when combined with high concentrations of Raloxifene. This result agrees with our predictions generated using the COVID-19 PHARMACOME, in which the drug combination with the shortest distance, Raloxifene and Nelfinavir (Supplementary Table 5), would have a synergistic effect on SARS-CoV-2 pathology.
Discussion

By combining a significant number of knowledge graphs that represent various aspects of COVID-19 pathophysiology and drug-target information, we were able to generate the COVID-19 PHARMACOME, a unique resource that covers a wide spectrum of cause-and-effect knowledge about SARS-CoV-2 and its interactions with the human host. Based on a systematic review of the results derived from published drug repurposing screening experiments, as well as our own drug repurposing screening results, we were able to identify mechanisms targeted by a variety of compounds showing virus inhibition in phenotypic, cellular assays. With the COVID-19 PHARMACOME, we are now able to link repurposing drugs, their targets and the mechanisms modulated by said drugs within one computable data structure, thereby enabling us to target different, independent mechanisms in a combinatorial treatment approach. By challenging the COVID-19 PHARMACOME with gene expression data, we have identified subgraphs that are responsive (at gene expression level) to virus infection. Network analysis, along with the overview of previous repurposing experiments, provided us with the insights needed to select the optimal repurposing drug candidates for combination therapy. Experimental verification showed that this systematic approach is valid; we were able to identify two drug-target-mechanism combinations that demonstrated synergistic action of the repurposed drugs targeting different mechanisms in combinatorial treatments.

We are fully aware that the COVID-19 PHARMACOME combines experimental results generated under different assay conditions. In the course of our work, we accumulated evidence that assay responses recorded using Vero E6 cells may only partially overlap with those recorded using Caco-2 cells.
Comparative analysis of the responses of both assay systems to virus infection by means of transcriptome-wide gene expression analysis is one of the experiments we plan to perform next. However, for the identification of meaningful combinations of repurposing drugs, the current model-driven information fusion approach was shown to work well despite the putative differences between drug repurposing screening assays.

Given the urgent need for treatments that work in an acute infection situation, the approach described here paves the way for systematic and rational approaches towards combination therapy of SARS-CoV-2 infections. We want to encourage all our colleagues to make use of the COVID-19 PHARMACOME, improve it, and add useful information about pharmacological findings (e.g. from candidate repurposing drug combination screenings). In addition to vaccination and antibody therapy, (combination) treatment with small molecules remains one of the key therapeutic options for combatting COVID-19. The COVID-19 PHARMACOME will therefore be continuously improved and expanded to serve integrative approaches in anti-SARS-CoV-2 drug discovery and development.
Acknowledgements

This project is supported in part by the European Union's Horizon 2020 research and innovation program under grant agreement No 101003551, project Exscalate4CoV.

References

1 Xu, B., Gutierrez, B., Mekaru, S., Sewalk, K., Goodwin, L., Loskill, A., ... & Zarebski, A. E. (2020). Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific data, 7(1), 1-6.
2 Lipsitch, M., Swerdlow, D. L., & Finelli, L. (2020). Defining the epidemiology of Covid-19: studies needed. New England journal of medicine, 382(13), 1194-1196.
3 Holmdahl, I., & Buckee, C. (2020). Wrong but Useful: What Covid-19 Epidemiologic Models Can and Cannot Tell Us. New England Journal of Medicine.
4 Cao, W., & Li, T. (2020). COVID-19: towards understanding of pathogenesis. Cell Research, 1-3.
5 Liao, M., Liu, Y., Yuan, J., Wen, Y., Xu, G., Zhao, J., ... & Liu, L. (2020). Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nature Medicine, 1-3.
6 Tay, M. Z., Poh, C. M., Rénia, L., MacAry, P. A., & Ng, L. F. (2020). The trinity of COVID-19: immunity, inflammation and intervention. Nature Reviews Immunology, 1-12.
7 Gervasoni, S.; Vistoli, G.; Talarico, C.; Manelfi, C.; Beccari, A.R.; Studer, G.; Tauriello, G.; Waterhouse, A.M.; Schwede, T.; Pedretti, A. A Comprehensive Mapping of the Druggable Cavities within the SARS-CoV-2 Therapeutically Relevant Proteins by Combining Pocket and Docking Searches as Implemented in Pockets 2.0. Int. J. Mol. Sci. 2020, 21, 5152.
8 Ostaszewski, M., Mazein, A., Gillespie, M. E., Kuperstein, I., Niarakis, A., Hermjakob, H., ... & Schreiber, F. (2020). COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms. Scientific data, 7(1), 1-4.
9 Domingo-Fernandez, D. et al. COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. Bioinformatics. btaa834 (2020).
10 Gysi, D. M., Valle, Í. D., Zitnik, M., Ameli, A., Gan, X., Varol, O., ... & Barabási, A. L. (2020). Network medicine framework for identifying drug repurposing opportunities for covid-19. arXiv preprint arXiv:2004.07229.
11 Khan, J. Y., Khondaker, M., Islam, T., Hoque, I. T., Al-Absi, H., Rahman, M. S., ... & Rahman, M. S. (2020). COVID-19Base: A knowledgebase to explore biomedical entities related to COVID-19. arXiv preprint arXiv:2005.05954.
12 Kuperstein, I., Bonnet, E., Nguyen, H. A., Cohen, D., Viara, E., Grieco, L., ... & Dutreix, M. (2015). Atlas of Cancer Signalling Network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis, 4(7), e160-e160.
13 Kodamullil, A. T., Younesi, E., Naz, M., Bagewadi, S., & Hofmann-Apitius, M. (2015). Computable cause-and-effect models of healthy and Alzheimer's disease states and their mechanistic differential analysis. Alzheimer's & Dementia, 11(11), 1329-1339.
14 Fujita, K. A., Ostaszewski, M., Matsuoka, Y., Ghosh, S., Glaab, E., Trefois, C., ... & Diederich, N. (2014). Integrating pathways of Parkinson's disease in a molecular interaction map. Molecular neurobiology, 49(1), 88-102.
15 Matsuoka, Y. et al. A comprehensive map of the influenza A virus replication cycle. BMC Syst. Biol. 7, 97 (2013).
16 Khan, J. Y., Khondaker, M., Islam, T., Hoque, I. T., Al-Absi, H., Rahman, M. S., ... & Rahman, M. S. (2020). COVID-19Base: A knowledgebase to explore biomedical entities related to COVID-19. arXiv preprint arXiv:2005.05954.
17 Ostaszewski, M., Mazein, A., Gillespie, M. E., Kuperstein, I., Niarakis, A., Hermjakob, H., ... & Schreiber, F. (2020). COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms. Scientific data, 7(1), 1-4.
18 Blanco-Melo, D., Nilsson-Payant, B. E., Liu, W. C., Uhl, S., Hoagland, D., Møller, R., ... & Wang, T. T. (2020). Imbalanced host response to SARS-CoV-2 drives development of COVID-19. Cell.
19 Gordon, D. E., Jang, G. M., Bouhaddou, M., Xu, J., Obernier, K., White, K. M., ... & Tummino, T. A. (2020). A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature, 1-13.
20 Bojkova, D., Klann, K., Koch, B., Widera, M., Krause, D., Ciesek, S., ... & Münch, C. (2020). Proteomics of SARS-CoV-2-infected host cells reveals therapy targets. Nature, 1-8.
21 Ashburn, T. T., & Thor, K. B. (2004). Drug repositioning: identifying and developing new uses for existing drugs. Nature reviews Drug discovery, 3(8), 673-683.
22 Pushpakom, S., Iorio, F., Eyers, P. A., Escott, K. J., Hopper, S., Wells, A., ... & Norris, A. (2019). Drug repurposing: progress, challenges and recommendations. Nature reviews Drug discovery, 18(1), 41-58.
23 http://rdcu.be/qKdS
24 https://doi.org/10.1073/pnas.1810137115
25 https://reframedb.org/assays/A00461
26 https://reframedb.org/assays/A00440
27 Preprint, DOI: 10.21203/rs.3.rs-23951/v1
28 Slater, T. (2014). Recent advances in modeling languages for pathway maps and computable biological networks. Drug discovery today, 19(2), 193-198.
29 Domingo-Fernández, D., Mubeen, S., Marín-Llaó, J., Hoyt, C. T., & Hofmann-Apitius, M. (2019). PathMe: Merging and exploring mechanistic pathway knowledge. BMC bioinformatics, 20(1), 243.
30 Domingo-Fernández, D., Hoyt, C. T., Bobis-Álvarez, C., Marín-Llaó, J., & Hofmann-Apitius, M. (2018). ComPath: an ecosystem for exploring, analyzing, and curating mappings across pathway databases. NPJ systems biology and applications, 4(1), 1-8.
31 Astghik, S. et al., submitted, Bioinformatics Journal (OUP)
32 Chua, R. L., Lukassen, S., Trump, S., Hennig, B. P., Wendisch, D., Pott, F., Debnath, O., Thürmann, L., Kurth, F., Völker, M.T., Kazmierski, J., Timmermann, B., Twardziok, S., Schneider, S., Machleidt, F., Müller-Redetzky, H., Maier, M., Krannich, A., Schmidt, S., Balzer, F., Liebig, J., Loske, J., Suttorp, N., Eils, J., Ishaque, N., Liebert, U.G., von Kalle, C., Witzenrath, M., Goffinet, C., Drosten, C., Laudi, S., Lehmann, I., Conrad, C., Sander, L-E. and Eils, R. (2020). COVID-19 severity correlates with airway epithelium–immune cell interactions identified by single-cell analysis. Nature Biotechnology, 38(8), 970-979.
33 Ravindra, N. G., Alfajaro, M. M., Gasque, V., Habet, V., Wei, J., Filler, R. B., Huston, N. C., Wan, H., Szigeti-Buck, K., Wang, B., Wang, G., Montgomery, R.R., Eisenbarth, S. C., Williams, A., Pyle, A.M., Iwasaki, A., Horvath, T.L., Foxman, E.F., Pierce, R.W., van Dijk, D., and Wilen, C.B. (2020). Single-cell longitudinal analysis of SARS-CoV-2 infection in human bronchial epithelial cells. bioRxiv.
34 Kuleshov MV, Jones MR, Rouillard AD, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90-W97. doi:10.1093/nar/gkw377
35 https://pypi.org/project/gseapy/
36 Benjamini Y. Discovering the false discovery rate: False Discovery Rate. J. R. Stat. Soc. Ser. B Stat. Methodol. 2010;72(4):405–416. doi: 10.1111/j.1467-9868.2010.00746.x.
37 Hoehl, S., Rabenau, H., Berger, A., Kortenbusch, M., Cinatl, J., Bojkova, D., Behrens, P., Böddinghaus, B., Götsch, U., Naujoks, F., Neumann, P., Schork, J., Tiarks-Jungk, P., Walczok, A., Eickmann, M., Vehreschild, M., Kann, G., Wolf, T., Gottschalk, R., & Ciesek, S. (2020). Evidence of SARS-CoV-2 infection in returning travelers from Wuhan, China. New England Journal of Medicine, 382(13), 1278-1280.
38 Ellinger, B., Bojkova, D., Zaliani, A., Cinatl, J., Claussen, C., Westhaus, S., ... & Gribbon, P. (2020). Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection. Manuscript under review.
39 Bobrowski, T., Chen, L., Eastman, R. T., Itkin, Z., Shinn, P., Chen, C., Guo, H., Zheng, W., Michael, S., Simeonov, A., Hall, M., Zakharov, A.V., and Muratov, E.N. (2020). Discovery of Synergistic and Antagonistic Drug Combinations against SARS-CoV-2 In Vitro. BioRxiv.
40 Bobrowski, T., Chen, L., Eastman, R. T., Itkin, Z., Shinn, P., Chen, C., Guo, H., Zheng, W., Michael, S., Simeonov, A., Hall, M., Zakharov, A.V., and Muratov, E.N. (2020). Discovery of Synergistic and Antagonistic Drug Combinations against SARS-CoV-2 In Vitro. BioRxiv.
41 Ellinger, B. et al. (2020). Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection. Preprint. https://doi.org/10.21203/rs.3.rs-23951/v1.
42 Yamamoto, N., Yang, R., Yoshinaka, Y., Amari, S., Nakano, T., Cinatl, J., ... & Tamamura, H. (2004). HIV protease inhibitor nelfinavir inhibits replication of SARS-associated coronavirus. Biochemical and biophysical research communications, 318(3), 719-725.
43 Musarrat, F., Chouljenko, V., Dahal, A., Nabi, R., Chouljenko, T., Jois, S. D., & Kousoulas, K. G. (2020). The anti-HIV Drug Nelfinavir Mesylate (Viracept) is a Potent Inhibitor of Cell Fusion Caused by the SARS-CoV-2 Spike (S) Glycoprotein Warranting further Evaluation as an Antiviral against COVID-19 infections. Journal of medical virology.
[Figure 6, panels a-d: dose-response curves. x-axis: log c (M), from -9 to -4; y-axis: % inhibition, from -50 to 150. Panel titles: Remdesivir DRC in presence of Anisomycin; Remdesivir DRC in presence of Thioguanosine; Remdesivir DRC in presence of Raloxifene; Nelfinavir DRC in presence of Raloxifene. Curve legends: 0.01, 0.03, 0.08, 0.25, 0.74, 2.22, 6.67 and 20 µM for Raloxifene and Anisomycin; 0.01, 0.03, 0.08 and 0.25 µM for Thioguanosine.]
10_1101-2020_09_23_310276 ---- 97992561

Title: NIAGADS Alzheimer's GenomicsDB: A resource for exploring Alzheimer's Disease genetic and genomic knowledge

Authors
Emily Greenfest-Allen (2, 4), Conor Klamann (1, 2, 3), Prabhakaran Gangadharan (1, 2, 3), Amanda Kuzma (1, 2, 3), Yuk Yee Leung (1, 2, 3), Otto Valladares (1, 2, 3), Gerard Schellenberg (1, 2, 3), Christian J. Stoeckert Jr. (1, 2, 4), Li-San Wang (1, 2, 3)

Affiliations
1 Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
3 Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
4 Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

Corresponding Author
Emily Greenfest-Allen, allenem@pennmedicine.upenn.edu
Li-San Wang, lswang@pennmedicine.upenn.edu

bioRxiv preprint, this version posted February 12, 2021; doi: https://doi.org/10.1101/2020.09.23.310276. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

INTRODUCTION: The NIAGADS Alzheimer's Genomics Database (GenomicsDB) is an interactive knowledgebase for Alzheimer's disease (AD) genetics that provides access to GWAS summary statistics datasets deposited at NIAGADS, a national genetics data repository for AD and related dementia (ADRD).

METHODS: The website makes available >70 genome-wide summary statistics datasets from GWAS and genome sequencing analysis for AD/ADRD. Variants identified from these datasets are mapped to up-to-date variant and gene annotations from a variety of resources and linked to functional genomics data. The database is powered by a relational database optimized for big data and uses ontologies to consistently annotate study designs and phenotypes, facilitating data harmonization, efficient real-time data analysis, and variant or gene report generation.

RESULTS: Detailed variant reports provide tabular and interactive graphical summaries of known ADRD associations and highlight variants flagged by the Alzheimer's Disease Sequencing Project (ADSP). Gene reports provide summaries of co-located ADRD risk-associated variants and have been expanded to include meta-analysis results from aggregate association tests performed by the ADSP, allowing us to flag genes with genetic evidence for AD.

DISCUSSION: The GenomicsDB makes available >150 million variant annotations, including ~30 million (5 million novel) variants identified as AD-relevant by ADSP, for browsing and real-time mining via the website. With a newly redesigned, efficient search interface and comprehensive record pages linking summary statistics to variant and gene annotations, this resource makes these data both accessible and interpretable, establishing itself as a valuable tool for AD research.
1 Background

Alzheimer's disease (AD) is a progressive neurodegenerative disorder that affected 5.8 million people in the US in 2018, is effectively untreatable, and invariably progresses to complete incapacitation and death 10 or more years after onset. Early work in the 1990s identified mutations in the amyloid precursor protein (APP) gene and in presenilins 1 and 2 that cause AD, as well as alleles of the apolipoprotein E gene (APOE) that increase (ε4) or decrease (ε2) susceptibility to late-onset Alzheimer's disease (LOAD). Heritability of AD is high, ranging from near 60% to 80% in the best fitting model [1,2]. However, apart from APOE, there is no simple pattern of inheritance for LOAD. Instead, it is likely caused by a complex combination of common, polygenic variants [3] acting together with a small number of rare variants with a large effect [4,5].

Our current understanding of genetic risk for AD has resulted mainly from massive genotyping and sequencing efforts such as the Alzheimer's Disease Genetics Consortium (ADGC), the International Genomics of Alzheimer's Project (IGAP), and the Alzheimer's Disease Sequencing Project (ADSP). Large-scale genome-wide association studies (GWAS) and GWAS-derived meta-analyses have been performed by each of these groups [4–7], the results of which are deposited at the National Institute on Aging (NIA) Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) at the University of Pennsylvania [8]. NIAGADS is an NIA-designated essential national infrastructure, providing a one-stop access portal for Alzheimer's disease 'omics datasets. Qualified investigators can submit data use requests to access protected personal genetic information.
NIAGADS also disseminates unrestricted meta-analysis results and GWAS summary statistics to promote data reuse, allowing researchers to explore known evidence for AD genetic risk. However, substantive bioinformatics expertise and compute power are required to annotate and mine these datasets, which are significant hurdles for many researchers planning to explore this large and ever-increasing volume of data. Assembly of unrestricted genomic knowledge into an integrated, interactive web resource would help overcome this barrier.

Here, we introduce the NIAGADS Alzheimer's Genomics Database (GenomicsDB), which was developed in collaboration with the ADGC and ADSP with this goal in mind. The GenomicsDB is a user-friendly workspace for data sharing, discovery, and analysis designed to facilitate the quest for better understanding of the complex genetic underpinnings of AD neurodegeneration and accelerate the progress of research on AD and AD related dementias (ADRD). It accomplishes this by making summary genetic evidence for AD/ADRD both accessible to and interpretable by molecular biologists, clinicians and bioinformaticians alike, regardless of computational skills.

2 Methods

2.1 Genomics Datasets

2.1.1 NIAGADS GWAS summary statistics

As of December 2020, the NIAGADS GenomicsDB provides unrestricted access to p-values from genome-wide summary statistics for >70 GWAS and ADSP meta-analyses. Summary statistic results are linked to >150 million ADSP-annotated single-nucleotide variants (SNVs) and indels.
GWAS summary statistics datasets deposited at NIAGADS are integrated into the GenomicsDB as they become publicly available via publication or permission of the submitting researchers. These include studies that focus specifically on AD and late-onset AD (LOAD), as well as those on ADRD-related neuropathologies and biomarkers. A full listing of the summary statistics datasets currently available through the NIAGADS GenomicsDB is provided in Supplementary Table S1. Prior to loading into the database, the datasets are annotated (e.g. provenance, phenotypes, study design) and variant representation is normalized to ensure consistency with ADSP analysis pipelines and to facilitate harmonization with third-party annotations.

To ensure the privacy of personal health information, the NIAGADS GenomicsDB website only makes p-values from the summary statistics available for browsing (on dataset, gene, and variant reports and as genome browser tracks) and analysis. Access to the full summary statistics (including genome-wide allele frequencies and effect sizes) and corresponding GWAS or sequencing results is managed via formal data-access requests made to NIAGADS. All datasets included in the GenomicsDB are properly credited to the submitting researchers or sequencing project.

2.1.2 NHGRI-EBI GWAS Catalog

Variants and summary statistics curated in the NHGRI-EBI GWAS Catalog [9] are listed in NIAGADS GenomicsDB variant reports, and a track is available on the genome browser. Variants linked to AD/ADRD are highlighted.

2.1.3 ADSP meta-analysis results

The NIAGADS GenomicsDB has recently expanded its scope to include meta-analysis results offering genetic evidence for gene-level and single-variant risk associations for AD. Currently available are case/control association results recently published by the ADSP [7] and deposited at NIAGADS (Accession No. NG00065).
2.2 Variant annotation

2.2.1 Variant identification
Single nucleotide polymorphisms (SNPs) and short indels are uniquely identified by position and allelic variants. This allows accurate mapping of risk-association statistics to specific mutations and to external variant annotations from resources such as gnomAD (https://gnomad.broadinstitute.org/) [10] and GTEx (https://www.gtexportal.org/home/) [11]. All variants are mapped to dbSNP (https://www.ncbi.nlm.nih.gov/snp/) [12] and linked to refSNP identifiers when possible.

2.2.2 ADSP variant annotations
Annotated variants in the NIAGADS GenomicsDB include the >29 million SNPs and ~50,000 short indels identified during the ADSP Discovery Phase whole-genome (WGS) and whole-exome sequencing (WES) efforts [13]. These variants are highlighted in variant and dataset reports and their quality control status is provided. As part of this sequencing effort, the ADSP developed an annotation pipeline that builds on Ensembl’s VEP software [14] to efficiently integrate standard annotations and rank potential variant impacts according to predicted effect (such as codon changes, loss of function, and potential deleteriousness) [13,15]. Variant tracks annotated by these results are available for both the WES and WGS variants on the GenomicsDB genome browser. The pipeline has been applied to all variants in the GenomicsDB. These annotations can be browsed on variant reports or used to filter search results. User-uploaded lists of variants are automatically annotated in real time.
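Identifying a variant by position and alleles, as described in section 2.2.1, amounts to building a composite key. The sketch below assumes a `chromosome:position:ref:alt` string format and hypothetical helper names; the identifier format actually used by the GenomicsDB is not given in the text. It illustrates why the alleles are part of the key: two variants at the same position receive distinct keys, so a p-value attaches to one specific allele change.

```python
from typing import Dict, Optional


def variant_key(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a composite identifier from position and alleles (assumed format)."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # treat "chr1" and "1" as the same chromosome
    return f"{chrom}:{int(pos)}:{ref.upper()}:{alt.upper()}"


def lookup_annotation(key: str, annotations: Dict[str, str]) -> Optional[str]:
    """Return the external annotation stored under the same key, if any."""
    return annotations.get(key)


# Hypothetical records: two different alleles observed at one position.
p_values = {
    variant_key("chr1", 1000, "A", "G"): 3e-9,
    variant_key("1", 1000, "A", "T"): 0.42,
}
annotations = {variant_key("1", 1000, "a", "g"): "missense (illustrative label)"}

for key, p in p_values.items():
    print(key, p, lookup_annotation(key, annotations))
```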
2.2.3 Allele frequencies
The NIAGADS GenomicsDB includes allele frequency data from 1000 Genomes (phase 3, version 1) (https://www.internationalgenome.org/home) [16], ExAC (http://exac.broadinstitute.org/) [17], and gnomAD [10].

2.2.4 Linkage disequilibrium
Linkage-disequilibrium (LD) structure around annotated variants is estimated using phase 3 version 1 (11 May 2011) of the 1000 Genomes Project [16]. LD estimates were made using PLINK v1.90b2i 64-bit [18]. Only LD-scores meeting a correlation threshold of r² ≥ 0.2 are stored in the database. LocusZoom.js [19,20] is used to render LD-scores in the context of the GWAS summary statistics datasets.

2.3 Gene and transcript annotation

2.3.1 Gene identification
Gene and transcript models are obtained from the GENCODE Release 19 (GRCh37.p13) reference gene annotation [21]. A GRCh38 version of the NIAGADS GenomicsDB is planned for 2021. Standard gene nomenclature is imported from the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute [22] and used to link annotated genes to external resources such as UniProt (https://www.uniprot.org/) [23], the UCSC Genome Browser (http://genome.ucsc.edu) [24], and the Online Mendelian Inheritance in Man (OMIM) database (https://omim.org/) [25,26].

2.3.2 Functional annotation
Annotations of the functions of genes and gene products are taken from packaged releases of the Gene Ontology (GO; http://geneontology.org) and GO-gene associations [27] and are updated regularly.
GO-gene associations are reported in summary tables on gene reports and include details on annotation sources, as well as new information from the GO causal modeling (GO-CAM) framework that allows better understanding of how different gene products work together to effect biological processes [28]. Users can run functional enrichment analysis on gene search results or uploaded gene lists. Gene set enrichment and semantic similarity scores are calculated using the goatools Python library for GO analysis [29].

2.3.3 Pathways
Gene membership in molecular and metabolic pathways is provided from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (https://www.genome.jp/kegg/) [30] and Reactome (https://reactome.org/) [31]. Users can run pathway enrichment analysis on gene search results or uploaded gene lists. Pathway enrichment statistics are calculated using a multiple hypothesis corrected Fisher’s exact test implemented using the SciPy, pandas, and statsmodels Python packages.

2.4 Functional genomics
Hundreds of functional genomics tracks have been integrated into the NIAGADS GenomicsDB and mapped against AD/ADRD-associated variants. These tracks are queried from the NIAGADS Functional genomics repository (FILER), which provides harmonized functional genomics datasets that have been GIGGLE indexed [32] for quick lookups [33]. FILER tracks made available through the GenomicsDB have been pulled from established functional genomics resources, including the Encyclopedia of DNA Elements (ENCODE) [34,35], the Functional Annotation of the Mouse/Mammalian Genome (FANTOM5) enhancer atlas [36], and the NIH Roadmap Epigenomics Mapping Consortium [37]. Genome browser tracks are available for all functional genomics datasets and are organized by data source, biotype (e.g., cell, tissue, or cell line), type of functional annotation (e.g., expressed enhancers, transcription factor binding sites, histone modifications) and platform or assay type to facilitate track selection.
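The pathway enrichment statistic is described above as a multiple hypothesis corrected Fisher's exact test built on SciPy, pandas, and statsmodels. The standard-library sketch below is not that implementation; it only illustrates the calculation. It assumes a one-sided test for over-representation and a Benjamini-Hochberg correction, neither of which is specified in the text, and all function names and counts are hypothetical.

```python
from math import comb
from typing import List


def fisher_enrichment_p(hits: int, list_size: int, pathway_size: int, background_size: int) -> float:
    """One-sided Fisher's exact p-value: P(X >= hits) under the hypergeometric null."""
    max_hits = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    # Sum the probabilities of every table at least as extreme as the observed one.
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 20,000 background genes and a 40-gene query list.
pathways = {"pathway_a": (8, 150), "pathway_b": (1, 300)}  # (hits, pathway size)
raw = {name: fisher_enrichment_p(h, 40, size, 20000) for name, (h, size) in pathways.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
print(adjusted)
```

Exact integer binomials keep the sketch dependency-free; for large gene universes or many pathways, a library routine working in log space would be the more practical choice.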
2.5 Overview of database design
An overview of the NIAGADS GenomicsDB systems architecture is provided in Figure 1. The GenomicsDB is powered by a PostgreSQL relational database system that has been optimized for parallel big data querying, allowing for efficient real-time data mining. Data are organized using the modular Genomics Unified Schema version 4 (GUS4), designed for scalable integration and dissemination of large-scale ′omics datasets. Loading of all data is managed by the GUS4 application layer (https://github.com/VEuPathDB/GusAppFramework), which ensures the accuracy of data integration.

2.6 Overview of website design and organization
The NIAGADS GenomicsDB is powered by an open-source database system and web-development kit (WDK; https://github.com/VEuPathDB/WDK) developed and successfully deployed by the Eukaryotic Pathogen, Vector and Host Informatics (VEuPathDB) Bioinformatics Resource Center [38,39]. The VEuPathDB WDK provides a query engine that ties the database system to the website via an easily extensible XML data model. The data model is used to automatically generate and organize searches, search results, and reports, with concepts and data organized by topics from the EMBRACE Data And Methods (EDAM) ontology, which defines a comprehensive set of concepts that are prevalent within bioinformatics [40]. This facilitates updates of third-party data and rapid integration of new datasets as they become publicly available.
The WDK also provides a framework for lightweight Java/Jersey representational state transfer (REST) services for data querying. This allows search results and reports to be returned in multiple file formats (e.g., delimited-text, XML, and JSON) in addition to browsable, interactive web pages. This new feature of GenomicsDB has enabled the inclusion of sophisticated visualizations for summarizing search results and annotations in gene and variant reports. API development is ongoing, with plans to develop a flexible API that allows researchers to integrate GenomicsDB datasets and annotations into analysis pipelines. The GenomicsDB uses a combination of an in-house JavaScript genomics visualization toolkit and established third-party visualization tools, including the HighCharts.js (https://www.highcharts.com/) charting library for rendering scatter, pie, and bar charts, ideogram.js (https://github.com/eweitz/ideogram) for chromosome visualization, LocusZoom.js for rendering LD structure in the context of NIAGADS GWAS summary statistics datasets, and an IGV.js powered genome browser [41]. All code used to generate the WDK website, including the JavaScript genomics visualizations, is available on GitHub (https://github.com/NIAGADS).

2.7 Overview of the NIAGADS genome browser
The NIAGADS genome browser enables researchers to visually inspect and browse GWAS summary statistics datasets in a genomic context. The genome browser allows users to compare NIAGADS GWAS summary statistics tracks to each other, against annotated gene or variant tracks, or to the functional genomics tracks from the NIAGADS FILER functional genomics repository. This tool is powered by IGV.js, with track data queried in real time by NIAGADS GenomicsDB REST services. The browser also provides a track selection tool that allows users to easily find tracks of interest by keyword search, data source, biotype (e.g., cell, tissue, or cell line) or type of functional annotation (Fig. 2).

3 Results
The NIAGADS Alzheimer’s GenomicsDB creates a public forum for sharing, discovery, and analysis of genetic evidence for Alzheimer’s disease that is made accessible via an interface designed for easy mastery by biological researchers, regardless of background. The GenomicsDB provides four main routes for data exploration and mining. First, detailed reports compile all available data concerning summary statistics datasets and genetic evidence linking AD/ADRD to genes and variants. Second, datasets can be mined in real time to isolate a refined set of variants that share biological characteristics of interest. Third, visualization tools such as LocusZoom.js and the NIAGADS Genome Browser offer the ability to quickly view and draw conclusions from comparisons of summary statistics or ADSP annotated variants to different types of sequence data in a genomic area of interest. Fourth, and finally, tools such as enrichment analyses offer opportunities for users to link variants to biological processes via impacted genes.

3.1 Finding variants, genes, and datasets
The GenomicsDB homepage and navigation menu contain a site search allowing users to quickly find variants, genes, and datasets of interest by identifier or keyword. This search is paired with interactive graphics found throughout the site that provide shortcuts to resources and annotations of interest to the AD/ADRD research community (Fig. 3A, B).
The GenomicsDB also provides a dataset browser that allows users to search for GWAS summary statistics datasets by AD/ADRD phenotype, population, genotype, attribution, and sequencing center.

3.2 Browsing and mining NIAGADS GWAS summary statistics
A detailed report is provided for each of the GWAS summary statistics and ADSP meta-analysis datasets in the NIAGADS GenomicsDB (Fig. 4A). These reports allow users to browse the genetic variants with genome-wide significance in the dataset (p-value ≤ 5 × 10⁻⁸, to account for false positives due to testing associations of millions of variants simultaneously) via tables and interactive plots that provide an overview of the distribution and potential functional or regulatory impacts of the top variants (and proximal gene-loci) across the genome. All genes and variants listed in a dataset report are linked to reports in the GenomicsDB that provide detailed information about genetic evidence for AD for the sequence feature (see next sections). Dataset reports also provide quick links back to their parent accession in the NIAGADS repository where users can download the complete p-values or make formal data access requests for the full summary statistics, related GWAS, expression, or sequencing data associated with the accession. The reports also provide an inline search allowing users to mine the summary statistics in real time via the website, setting their own p-value cut-off (see section 3.5 for more information).

3.3 Detailed variant reports
Variant reports include a basic summary about the variant (alleles, variant type, flanking sequence, genomic location) and a graphical overview of NIAGADS GWAS summary statistics datasets in which the variant has genome-wide significance (Fig. 5A). All other information in the report is subdivided into multiple sections that can be expanded or hidden at the user’s discretion. These sections include sub-reports on genetic variation (e.g., allele population frequencies and LD), function prediction determined via the ADSP annotation pipeline (including transcript and regulatory consequences), and comprehensive listings of GWAS inferred disease or trait associations from both NIAGADS summary statistics and the NHGRI-EBI GWAS Catalog. Tables listing summary statistics results can be dynamically filtered by p-value, dataset, phenotypes, or covariates, and the filtered results are downloadable. Links to the source datasets for each reported statistic are also provided, leading to detailed dataset reports (e.g., NIAGADS GWAS summary statistics) or to the source publication (e.g., curated variant catalogs). These tables are paired with browsable LocusZoom.js views of the LD structure surrounding the variant in the context of selected GWAS summary statistics datasets. Links to the NIAGADS Alzheimer’s Disease Variant Portal (ADVP) and external resources for additional information (e.g., dbSNP, ClinVar) are also provided.

3.4 Detailed gene reports
Like the variant reports, gene reports provide basic summary information about the gene (nomenclature, gene type, genomic span) and a graphical overview of NIAGADS GWAS summary statistics-linked variants proximal to or within the footprint of the gene (Fig. 5B). Two types of gene-linked genetic evidence for AD are provided in the GenomicsDB gene reports. First, we have surveyed the top risk-associated variants from the NIAGADS GWAS summary statistics datasets and provide a comprehensive listing of and links to those contained within ±100 kb of each gene (Fig. 5C).
Second, we report meta-analysis results from gene-based rare variant aggregation tests performed as part of the ADSP discovery phase case/control analysis [42]. Genes found to have a significant p-value in these results are flagged as being associated with genetic evidence for AD. Also provided on the gene report are sections reporting function prediction (Gene Ontology associations and evidence) and pathway membership (KEGG and Reactome). Tables reporting these results or annotations can be dynamically filtered or downloaded. Links to the NIAGADS ADVP and to external resources (e.g., UniProtKB, OMIM, and ExAC) are also provided.

3.5 Workspaces
The GenomicsDB provides an interactive workspace for exploring a dataset in more depth. As an example, dataset reports provide an inline search allowing users to mine the summary statistics. Variants meeting the search criterion are reported in an interactive workspace that includes both tabular and graphical summaries. Users are initially presented with a table that can be sorted or filtered by annotations (e.g., variant type, predicted effect, deleteriousness) (Fig. 4B). A per-chromosome genome view is also available allowing users to explore an interactive ideogram depicting the distribution of variants meeting the search and filter criteria across the genome and allowing inspection of LD structure among proximal variants (Fig. 4C). Tables of results can be downloaded or requested via the API for programmatic processing. Registered users also have the option to save and share search results both privately and publicly; publicly shared search results are assigned a stable URL that can be referenced in publications.
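Two numeric rules recur in the reports described in sections 3.2 to 3.5: the genome-wide significance cutoff (p ≤ 5 × 10⁻⁸) and the ±100 kb window used to link top variants to a gene. The sketch below applies both to in-memory records. The record layout, function names, and coordinates are hypothetical, and the website performs this kind of filtering in its database rather than in Python.

```python
from typing import Dict, List

GENOME_WIDE_P = 5e-8  # significance cutoff quoted in section 3.2
FLANK_BP = 100_000    # +/-100 kb gene window quoted in section 3.4


def is_significant(p_value: float, cutoff: float = GENOME_WIDE_P) -> bool:
    """True when the p-value meets the (user-adjustable) cutoff."""
    return p_value <= cutoff


def near_gene(variant: Dict, gene: Dict, flank: int = FLANK_BP) -> bool:
    """True when the variant lies within the gene span extended by `flank` bases."""
    if variant["chrom"] != gene["chrom"]:
        return False
    return gene["start"] - flank <= variant["pos"] <= gene["end"] + flank


def top_variants_for_gene(variants: List[Dict], gene: Dict,
                          cutoff: float = GENOME_WIDE_P, flank: int = FLANK_BP) -> List[Dict]:
    """Significant variants in the gene window, most significant first."""
    hits = [v for v in variants if is_significant(v["p"], cutoff) and near_gene(v, gene, flank)]
    return sorted(hits, key=lambda v: v["p"])


# Hypothetical gene and variants, for illustration only.
gene = {"chrom": "11", "start": 1_000_000, "end": 1_050_000}
variants = [
    {"id": "v1", "chrom": "11", "pos": 950_000, "p": 1e-9},     # inside the window
    {"id": "v2", "chrom": "11", "pos": 1_020_000, "p": 1e-3},   # not significant
    {"id": "v3", "chrom": "11", "pos": 1_200_000, "p": 1e-12},  # outside the window
    {"id": "v4", "chrom": "12", "pos": 1_000_500, "p": 1e-10},  # different chromosome
    {"id": "v5", "chrom": "11", "pos": 1_150_000, "p": 5e-8},   # on both boundaries
]
result = top_variants_for_gene(variants, gene)
print([v["id"] for v in result])
```

Passing a different `cutoff` mirrors the inline search, where users set their own p-value threshold.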
3.6 Genome Browser
The NIAGADS genome browser can be used to visually inspect any of the NIAGADS GWAS summary statistics datasets in a broader genomic context and compare against annotated ADSP variant tracks or other ′omics tracks in the GenomicsDB or FILER (see section 2.7, Fig. 2B).

4 Discussion
The NIAGADS Alzheimer’s Genomics Database is a user-friendly platform for interactive browsing and real-time in-depth mining of published genetic evidence and genetic risk-factors for AD. It provides open, real-time access to summary statistics datasets from genome-wide association analysis (GWAS) of Alzheimer’s disease and related neuropathologies. Flexible search options allow users to easily retrieve AD risk-associated variants, conditioned on phenotypes such as ethnicity and age of onset. Users can compare the NIAGADS datasets against personal gene or variant lists. Every entry in the GenomicsDB has been linked with relevant external resources and functional genomics annotations to supply further information and assist researchers in interpreting the potential functional or regulatory role of risk-associated variants and susceptibility loci. The GenomicsDB is updated periodically with enhanced features and new datasets and annotations when they are reported. The AD research community is actively encouraged through outreach and collaboration to submit data to NIAGADS to keep this public platform updated and timely. The GenomicsDB is integrated with other resources available at NIAGADS. Users can follow links back to the NIAGADS repository to view comprehensive details about all GWAS summary statistics datasets from a NIAGADS accession or request access to the primary data. The REST services used to query the database and generate data or feature reports provide the foundation of an API that allows programmatic access to the database, which we plan to integrate with cloud-based NIAGADS analysis pipelines.
The GenomicsDB is regularly updated to keep up with advances in Alzheimer’s disease genomics research. New AD-related GWAS summary statistics datasets and meta-analysis results from the ADSP are added as they become available. Reference databases are updated yearly. All genomics data in the current version of the GenomicsDB are aligned and mapped to the GRCh37.p13 genome build. A GRCh38 version of the database is planned for release in early 2021, which will include variants from the ongoing ADSP sequencing effort, including 20K WES in 2020 and 17K WGS in 2021. GenomicsDB is a potent platform for the AD genetics community to host comprehensive AD genetic and genomic findings. It uses the latest web and database technologies to allow integration with new tools, and NIAGADS is constantly improving. As more data and tools become available, the NIAGADS Alzheimer’s Genomics Database will become a central hub for AD/ADRD research and data analysis.

5 Conflicts of Interest
The authors have no financial interests to disclose.

6 Acknowledgements and Funding Information
This work is supported by the NIH National Institute on Aging (grant number U24-AG041689). The ADSP Discovery Phase analysis of sequence data is supported through UF1AG047133 (to Drs. Schellenberg, Farrer, Pericak-Vance, Mayeux, and Haines); U01AG049505 to Dr. Seshadri; U01AG049506 to Dr. Boerwinkle; U01AG049507 to Dr. Wijsman; and U01AG049508 to Dr. Goate. Additional funding and acknowledgement statements for the ADSP can be found in the supplement.

7 References
[1] Gatz M, Reynolds CA, Fratiglioni L, Johansson B, Mortimer JA, Berg S, et al. Role of genes and environments for explaining Alzheimer disease.
Arch Gen Psychiatry 2006;63:168–74. https://doi.org/10.1001/archpsyc.63.2.168.
[2] Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet 2019;51:404–13. https://doi.org/10.1038/s41588-018-0311-9.
[3] Hollingworth P, Harold D, Sims R, Gerrish A, Lambert J-C, Carrasquillo MM, et al. Common variants in ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer’s disease. Nat Genet 2011;43:429–35. https://doi.org/10.1038/ng.803.
[4] Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat Genet 2013;45:1452–8. https://doi.org/10.1038/ng.2802.
[5] Kunkle BW, Grenier-Boley B, Sims R, Bis JC, Damotte V, Naj AC, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat Genet 2019;51:414–30. https://doi.org/10.1038/s41588-019-0358-2.
[6] Naj AC, Jun G, Beecham GW, Wang L-S, Vardarajan BN, Buros J, et al. Common variants at MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset Alzheimer’s disease. Nat Genet 2011;43:436–41. https://doi.org/10.1038/ng.801.
[7] Bis JC, Jian X, Kunkle BW, Chen Y, Hamilton-Nelson KL, Bush WS, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Mol Psychiatry 2018:1–17. https://doi.org/10.1038/s41380-018-0112-7.
[8] Kuzma A, Valladares O, Cweibel R, Greenfest-Allen E, Childress DM, Malamon J, et al. NIAGADS: The NIA Genetics of Alzheimer’s Disease Data Storage Site. Alzheimer’s & Dementia 2016;12:1200–3. https://doi.org/10.1016/j.jalz.2016.08.018.
[9] Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 2019;47:D1005–12. https://doi.org/10.1093/nar/gky1120.
[10] Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. BioRxiv 2019:531210. https://doi.org/10.1101/531210.
[11] Gamazon ER, Segrè AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat Genet 2018;50:956–67. https://doi.org/10.1038/s41588-018-0154-4.
[12] Sherry ST, Ward M-H, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.
[13] Butkiewicz M, Blue EE, Leung YY, Jian X, Marcora E, Renton AE, et al. Functional annotation of genomic variants in studies of late-onset Alzheimer’s disease. Bioinformatics 2018;34:2724–31. https://doi.org/10.1093/bioinformatics/bty177.
[14] McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol 2016;17. https://doi.org/10.1186/s13059-016-0974-4.
[15] Wheeler NR, Benchek P, Kunkle BW, Hamilton-Nelson KL, Warfe M, Fondran JR, et al. Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies. Pac Symp Biocomput 2020;25:523–34.
[16] Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human genetic variation. Nature 2015;526:68–74. https://doi.org/10.1038/nature15393.
[17] Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536:285–91. https://doi.org/10.1038/nature19057.
[18] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81:559–75. https://doi.org/10.1086/519795.
[19] Pruim RJ, Welch RP, Sanna S, Teslovich TM, Chines PS, Gliedt TP, et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 2010;26:2336–7. https://doi.org/10.1093/bioinformatics/btq419.
[20] Clark CP, Flickinger M, Welch R, VandeHaar P, Taliun D, Boehnke M, et al. LocusZoom.js: Web-based plugin for interactive analysis of genome and phenome wide association studies. Presented at the 66th Annual Meeting of The American Society of Human Genetics, Vancouver: 2016, p. 189T.
[21] Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 2019;47:D766–73. https://doi.org/10.1093/nar/gky955.
[22] Braschi B, Denny P, Gray K, Jones T, Seal R, Tweedie S, et al. Genenames.org: the HGNC and VGNC resources in 2019. Nucleic Acids Res 2019;47:D786–92. https://doi.org/10.1093/nar/gky930.
[23] UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47:D506–15. https://doi.org/10.1093/nar/gky1049.
[24] Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Res 2002;12:996–1006. https://doi.org/10.1101/gr.229102.
[25] Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res 2015;43:D789–98. https://doi.org/10.1093/nar/gku1205.
[26] Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res 2019;47:D1038–43. https://doi.org/10.1093/nar/gky1151.
[27] The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res 2019;47:D330–8. https://doi.org/10.1093/nar/gky1055.
[28] Thomas PD, Hill DP, Mi H, Osumi-Sutherland D, Auken KV, Carbon S, et al. Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nat Genet 2019;51:1429–33. https://doi.org/10.1038/s41588-019-0500-1.
[29] Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, et al. GOATOOLS: A Python library for Gene Ontology analyses. Sci Rep 2018;8:1–17. https://doi.org/10.1038/s41598-018-28948-z.
[30] Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27–30. https://doi.org/10.1093/nar/28.1.27.
[31] Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, et al. The reactome pathway knowledgebase. Nucleic Acids Res 2020;48:D498–503. https://doi.org/10.1093/nar/gkz1031.
[32] Layer RM, Pedersen BS, DiSera T, Marth GT, Gertz J, Quinlan AR. GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 2018;15:123–6. https://doi.org/10.1038/nmeth.4556.
[33] Kuksa PP, Gangadharan P, Katanic Z, Kleidermacher L, Amlie-Wolf A, Lee C-Y, et al. FILER: large-scale, harmonized FunctIonaL gEnomics Repository. BioRxiv 2021:2021.01.22.427681.
https://doi.org/10.1101/2021.01.22.427681.
[34] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74. https://doi.org/10.1038/nature11247.
[35] Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 2018;46:D794–801. https://doi.org/10.1093/nar/gkx1081.
[36] Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature 2014;507:455–61. https://doi.org/10.1038/nature12787.
[37] Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature 2015;518:317–30. https://doi.org/10.1038/nature14248.
[38] Fischer S, Aurrecoechea C, Brunk BP, Gao X, Harb OS, Kraemer ET, et al. The strategies WDK: a graphical search interface and web development kit for functional genomics databases. Database (Oxford) 2011;2011. https://doi.org/10.1093/database/bar027.
[39] Aurrecoechea C, Barreto A, Basenko EY, Brestelli J, Brunk BP, Cade S, et al. EuPathDB: the eukaryotic pathogen genomics database resource. Nucleic Acids Res 2017;45:D581–91. https://doi.org/10.1093/nar/gkw1105.
[40] Ison J, Kalas M, Jonassen I, Bolser D, Uludag M, McWilliam H, et al. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics 2013;29:1325–32. https://doi.org/10.1093/bioinformatics/btt113.
[41] Robinson JT, Thorvaldsdóttir H, Turner D, Mesirov JP. igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV).
BioRxiv 2020:2020.05.03.075499. https://doi.org/10.1101/2020.05.03.075499.
[42] Bis JC, Jian X, Kunkle BW, Chen Y, Hamilton-Nelson KL, Bush WS, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Mol Psychiatry 2018. https://doi.org/10.1038/s41380-018-0112-7.

[Figure 1: NIAGADS GenomicsDB systems architecture. Data sources: GWAS summary statistics, ADSP meta-analysis results, variant annotations, gene annotations, and FILER functional genomics. GUS API: provides transaction management and ensures data harmonization and referential integrity. GUS Database: modular, scalable and big-data optimized for quick lookups and real-time analysis. GenomicsDB Website: scalable RESTful services and graphical front-end for interactively browsing detailed feature reports and real-time mining of datasets; {JSON} programmatic access for integration with analysis pipelines; interactive browsing or mining of data and annotations using popular web browsers; links back to the NIAGADS repository to learn more about accessions and make formal data-access requests.]

[Figure 2: NIAGADS genome browser, panels A and B. Region shown: 59,600 kb to 60,600 kb. Tracks: Ensembl Genes; ADSP Single-Variant Risk Association: European (Model 2) (Bis et al. 2018); ADSP Variants (WES); IGAP: Stage 1 (Kunkle et al. 2019); IGAP APOE-Stratified Analysis: APOEε4 Non-Carriers (Jun et al. 2016); IGAP APOE-Stratified Analysis: APOEε4 Carriers (Jun et al. 2016); Roadmap Enh: NH-A Astrocytes. Color scale: -log10 p, from <1 to >15. Genes labeled: MS4A4E, MS4A6A, MS4A2, STX3, MS4A4A, MS4A6E, MS4A5, MS4A12, MS4A8, MS4A18, MS4A15, ZP1, LINC00301, MS4A3, TCN1, GIF.]

[Figure 3: panels A and B.]

[Figure 4: panels A, B, and C. Legend: Variant; Span containing multiple variants.]

[Figure 5: panels A, B, and C.]

10_1101-2020_10_08_327718 ---- 1 Creating Clear and Informative Image-based Figures for Scientific Publications
Helena Jambor1*, Alberto Antonietti2*, Bradly Alicea3, Tracy L. Audisio4, Susann Auer5, Vivek Bhardwaj6,7, Steven J. Burgess8, Iuliia Ferling9, Małgorzata Anna Gazda10,11, Luke H. Hoeppner12, Vinodh Ilangovan13, Hung Lo14,15, Mischa Olson16, Salem Yousef Mohamed17, Sarvenaz Sarabipour18, Aalok Varma19, Kaivalya Walavalkar19, Erin M. Wissink20, Tracey L.
Weissgerber21
*Co-first authors
1 Mildred-Scheel Early Career Center, Medical Faculty, Technische Universität Dresden, Germany
2 Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy; Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy
3 Orthogonal Research and Education Laboratory, Champaign, Illinois, United States
4 Evolutionary Genomics Unit, Okinawa Institute of Science and Technology, Okinawa, Japan
5 Department of Plant Physiology, Faculty of Biology, Technische Universität Dresden, Dresden, Germany
6 Max Planck Institute of Immunology and Epigenetics, Freiburg, Germany
7 Hubrecht Institute, Utrecht, the Netherlands
8 Carl R Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States
9 Junior Research Group Evolution of Microbial Interactions, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute (HKI), Jena, Germany
10 CIBIO/InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, Universidade do Porto, 4485-661 Vairão, Portugal
11 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
12 The Hormel Institute, University of Minnesota, Austin, MN, USA; The Masonic Cancer Center, University of Minnesota, Minneapolis, MN, United States
13 Aarhus University, Denmark
14 Neuroscience Research Center, Charité - Universitätsmedizin Berlin, Corporate member of Freie Universität Berlin, Humboldt - Universität zu Berlin, and Berlin Institute of Health, 10117 Berlin, Germany
15 Einstein Center for Neurosciences Berlin, 10117 Berlin, Germany
16 Section of Plant Biology, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States
17 Gastroenterology and Hepatology Unit, Internal Medicine Department, Faculty of Medicine, University of Zagazig, Egypt
18 Institute for Computational Medicine and the Department of Biomedical
Engineering, Johns Hopkins University, United States
19 National Centre for Biological Sciences (NCBS), Tata Institute of Fundamental Research (TIFR), Bangalore, Karnataka, India
20 Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, United States
21 QUEST – Quality | Ethics | Open Science | Translation, Charité - Universitätsmedizin Berlin, Berlin Institute of Health (BIH), Germany
bioRxiv preprint doi: https://doi.org/10.1101/2020.10.08.327718; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
Address for correspondence: Tracey Weissgerber, tracey.weissgerber@charite.de, QUEST – Quality | Ethics | Open Science | Translation, Charité – Universitätsmedizin Berlin, Berlin Institute of Health, Berlin, Germany
Abstract
Scientists routinely use images to display data. Readers often examine figures first; therefore, it is important that figures are accessible to a broad audience. Many resources discuss fraudulent image manipulation and technical specifications for image acquisition; however, data on the legibility and interpretability of images are scarce. We systematically examined these factors in non-blot images published in the top 15 journals in three fields: plant sciences, cell biology and physiology (n=580 papers). Common problems included missing scale bars, misplaced or poorly marked insets, images or labels that were not accessible to colorblind readers, and insufficient explanations of colors, labels, annotations, or the species and tissue or object depicted in the image. Papers that met all good practice criteria examined for all image-based figures were uncommon (physiology 16%, cell biology 12%, plant sciences 2%).
We present detailed descriptions and visual examples to help scientists avoid common pitfalls when publishing images. Our recommendations address image magnification, scale information, insets, annotation, and color and may encourage discussion about quality standards for bioimage publishing.
Keywords: microscopy; imaging; images; photographs; colorblind; transparency; good bioimaging practices
Introduction
Images are often used to share scientific data, providing the visual evidence needed to turn concepts and hypotheses into observable findings. An analysis of 8 million images from more than 650,000 papers deposited in PubMed Central revealed that 22.7% of figures were “photographs”, a category that included microscope images, diagnostic images, radiology images and fluorescence images.1 Cell biology was one of the most visually intensive fields, with publications containing an average of approximately 0.8 photographs per page.1 Plant sciences papers included approximately 0.5 photographs per page.1 While there are many resources on fraudulent image manipulation and technical requirements for image acquisition and publishing,2-4 data examining the quality of reporting and ease of interpretation for image-based figures are scarce.
Recent evidence suggests that important methodological details about image acquisition are often missing.5 Researchers generally receive little or no training in designing figures; yet many scientists and editors report that figures and tables are one of the first elements that they examine when reading a paper.6, 7 When scientists and journals share papers on social media, posts often include figures to attract interest. The PubMed search engine caters to scientists’ desire to see the data by presenting thumbnail images of all figures in the paper just below the abstract.8 Readers can click on each image to examine the figure, without ever accessing the paper or seeing the introduction or methods. EMBO’s Source Data tool (RRID:SCR_015018) allows scientists and publishers to share or explore figures, as well as the underlying data, in a findable and machine readable fashion.9 Image-based figures in publications are generally intended for a wide audience. This may include scientists in the same or related fields, editors, patients, educators and grants officers. General recommendations emphasize that authors should design figures for their audience rather than themselves, and that figures should be self-explanatory.7 Despite this, figures in papers outside one’s immediate area of expertise are often difficult to interpret, marking a missed opportunity to make the research accessible to a wide audience. Stringent quality standards would also make image data more reproducible. A recent study of fMRI image data, for example, revealed that incomplete documentation and presentation of brain images led to non-reproducible results.10, 11 Here, we examined the quality of reporting and accessibility of image-based figures among papers published in top journals in plant sciences, cell biology and physiology. 
Factors assessed include the use of scale bars, explanations of symbols and labels, clear and accurate inset markings, and transparent reporting of the object or species and tissue shown in the figure. We also examined whether images and labels were accessible to readers with the most common form of color blindness.12 Based on our results, we provide targeted recommendations about how scientists can create informative image-based figures that are accessible to a broad audience. These recommendations may also be used to establish quality standards for images deposited in emerging image data repositories.
Results
Using a science of science approach to investigate current practices: This study was conducted as part of a participant-guided learn-by-doing course, in which eLife Community Ambassadors from around the world worked together to design, complete, and publish a meta-research study.13 Participants in the 2018 Ambassadors program designed the study, developed screening and abstraction protocols, and screened papers to identify eligible articles (HJ, BA, SJB, VB, LHH, VI, SS, EMW). Participants in the 2019 Ambassadors program refined the data abstraction protocol, completed data abstraction and analysis, and prepared the figures and manuscript (AA, SA, TLA, IF, MAG, HL, SYM, MO, AV, KW, HJ, TLW).
To investigate current practices in image publishing, we selected three diverse fields of biology to increase generalizability.
For each field, we examined papers published in April 2018 in the top 15 journals that publish original research (Table S1, Table S2, Table S3). All full-length original research articles that contained at least one photograph, microscope image, electron microscope image, or clinical image (MRI, ultrasound, X-ray, etc.) were included in the analysis (Figure S1). Blots and computer-generated images were excluded, as some of the criteria assessed do not apply to these types of images. Two independent reviewers assessed each paper according to the detailed data abstraction protocol (see methods and information deposited on the Open Science Framework (RRID:SCR_017419) at https://osf.io/b5296/).14 The repository also includes data, code and figures.
Image analysis: First, we confirmed that images are common in the three biology subfields analyzed. More than half of the original research articles in the sample contained images (plant science: 68%, cell biology: 72%, physiology: 55%). Among the 580 papers that included images, microscope images were very common in all three fields (61 to 88%, Figure 1A). Photographs were very common in plant sciences (86%), but less widespread in cell biology (38%) and physiology (17%). Electron microscope images were less common in all three fields (11 to 19%). Clinical images, such as X-rays, MRI or ultrasound, and other types of images were rare (2 to 9%).
Scale information is essential to interpret biological images. Approximately half of papers in physiology (49%) and cell biology (55%), and 28% of plant science papers provided scale bars with dimensions (in the figure or legend) for all images in the paper (Figure 1B, Table S4). Approximately one-third of papers in each field contained incomplete scale information, such as reporting magnification or presenting scale information for a subset of images.
Twenty-four percent of physiology papers, 10% of cell biology papers, and 29% of plant sciences papers contained no scale information on any image.
Some publications use insets to show the same image at two different scales (cell biology papers: 40%, physiology: 17%, plant sciences: 12%). In this case, the authors should indicate the position of the high-magnification inset in the low-magnification image. The majority of papers in all three fields clearly and accurately marked the location of all insets (53 to 70%, Figure 1C left panel); however, approximately one-fifth of papers appeared to have marked the location of at least one inset incorrectly (17 to 22%). Clearly visible inset markings were missing for some or all insets in 13 to 28% of papers (Figure 1C left panel). Approximately half of papers (43 to 53%, Figure 1C right panel) provided legend explanations or markings on the figure to clearly show that an inset was used, whereas this information was missing for some or all insets in the remaining papers.
Figure 1: Image types and reporting of scale information and insets
A: Microscope images and photographs were common, whereas other types of images were used less frequently.
B: Complete scale information was missing in more than half of the papers examined. Partial scale information indicates that scale information was presented in some figures, but not others, or that the authors reported magnification rather than including scale bars on the image.
C: Problems with labeling and describing insets are common.
Totals may not be exactly 100% due to rounding.
Many images contain information in color. We sought to determine whether color images were accessible to readers with deuteranopia, the most common form of color blindness, by using the color blindness simulator Color Oracle (https://colororacle.org/, RRID:SCR_018400). We evaluated only images in which the authors selected the image colors (e.g. fluorescence microscopy). Papers without any colorblind accessible figures were uncommon (3 to 6%); however, 45% of cell biology papers and 21-24% of physiology and plant science papers contained some images that were inaccessible to readers with deuteranopia (Figure 2A). 17 to 34% of papers contained color annotations that were not visible to someone with deuteranopia.
Figure legends and, less often, titles typically provide essential information needed to interpret an image. This text provides information on the specimen and details of the image, while also explaining labels and annotations used to highlight structures or colors. 57% of physiology papers, 48% of cell biology papers and 20% of plant papers described the species and tissue or object shown completely. 5-17% of papers did not provide any such information (Figure 2B). Approximately half of the papers (47-58%, Figure 1C, right panel) also failed or partially failed to adequately explain that insets were used. Annotations of structures were explained better.
Two-thirds of papers across all three fields clearly stated the meaning of all image labels, while 18 to 24% of papers provided partial explanations. Most papers (73 to 83%) completely explained the image colors by stating what substance each color represented or naming the dyes or staining technique used.
Finally, we examined the number of papers that used optimal image presentation practices for all criteria assessed in the study. Twenty-eight (16%) physiology papers, 19 (12%) cell biology papers and 6 (2%) plant sciences papers met all criteria for all image-based figures in the paper (data not shown in figure). In plant sciences and physiology, the most common problems were with scale bars, insets and specifying in the legend the species and tissue or object shown. In cell biology, the most common problems were with insets, colorblind accessibility, and specifying in the legend the species and tissue or object shown.
Figure 2: Use of color and annotations in image-based figures
A: While many authors are using colors and labels that are visible to colorblind readers, the data show that improvement is needed.
B: Most papers explain colors in image-based figures; however, explanations are less common for the species and tissue or object shown, and labels and annotations. Totals may not be exactly 100% due to rounding.
Designing image-based figures: How can we improve?
Our examination of 580 papers from three fields provides unique insights into the quality of reporting and the accessibility of image-based figures. Our quantitative description of standard practices in image publication highlights opportunities to improve transparency and accessibility to readers from different backgrounds. We have therefore outlined specific actions that scientists can take when creating images, designing multipanel figures, annotating figures and preparing figure legends.
Throughout the paper, we provide visual examples to illustrate each stage of the figure preparation process. Other elements are often omitted to focus readers’ attention on the step illustrated in the figure. For example, a figure that highlights best practices for displaying scale bars may not include annotations designed to explain key features of the image. When preparing image-based figures in scientific publications, readers should address all relevant steps in each figure. All steps described below (image cropping and insets, adding scale bars and annotation, choosing color channel appearances, figure panel layout) can be implemented with standard image processing software such as FIJI15 (RRID:SCR_002285) and ImageJ216 (RRID:SCR_003070), which are open source, free programs for bio-image analysis. A quick guide on how to do basic image processing for publications with FIJI is available in a recent cheat sheet publication17 and a discussion forum and wiki are available for FIJI and ImageJ (https://imagej.net/).
1. Choose a scale or magnification that fits your research question
Scientists should select an image scale or magnification that allows readers to clearly see features needed to answer the research question. Figure 3A shows Drosophila melanogaster at three different microscopic scales. The first focuses on the ovary tissue and might be used to illustrate the appearance of the tissue or show stages of development. The second focuses on a group of cells. In this example, the “egg chamber” cells show different nucleic acid distributions. The third example focuses on subcellular details in one cell, for example, to show finer detail of RNA granules or organelle shape.
Figure 3: Selecting magnification and using insets
A. Magnification and display detail of images should permit readers to see features related to the main message that the image is intended to convey. This may be the organism, tissue, cell, or a subcellular level. Microscope images18 show D. melanogaster ovary (A1), ovarian egg chamber cells (A2), and a detail in egg chamber cell nuclei (A3).
B. Insets or zoomed-in areas are useful when two different scales are needed to allow readers to see essential features. It is critical to indicate the origin of the inset in the full-scale image. Poor and clear examples are shown. Example images were created based on problems observed by reviewers.
Images show B1, B2, B3, B5: Protostelium aurantium amoeba fed on germlings of Aspergillus fumigatus D141-GFP (green) fungal hyphae, dead fungal material stained with propidium iodide (red), and acidic compartments of amoeba marked with LysoTracker Blue DND-22 dye (blue); B4: Lendrum-stained human lung tissue (Haraszti, Public Health Image Library); B6: fossilized Orobates pabsti.19
When both low and high magnifications are necessary for one image, insets are used to show a small portion of the image at higher magnification (Figure 3B). The inset location must be accurately marked in the low magnification image. We observed that the inset position in the low magnification image was missing, unclear, or incorrectly placed in approximately one third of papers. Inset positions should be clearly marked by lines or regions-of-interest in a high-contrast color, usually black or white. Insets may also be explained in the figure legend. Care must be taken when preparing figures outside vector graphics suites, as inset positions may move during file saving or export.
2. Include a clearly labeled scale bar
Scale information allows audiences to quickly understand the size of features shown in images. This is especially important for microscopic images where we have no intuitive understanding of scale. Scale information for photographs should be considered when capturing images as rulers are often placed into the frame.
Our analysis revealed that 10-29% of papers screened failed to provide any scale information and another third only provided incomplete scale information (Figure 1B). Scientists should consider the following points when displaying scale bars:
• Every image type needs a scale bar: Authors usually add scale bars to microscope images, but often leave them out in photos and clinical images, possibly because these depict familiar objects such as a human or plant. Missing scale bars, however, adversely affect reproducibility. A size difference of 20% between a published study and the reader’s lab animals, for example, could impact study results by leading to an important difference in phenotype. Providing scale bars allows scientists to detect such discrepancies and may affect their interpretation of published work. Scale bars may not be a standard feature of image acquisition and processing software for clinical images. Authors may need to contact device manufacturers to determine the image size and add height and width labels.
• Scale bars and labels should be clearly visible: Short scale bars, thin scale bars and scale bars in colors that are similar to the image color can easily be overlooked (Figure 4). In multicolor images, it can be difficult to find a color that makes the scale bar stand out. Authors can solve this problem by placing the scale bar outside the image or onto a box with a more suitable background color.
• Annotate scale bar dimensions on the image: Stating the dimensions along with the scale bar allows readers to interpret the image more quickly. Despite this, dimensions were typically stated in the legend instead (Figure 1B), possibly a legacy of printing processes that discouraged text in images. Dimensions should be in high resolution and large enough to be legible. In our set, we came across small and/or low-resolution annotations that were illegible in electronic versions of the paper, even after zooming in.
Scale bars that are visible on larger figures produced by authors may be difficult to read when the size of the figure is reduced to fit onto a journal page. Authors should carefully check page proofs to ensure that scale bars and dimensions are clearly visible.
Figure 4: Using scale bars to annotate image size
Scale bars provide essential information about the size of objects, which orients readers and helps them to bridge the gap between the image and reality. Scales may be indicated by a known size indicator such as a human next to a tree, a coin next to a rock, or a tape measure next to a smaller structure. In microscope images, a bar of known length is included. Example images were created based on problems observed by reviewers. Poor scale bar examples (1-6 bottom), clear scale bar examples (7-12). Images 1, 4, 7: Microscope images of D. melanogaster nurse cell nuclei;18 2: Microscope image of Dictyostelium discoideum (see Figure 7); 3, 5, 8, 10: Electron microscope image of mouse pancreatic beta-islet cells (Andreas Müller); 6, 11: Microscope image of Lendrum-stained human lung tissue (Haraszti, Public Health Image Library); 9: Photograph of Arabidopsis thaliana; 12: Photograph of fossilized Orobates pabsti.19
3. Use color wisely in images
Colors in images are used to display the natural appearance of an object, or to visualize features with dyes and stains. In the scientific context, adapting colors is possible and may enhance readers’ understanding, while poor color schemes may distract or mislead.
Images showing the natural appearance of a subject, specimen or staining technique (e.g. images showing plant size and appearance, or histopathology images of fat tissue from mice on different diets) are generally presented in color (Figure 5). Electron microscope images are captured in black and white (“grayscale”) by default and may be kept in grayscale to leverage the good contrast resulting from a full luminescence spectrum.
Figure 5: Image types and their accessibility in colorblind render and grayscale mode
Shown are examples of the types of images that one might find in manuscripts in the biological or biomedical sciences: photograph, fluorescent microscope images with 1-3 color hues/Look-up-tables (LUT), electron microscope images. The relative visibility is assessed in a colorblind rendering for deuteranopia, and in grayscale. Grayscale images offer the most contrast (1-color microscope image) but cannot show several structures in parallel (multicolor images, color photographs). Color combinations that are not colorblind accessible were used in rows 3 and 4 to illustrate the importance of colorblind simulation tests. Scale bars are not included in this figure, as they could not be added in a way that would not detract from the overall message of the figure. Images show: Row 1: Darth Vader being attacked, Row 2: D. melanogaster salivary glands,18 Row 3: D. melanogaster egg chambers,18 Row 4: D. melanogaster nurse cell nuclei,18 and Row 5: mouse pancreatic beta-islet cells.
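The grayscale comparison summarized in Figure 5 can also be approximated numerically. The following is a minimal sketch, not the procedure used in this study (which relied on visual inspection and simulation tools): it assumes that the relative-luminance and contrast-ratio formulas from web accessibility guidelines for sRGB colors are a reasonable stand-in for a grayscale visibility check, and the function names and color table are hypothetical.

```python
# Sketch: estimate how visible a pure display color is on a black background.
# Assumption: sRGB relative luminance and a WCAG-style contrast ratio are used
# as a proxy for a grayscale visibility test; all names are illustrative.

def srgb_to_linear(channel_8bit):
    """Convert one 8-bit sRGB channel value (0-255) to linear light (0-1)."""
    c = channel_8bit / 255.0
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Weighted sum of the linearized red, green and blue components."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """Contrast ratio between two colors, from 1 (identical) to about 21."""
    lum_a, lum_b = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


BLACK = (0, 0, 0)
COLORS = {
    "white": (255, 255, 255),
    "cyan": (0, 255, 255),
    "green": (0, 255, 0),
    "magenta": (255, 0, 255),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
}


def rank_on_black(colors=COLORS):
    """Return color names sorted from least to most contrast against black."""
    return sorted(colors, key=lambda name: contrast_ratio(colors[name], BLACK))


if __name__ == "__main__":
    for name in rank_on_black():
        print(f"{name:8s} {contrast_ratio(COLORS[name], BLACK):6.2f}")
```

Under these assumptions, pure blue and red rank lowest against black and white ranks highest, which is one way to see why a single channel is often easier to read in grayscale.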
In some instances, scientists can choose whether to show grayscale or color images. Assigning colors may be optional, even though it is the default setting in imaging programs. When showing only one color channel, scientists may consider presenting this channel in grayscale to optimally display fine details. This may include variations in staining intensity or fine structures. When opting for color, authors should use grayscale visibility tests (Figure 6) to determine whether visibility is compromised. This can occur when dark colors, such as magenta, red, or blue, are shown on a black background.
Figure 6: Visibility of colors/hues differs and depends on the background color
The best contrast is achieved with grayscale images or dark hues on a light background (first row). Dark color hues, such as red and blue, on a dark background (last row) are least visible. Visibility can be tested with mock grayscale. Images show actin filaments in Dictyostelium discoideum (LifeAct-GFP). All images have the same scale. Abbreviations: GFP, green fluorescent protein.
4. Choose a colorblind accessible color palette
Fluorescent images with merged color channels visualize the co-localization of different markers. While many readers find these images to be visually appealing and informative, these images are often inaccessible to colorblind co-authors, reviewers, editors, and readers.
Deuteranopia, the most common form of colorblindness, affects up to 8% of men and 0.5% of women of northern European ancestry.12 A study of articles published in top peripheral vascular disease journals revealed that 85% of papers with color maps and 58% of papers with heat maps used color palettes that were not colorblind safe.20 We show that approximately half of cell biology papers, and one third of physiology and plant science papers, contained images that were inaccessible to readers with deuteranopia. Scientists should consider the following points to ensure that images are accessible to colorblind readers.
• Select colorblind safe colors: Researchers should use colorblind safe color palettes for fluorescence and other images where color may be adjusted. Figure 7 illustrates how four different color combinations would look to viewers with different types of color blindness. Green and red are indistinguishable to readers with deuteranopia, whereas green and blue are indistinguishable to readers with tritanopia, a rare form of color blindness. Cyan and magenta are the best options, as these two colors look different to viewers with normal color vision, deuteranopia or tritanopia. Green and magenta are also shown, as scientists often prefer to show colors close to the excitation value of the fluorescent dyes, which are often green and red.
• Display separate channels in addition to the merged image: Selecting a colorblind safe color palette becomes increasingly difficult as more colors are added.
When the image includes three or more colors, authors are encouraged to show separate images for each channel, followed by the merged image (Figure 8). Individual channels may be shown in grayscale to make it easier for readers to perceive variations in staining intensity.
• Use simulation tools to confirm that essential features are visible to colorblind viewers: Free tools, such as Color Oracle (RRID:SCR_018400), quickly simulate different forms of color blindness by adjusting the colors on the computer screen to simulate what a colorblind person would see. Scientists using FIJI (RRID:SCR_002285) can select the “Simulate colorblindness” option in the “Color” menu under “Images”.
Figure 7: Color combinations as seen with normal vision and two types of colorblindness. The figure illustrates how four possible color combinations for multichannel microscope images would appear to someone with normal color vision, the most common form of colorblindness (deuteranopia), and a rare form of color blindness (tritanopia). Some combinations that are accessible to someone with deuteranopia are not accessible to readers with tritanopia, for example green/blue combinations. Microscope images show Dictyostelium discoideum expressing Vps32-GFP (Vps32-green fluorescent protein; broad signal in cells) and stained with dextran (spotted signal) after infection with conidia of Aspergillus fumigatus. All images have the same scale. Abbreviations: GFP, green fluorescent protein.
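The palette and split-channel recommendations above can be illustrated with a small sketch. It assumes that each channel is available as a 2-D grid of 8-bit intensities and that a look-up table simply routes a channel into the red, green and blue components of the merged image; the names `LUTS`, `merge_channels` and `grayscale_panel` are hypothetical and do not correspond to any imaging program's API.

```python
from typing import Dict, List, Sequence, Tuple

Channel = List[List[int]]        # 2-D grid of 8-bit intensities (0-255)
RGB = Tuple[int, int, int]
RGBImage = List[List[RGB]]

# Which RGB components each look-up table (LUT) feeds; illustrative pure hues.
LUTS: Dict[str, RGB] = {
    "green": (0, 1, 0),
    "magenta": (1, 0, 1),
    "cyan": (0, 1, 1),
}


def merge_channels(channels: Sequence[Channel], colors: Sequence[str]) -> RGBImage:
    """Additively merge grayscale channels into one RGB image using the given LUTs."""
    height, width = len(channels[0]), len(channels[0][0])
    merged: RGBImage = []
    for y in range(height):
        row: List[RGB] = []
        for x in range(width):
            r = g = b = 0
            for channel, color in zip(channels, colors):
                use_r, use_g, use_b = LUTS[color]
                value = channel[y][x]
                r += use_r * value
                g += use_g * value
                b += use_b * value
            # Clip so that overlapping signals stay within the 8-bit range.
            row.append((min(r, 255), min(g, 255), min(b, 255)))
        merged.append(row)
    return merged


def grayscale_panel(channel: Channel) -> RGBImage:
    """Render a single channel in grayscale for a split-channel panel."""
    return [[(v, v, v) for v in row] for row in channel]


# Toy 2x2 example: the bottom-right pixel carries signal in both channels.
marker_a: Channel = [[255, 0], [0, 128]]
marker_b: Channel = [[0, 255], [0, 128]]

# Split channels in grayscale first, followed by the merged image.
panels = [
    grayscale_panel(marker_a),
    grayscale_panel(marker_b),
    merge_channels([marker_a, marker_b], ["green", "magenta"]),
]
```

With a green/magenta pair, a pixel that carries equal signal in both channels merges to a neutral gray, so regions of overlap do not depend on telling two hues apart. Whether a given combination is accessible should still be confirmed with a colorblindness simulator.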
Figure 8: Strategies for making 2- or 3-channel microscope images colorblind safe. Images in the first row are not colorblind safe. Readers with the most common form of colorblindness would not be able to identify key features. Possible accessible solutions are shown: changing colors/LUTs to colorblind friendly combinations, showing each channel in a separate image, showing colors in grayscale and inverting grayscale images to maximize contrast. Solutions 3 and 4 (show each channel in grayscale, or in inverted grayscale) are more informative than solutions 1 and 2. Regions of overlap are sometimes difficult to see in merged images without split channels. When splitting channels, scientists often use colors that have low contrast, as explained in Figure 6 (e.g. red or blue on black). Microscope images show D. melanogaster egg chambers (2 colors) and nurse cell nuclei (3 colors).18 All images of egg chambers have the same scale, as do all images of nurse cells. Abbreviations: LUT, look-up table.
5. Design the figure
Figures often contain more than one panel. Careful planning is needed to convey a clear message, while ensuring that all panels fit together and follow a logical order. A planning table (Figure 9A) helps scientists to determine what information is needed to answer the research question. The table outlines the objectives, types of visualizations required, and experimental groups that should appear in each panel.
A planning table template is available on OSF.14 After completing the planning table, scientists should sketch out the position of panels, and the position of images, graphs, and titles within each panel (Figure 9B). Audiences read a page from top to bottom and/or from left to right. Selecting one reading direction and arranging panels in rows or columns helps with figure planning. Using enough white space to separate rows or columns will visually guide the reader through the figure. The authors can then assemble the figure based on the draft sketch.
Figure 9: Planning multipanel figures. Planning tables and layout sketches are useful tools to efficiently design figures that address the research question. A. Planning tables allow scientists to select and organize elements needed to answer the research question addressed by the figure. B. Layout sketches allow scientists to design a logical layout for all panels listed in the planning table and ensure that there is adequate space for all images and graphs.
6. Annotate the figure
Annotations with text, symbols or lines allow readers from many different backgrounds to rapidly see essential features, interpret images, and gain insight. Unfortunately, scientists often design figures for themselves, rather than their audience.7 Examples of annotations are shown in Figure 10. Table 1 describes important factors to consider for each annotation type.
Figure 10: Using arrows, regions of interest, lines and letter codes to annotate structures in images. Text descriptions alone are often insufficient to clearly point to a structure or region in an image. Arrows and arrowheads, lines, letters, and dashed enclosures can help if overlaid on the respective part of the image. Microscope images show D. melanogaster egg chambers,18 with the different labelling techniques in use. The table provides an overview of their applicability and common pitfalls. All images have the same scale.
Table 1: Use annotations to make figures accessible to a broad audience
Feature to be explained | Annotation
Size | Scale bar with dimensions
Direction of movement | Arrow with tail
Draw attention to: points of interest | Symbol (arrowhead, star, etc.)
Draw attention to: regions of interest (black & white image) | Highlight in color if this does not obscure important features within the region, OR outline with boxes or circles
Draw attention to: regions of interest (color image) | Outline with boxes or circles
Draw attention to: layers | Labeled brackets beside the image for layers that are visually identifiable across the entire image, OR a line on the image for wavy layers that may be difficult to identify
Define features within an image | Labels
When adding annotations to an image, scientists should consider the following steps.
• Choose the right amount of labeling. Figure 11 shows three levels of annotation.
The barely annotated image (11A) is only accessible to scientists already familiar with the object and technique, whereas the heavily annotated version (11C) contains numerous annotations that obstruct the image and a legend that is time consuming to interpret. Panel 11B is more readable; annotations of a few key features are shown, and the explanations appear right below the image for easy interpretation. Explanations of labels are often placed in the figure legend. Alternating between examining the figure and legend is time consuming, especially when the legend and figure are on different pages. Figure 11D shows one option for situations where extensive annotations are required to explain a complex image. An annotated image is placed as a legend next to the original image. A semi-transparent white layer mutes the image to allow annotations to stand out.
Figure 11: Different levels of detail for image annotations. Annotations help to orient the audience but may also obstruct parts of the image. Authors must find the right balance between too few and too many annotations. 1. Example with no annotations. Readers cannot determine what is shown. 2. Example with a few annotations to orient readers to key structures. 3. Example with many annotations, which obstruct parts of the image. The long legend below the figure is confusing. 4. Example shows a solution for situations where many annotations are needed to explain the image. An annotated version is placed next to an unannotated version of the image for comparison.
The legend below the image helps readers to interpret the image, without having to refer to the figure legend. Note the different requirements for space. Electron microscope images show mouse pancreatic beta-islet cells.
• Use abbreviations cautiously: Abbreviations are commonly used for image and figure annotation to save space, but inevitably require more effort from the reader. Abbreviations are often ambiguous, especially across fields. Authors should run a web search for the abbreviation.21 If the intended meaning is not a top result, authors should refrain from using the abbreviation or clearly define the abbreviation on the figure itself, even if it is already defined elsewhere in the manuscript. Note that in Figure 11, abbreviations have been written out below the image to reduce the number of legend entries.
• Explain colors and stains: Explanations of colors and stains were missing in around 20% of papers. Figure 12 illustrates several problematic practices observed in our dataset, as well as solutions for clearly explaining what each color represents. This figure uses fluorescence images as an example; however, we also observed many histology images in which authors did not mention which stain was used. Authors should describe how stains affect the tissue shown or use annotations to show staining patterns of specific structures. This allows readers who are unfamiliar with the stain to interpret the image.
Figure 12: Explain color in images. Cells and their structures are almost all transparent.
Every dye, stain, and fluorescent label therefore should be clearly explained to the audience. Labels should be colorblind safe. Large labels that stand out against the background are easy to read. Authors can make figures easier to interpret by placing the color label close to the structure; color labels should only be placed in the figure legend when this is not possible. Example images were created based on problems observed by reviewers. Microscope images show D. melanogaster egg chambers stained with the DNA dye DAPI (4′,6-diamidino-2-phenylindole) and probe for a specific mRNA species.18 All images have the same scale.
• Ensure that annotations are accessible to colorblind readers: Confirming that labels or annotations are visible to colorblind readers is important for both color and grayscale images (Figure 13). Up to one third of papers in our dataset contained annotations or labels that would not have been visible to someone with deuteranopia. This occurred because the annotations blended in with the background (e.g. red arrows on green plants) or the authors used the same symbol in colors that are indistinguishable to someone with deuteranopia to mark different features. Figure 13 illustrates how to annotate a grayscale image so that it is accessible to colorblind readers. Using text to describe colors is also problematic for colorblind readers. This problem can be alleviated by using colored symbols in the legend or by using distinctly shaped annotations such as open vs. closed arrows, thin vs. wide lines, or dashed vs. solid lines. Color blindness simulators help in determining whether annotations are accessible to all readers.
Figure 13: Annotations should be colorblind safe. 1. The annotations displayed in the first image are inaccessible to colorblind individuals, as shown with the visibility test below. This example was created based on problems observed by reviewers. 2-3. Two colorblind safe alternative annotations, in color (2) and in grayscale (3). The bottom row shows a test rendering for deuteranopia colorblindness. Note that double-encoding of different hues and different shapes (e.g. different letters, arrow shapes, or dashed/non-dashed lines) allows all audiences to interpret the annotations. Electron microscope images show mouse pancreatic beta-cell islet cells. All images have the same scale.
7. Prepare figure legends
Each figure and legend are meant to be self-explanatory and should allow readers to quickly assess a paper or understand complex studies that combine different methodologies or model systems. To date, there are no guidelines for figure legends for images, as the scope and length of legends vary across journals and disciplines. Some journals require legends to include details on object, size, methodology or sample size, while other journals require a minimalist approach and mandate that information should not be repeated in subsequent figure legends. Our data suggest that important information needed to interpret images was regularly missing from the figure or figure legend. This includes the species and tissue type, or object shown in the figure, clear explanations of all labels, annotations and colors, and markings or legend entries denoting insets. Presenting this information on the figure itself is more efficient for the reader; however, any details that are not marked in the figure should be explained in the legend.
While not reporting species and tissue information in every figure legend may be less of an issue for papers that examine a single species and tissue, this is a major problem when a study includes many species and tissues, which may be presented in different panels of the same figure. Additionally, the scientific community is increasingly developing automated data mining tools, such as the Source Data tool, to collect and synthesize information from figures and other parts of scientific papers. Unlike humans, these tools cannot piece together information scattered throughout the paper to determine what might be shown in a particular figure panel. Even for human readers, this process wastes time. Therefore, we recommend that authors present information in a clear and accessible manner, even if some information may be repeated for studies with simple designs.
Discussion
A flood of images is published every day in scientific journals and the number is continuously increasing. Of these, around 4% likely contain intentionally or accidentally duplicated images.3 Our data show that, in addition, most papers show images that are not fully interpretable due to issues with scale markings, annotation, and/or color. This affects scientists’ ability to interpret, critique and build upon the work of others. Images are also increasingly submitted to image archives to make image data widely accessible and permit future re-analyses. A substantial fraction of images that are neither human nor machine-readable lowers the potential impact of such archives.
Based on our data examining common problems with published images, we provide a few simple recommendations, with examples illustrating good practices. We hope that these recommendations will help authors to make their published images legible and interpretable.
Limitations: While most results were consistent across the three subfields of biology, findings may not be generalizable to other fields. Our sample included the top 15 journals that publish original research for each field. Almost all journals were indexed in PubMed. Results may not be generalizable to journals that are un-indexed, have low impact factors, or are not published in English. Data abstraction was performed manually due to the complexity of the assessments. Error rates were 5% for plant sciences, 4% for physiology and 3% for cell biology. Our assessments focused on factors that affect readability of image-based figures in scientific publications. Future studies may include assessments of raw images and meta-data to examine factors that affect reproducibility, such as contrast settings, background filtering and processing history.
Actions journals can take to make image-based figures more transparent and easier to interpret
The role of journals in improving the quality of reporting and accessibility of image-based figures should not be overlooked. There are several actions that journals might consider.
• Screen manuscripts for figures that are not colorblind safe: Open source automated screening tools22 may help journals to efficiently identify common color maps that are not colorblind safe.
• Update journal policies: We encourage journal editors to update policies regarding colorblind accessibility, scale bars, and other factors outlined in this manuscript. Importantly, policy changes should be accompanied by clear plans for implementation and enforcement. Meta-research suggests that changing journal policy, without enforcement or implementation plans, has limited effects on author behavior.
Amending journal policies to require authors to report RRIDs, for example, increases the number of papers reporting RRIDs by 1%.23 In a study of life sciences articles published in Nature journals, the percentage of animal studies reporting the Landis 4 criteria (blinding, randomization, sample size calculation, exclusions) increased from 0 to 16.4% after new guidelines were released.24 In contrast, a randomized controlled trial of animal studies submitted to PLoS One demonstrated that randomizing authors to complete the ARRIVE checklist during submission did not improve reporting.25 Some improvements in reporting of confidence intervals, sample size justification and inclusion and exclusion criteria were noted after Psychological Science introduced new policies,26 although this may have been partially due to widespread changes in the field. A joint editorial series published in the Journal of Physiology and British Journal of Pharmacology did not improve the quality of data presentation or statistical reporting.27
• Re-evaluate limits on the number of figures: Limitations on the number of figures originally stemmed from printing cost calculations, which are becoming increasingly irrelevant as scientific publishing moves online. Unintended consequences of these policies include the advent of large, multipanel figures. These figures are often especially difficult to interpret because the legend appears on a different page, or the figure combines images addressing different research questions.
• Reduce or eliminate page charges for color figures: As journals move online, policies designed to offset the increased cost of color printing are no longer needed. The added costs may incentivize authors to use grayscale in cases where color would be beneficial.
• Encourage authors to explain labels or annotations in the figure, rather than in the legend: This is more efficient for readers.
• Encourage authors to share image data in public repositories: Open data benefits authors and the scientific community.28-30
How can the scientific community improve image-based figures?
The role of scientists in the community is multi-faceted. As authors, scientists should familiarize themselves with guidelines and recommendations, such as those provided above. As reviewers, scientists should ask authors to improve erroneous or uninformative image-based figures. As instructors, scientists should ensure that bioimaging and image data handling are taught during undergraduate or graduate courses, and support existing initiatives such as NEUBIAS31 (Network of European Bioimage Analysts) that aim to increase training opportunities in bioimage analysis. Scientists are also innovators. As such, they should support emerging image data archives, which may expand to automatically source images from published figures. Repositories for other types of data are already widespread; however, the idea of image repositories has only recently gained traction.32 Existing image databases, which are mainly used for raw image data and meta-data, include the Allen Brain Atlas, the Image Data Resource33 and the emerging BioImage Archives.32 Springer Nature encourages authors to submit imaging data to the Image Data Resource.33 While scientists have called for common quality standards for archived images and meta-data,32 such standards have not been defined, implemented, or taught.
Examining standard practices for reporting images in scientific publications, as outlined here, is one strategy for establishing common quality standards.
In the future, it is possible that each image published electronically in a journal or submitted to an image data repository will follow good practice guidelines, and will be accompanied by expanded “meta-data” or “alt-text/attribute” files. Alt-text is already published in html to provide context if an image cannot be accessed (e.g. by blind readers). Similarly, images in online articles and deposited in archives could contain essential information in a standardized format. The information could include the main objective of the figure, specimen information, ideally with research resource identifier34 (RRID), specimen manipulation (dissection, staining, RRID for dyes and antibodies used), as well as the imaging method including essential items from meta-files of the microscope software, information about image processing and adjustments, information about scale, annotations, insets, and colors shown, and confirmation that the images are truly representative.
Conclusions
Our meta-research study of standard practices for presenting images in three fields highlights current shortcomings in publications. PubMed indexes approximately 800,000 new papers per year, or 2,200 papers per day (https://www.nlm.nih.gov/bsd/index_stats_comp.html). Twenty-three percent,1 or approximately 500 papers per day, contain images.
Our survey data suggest that most of these papers will have deficiencies in image presentation, which may affect legibility and interpretability. These observations lead to targeted recommendations for improving the quality of published images. Our recommendations are available as a slide set via the Open Science Framework and can be used to teach best practices and to avoid misleading or uninformative image-based figures. Our analysis underscores the need for standardized image publishing guidelines. Adherence to such guidelines will allow the scientific community to unlock the full potential of image collections in the life sciences for current and future generations of researchers.
Methods
Systematic review: We examined original research articles that were published in April of 2018 in the top 15 journals that publish original research for each of three different categories (physiology, plant science, cell biology). Journals for each category were ranked according to 2016 impact factors listed for the specified categories in Journal Citation Reports. Journals that only publish review articles or that did not publish an April issue were excluded. We followed all relevant aspects of the PRISMA guidelines.35 Items that only apply to meta-analyses or are not relevant to literature surveys were not followed. Ethical approval was not required.
Search strategy: Articles were identified through a PubMed search, as all journals were PubMed indexed.
Electronic search results were verified by comparison with the list of articles published in April issues on the journal website. The electronic search used the following terms:
Physiology: ("Journal of pineal research"[Journal] AND 3[Issue] AND 64[Volume]) OR ("Acta physiologica (Oxford, England)"[Journal] AND 222[Volume] AND 4[Issue]) OR ("The Journal of physiology"[Journal] AND 596[Volume] AND (7[Issue] OR 8[Issue])) OR (("American journal of physiology. Lung cellular and molecular physiology"[Journal] OR "American journal of physiology. Endocrinology and metabolism"[Journal] OR "American journal of physiology. Renal physiology"[Journal] OR "American journal of physiology. Cell physiology"[Journal] OR "American journal of physiology. Gastrointestinal and liver physiology"[Journal]) AND 314[Volume] AND 4[Issue]) OR ("American journal of physiology. Heart and circulatory physiology"[Journal] AND 314[Volume] AND 4[Issue]) OR ("The Journal of general physiology"[Journal] AND 150[Volume] AND 4[Issue]) OR ("Journal of cellular physiology"[Journal] AND 233[Volume] AND 4[Issue]) OR ("Journal of biological rhythms"[Journal] AND 33[Volume] AND 2[Issue]) OR ("Journal of applied physiology (Bethesda, Md. : 1985)"[Journal] AND 124[Volume] AND 4[Issue]) OR ("Frontiers in physiology"[Journal] AND ("2018/04/01"[Date - Publication] : "2018/04/30"[Date - Publication])) OR ("The international journal of behavioral nutrition and physical activity"[Journal] AND ("2018/04/01"[Date - Publication] : "2018/04/30"[Date - Publication]))
Plant science: ("Nature plants"[Journal] AND 4[Issue] AND 4[Volume]) OR ("Molecular plant"[Journal] AND 4[Issue] AND 11[Volume]) OR ("The Plant cell"[Journal] AND 4[Issue] AND 30[Volume]) OR ("Plant biotechnology journal"[Journal] AND 4[Issue] AND 16[Volume]) OR ("The New phytologist"[Journal] AND (1[Issue] OR 2[Issue]) AND 218[Volume]) OR ("Plant physiology"[Journal] AND 4[Issue] AND 176[Volume]) OR ("Plant, cell & environment"[Journal] AND 4[Issue] AND 41[Volume]) OR ("The Plant journal : for cell and molecular biology"[Journal] AND (1[Issue] OR 2[Issue]) AND 94[Volume]) OR ("Journal of experimental botany"[Journal] AND (8[Issue] OR 9[Issue] OR 10[Issue]) AND 69[Volume]) OR ("Plant & cell physiology"[Journal] AND 4[Issue] AND 59[Volume]) OR ("Molecular plant pathology"[Journal] AND 4[Issue] AND 19[Volume]) OR ("Environmental and experimental botany"[Journal] AND 148[Volume]) OR ("Molecular plant-microbe interactions : MPMI"[Journal] AND 4[Issue] AND 31[Volume]) OR ("Frontiers in plant science"[Journal] AND ("2018/04/01"[Date - Publication] : "2018/04/30"[Date - Publication])) OR ("The Journal of ecology" ("2018/04/01"[Date - Publication] : "2018/04/30"[Date - Publication]))
Cell biology: ("Cell"[Journal] AND (2[Issue] OR 3[Issue]) AND 173[Volume]) OR ("Nature medicine"[Journal] AND 24[Volume] AND 4[Issue]) OR ("Cancer cell"[Journal] AND 33[Volume] AND 4[Issue]) OR ("Cell stem cell"[Journal] AND 22[Volume] AND 4[Issue]) OR ("Nature cell biology"[Journal] AND 20[Volume] AND 4[Issue]) OR ("Cell metabolism"[Journal] AND 27[Volume] AND 4[Issue]) OR ("Science translational medicine"[Journal] AND 10[Volume] AND (435[Issue] OR 436[Issue] OR 437[Issue] OR 438[Issue])) OR ("Cell research"[Journal] AND 28[Volume] AND 4[Issue]) OR ("Molecular cell"[Journal] AND 70[Volume] AND (1[Issue] OR 2[Issue])) OR ("Nature structural & molecular biology"[Journal] AND 25[Volume] AND 4[Issue]) OR ("The EMBO journal"[Journal] AND 37[Volume] AND (7[Issue] OR 8[Issue])) OR ("Genes & development"[Journal] AND 32[Volume] AND 7-8[Issue]) OR ("Developmental cell"[Journal] AND 45[Volume] AND (1[Issue] OR 2[Issue])) OR ("Current biology : CB"[Journal] AND 28[Volume] AND (7[Issue] OR 8[Issue])) OR ("Plant cell"[Journal] AND 30[Volume] AND 4[Issue])
Screening: Screening for each article was performed by two independent reviewers (Physiology: TLW, SS, EMW, VI, KW, MO; Plant science: TLW, SJB; Cell biology: EW, SS) using Rayyan software (RRID:SCR_017584), and disagreements were resolved by consensus. A list of articles was uploaded into Rayyan. Reviewers independently examined each article and marked whether the article was included or excluded, along with the reason for exclusion. Both reviewers screened all articles published in each journal between April 1 and April 30, 2018 to identify full length, original research articles (Table S1, Table S2, Table S3, Figure S1) published in the print issue of the journal.
Articles for online journals that do not publish print issues were included if the publication date was between April 1 and April 30, 2018. Articles were excluded if they were not original research articles, or if an accepted version of the paper was posted as an “in press” or “early release” publication; however, the final version did not appear in the print version of the April issue. Articles were included if they contained at least one eligible image, such as a photograph, an image created using a microscope or electron microscope, or an image created using a clinical imaging technology such as ultrasound or MRI. Blot images were excluded, as many of the criteria in our abstraction protocol cannot easily be applied to blots. Computer generated images, graphs and data figures were also excluded. Papers that did not contain any eligible images were excluded. Abstraction: All abstractors completed a training set of 25 articles before abstracting data. Data abstraction for each article was performed by two independent reviewers (Physiology: AA, AV; Plant science: MO, TLA, SA, KW, MAG, IF; Cell biology: IF, AA, AV, KW, MAG). When disagreements could not be resolved by consensus between the two reviewers, ratings were assigned after a group review of the paper. Eligible manuscripts were reviewed in detail to evaluate the following questions according to a predefined protocol (available at: https://osf.io/b5296/).14 Supplemental files were not examined, as supplemental images may not be held to the same peer review standards as those in the manuscript. The following items were abstracted: 1. Types of images included in the paper (photograph, microscope image, electron microscope image, image created using a clinical imaging technique such as ultrasound or MRI, other types of images) 2. Did the paper contain appropriately labeled scale bars for all images? 3. Were all insets clearly and accurately marked? 4. Were all insets clearly explained in the legend? 
5. Is the species and tissue, object, or cell line name clearly specified in the figure or legend for all images in the paper? 6. Are any annotations, arrows or labels clearly explained for all images in the paper? 7. Among images where authors can control the colors shown (e.g. fluorescence microscopy), are key features of the images visible to someone with the most common form of colorblindness (deuteranopia)? 8. If the paper contains colored labels, are these labels visible to someone with the most common form of color blindness (deuteranopia)? 9. Are colors in images explained either on the image or within the legend? Questions 7 and 8 were assessed by using Color Oracle36 (RRID:SCR_018400) to simulate the effects of deuteranopia. Verification: Ten percent of articles in each field were randomly selected for verification abstraction, to ensure that abstractors in different fields were following similar procedures. Data were abstracted by a single abstractor (TLW). The question on species and tissue was excluded from verification abstraction for articles in cell biology and plant sciences, as the verification abstractor lacked the field-specific expertise needed to assess this question. Results from the verification abstractor were compared with consensus results from the two independent abstractors for each paper and discrepancies were resolved through discussion. Error rates were calculated as the percentage of responses for which the abstractors’ response was incorrect.
Error rates were 5% for plant sciences, 4% for physiology and 3% for cell biology.
Data processing and creation of figures: Data are presented as n (%). Summary statistics were calculated using Python (RRID:SCR_008394, version 3.6.9, libraries NumPy 1.18.5 and Matplotlib 3.2.2). Charts were prepared with a Python-based Jupyter Notebook (Jupyter-client37, RRID:SCR_018413; Python version 3.6.9, RRID:SCR_008394; libraries NumPy38 1.18.5 and Matplotlib39 3.2.2) and assembled into figures with vector graphic software. Example images were previously published or generously donated by the manuscript authors, as indicated in the figure legends. Image acquisition is described in the corresponding sources (D. melanogaster images18; mouse pancreatic beta islet cells: A. Müller, personal communication; Orobates pabsti19). Images were cropped, labeled, and color-adjusted with FIJI15 (RRID:SCR_002285) and assembled with vector-graphic software. Color-blind and grayscale rendering of images was done using Color Oracle36 (RRID:SCR_018400). All poor and clear images presented here are ‘mock examples’ prepared based on practices observed during data abstraction.
Funding
TLW was funded by American Heart Association grant 16GRNT30950002 and a Robert W. Fulk Career Development Award (Mayo Clinic Division of Nephrology & Hypertension). LHH was supported by The Hormel Foundation and National Institutes of Health grant CA187035.
Acknowledgements
We thank the eLife Community Ambassadors program for facilitating this work, and Andreas Müller and John A.
Nyakatura for generously sharing example images. Falk Hillmann and Thierry Soldati provided the amoeba strains used for imaging. Some of the early career researchers who participated in this research would like to thank their principal investigators and mentors for supporting their efforts to improve science.

Supplemental Tables

Table S1: Number of articles examined by journal in physiology
Journal | Articles Screened (n = 431) | Original Research Articles (n = 312, 72%) | Included Articles (n = 172, 40%)
Journal of Pineal Research | 7 | 6 (86%) | 5 (71%)
Acta Physiologica | 21 | 10 (48%) | 5 (24%)
Journal of Physiology | 39 | 22 (56%) | 12 (31%)
International Journal of Behavioral Nutrition and Physical Activity | 9 | 9 (100%) | 0
AJP: Lung, Cellular and Molecular Physiology | 15 | 12 (80%) | 6 (40%)
Journal of General Physiology | 10 | 4 (40%) | 3 (30%)
AJP: Endocrinology and Metabolism | 9 | 8 (89%) | 6 (67%)
Frontiers in Physiology | 142 | 107 (75%) | 47 (33%)
Journal of Cellular Physiology | 88 | 55 (63%) | 47 (53%)
AJP: Renal Physiology | 15 | 15 (100%) | 10 (67%)
AJP: Cell Physiology | 11 | 11 (100%) | 9 (82%)
Journal of Biological Rhythms | 9 | 8 (89%) | 2 (22%)
AJP: Gastrointestinal and Liver Physiology | 6 | 6 (100%) | 5 (83%)
Journal of Applied Physiology | 31 | 31 (100%) | 10 (32%)
AJP: Heart and Circulatory Physiology | 19 | 8 (42%) | 5 (26%)
Values are n, or n (% of all articles). Screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in April 2018, or did not include eligible images. Abbreviations: AJP, American Journal of Physiology.

Table S2: Number of articles examined by journal in plant science
Journal | Articles Screened (n = 502) | Original Research Articles (n = 377, 75%) | Included Articles (n = 257, 51%)
Nature Plants | 13 | 3 (23%) | 0
Molecular Plant | 14 | 7 (50%) | 6 (43%)
Plant Cell * | 15 | 9 (60%) | 8 (53%)
Plant Biotechnology Journal | 12 | 10 (83%) | 6 (50%)
New Phytologist | 73 | 53 (73%) | 31 (42%)
Plant Physiology | 39 | 34 (87%) | 27 (69%)
Plant Cell and Environment | 14 | 11 (79%) | 7 (50%)
Plant Journal | 31 | 24 (77%) | 19 (61%)
Journal of Experimental Botany | 74 | 55 (74%) | 41 (55%)
Journal of Ecology ** | 0 | |
Plant and Cell Physiology | 21 | 13 (62%) | 9 (43%)
Molecular Plant Pathology | 21 | 16 (76%) | 15 (71%)
Environmental and Experimental Botany | 17 | 17 (100%) | 12 (71%)
Molecular Plant-Microbe Interactions | 8 | 7 (88%) | 4 (50%)
Frontiers in Plant Science | 150 | 118 (79%) | 72 (48%)
* This journal was also included on the cell biology list (Table S3). ** No articles from the Journal of Ecology were screened as the journal did not publish an April 2018 issue. Values are n, or n (% of all articles). Screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in April 2018, or did not include eligible images.
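The n (%) entries reported in the supplemental tables can be reproduced with a few lines of Python, the language the authors state they used for summary statistics. The sketch below is illustrative only: the authors' own code is not shown in the text, the helper name `n_percent` is hypothetical, and rounding half up is an assumption (it agrees with the 55 of 88 entry in Table S1).

```python
import math


def n_percent(n, total):
    """Format a count as 'n (p%)'; the percentage is rounded half up (assumed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = math.floor(100 * n / total + 0.5)
    return f"{n} ({pct}%)"


# Three rows taken from Table S1: (journal, articles screened, original research articles)
table_rows = [
    ("Journal of Pineal Research", 7, 6),
    ("Acta Physiologica", 21, 10),
    ("Journal of Cellular Physiology", 88, 55),
]
for journal, screened, original in table_rows:
    print(journal, screened, n_percent(original, screened))
```

Python's built-in `round` and the `:.0f` format round exact halves to the nearest even integer, which is why the sketch uses `math.floor(x + 0.5)` instead.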
Table S3: Number of articles examined by journal in cell biology
Journal | Articles Screened (n = 409) | Original Research Articles (n = 222, 54%) | Included Articles (n = 159, 39%)
Cell | 50 | 33 (66%) | 19 (38%)
Nature medicine | 32 | 10 (31%) | 6 (19%)
Cancer Cell | 21 | 12 (57%) | 5 (24%)
Cell Stem Cell | 18 | 7 (39%) | 5 (28%)
Nature Cell Biology | 20 | 9 (45%) | 9 (45%)
Cell Metabolism | 20 | 9 (45%) | 8 (40%)
Science Translational Medicine | 25 | 18 (72%) | 17 (68%)
Cell Research | 13 | 6 (46%) | 5 (38%)
Molecular Cell | 38 | 26 (68%) | 13 (34%)
Nature Structural and Molecular Biology | 12 | 7 (58%) | 2 (17%)
EMBO Journal | 23 | 17 (74%) | 16 (70%)
Genes and Development | 13 | 8 (62%) | 5 (38%)
Developmental Cell | 22 | 15 (68%) | 15 (68%)
Current Biology | 87 | 36 (41%) | 26 (30%)
Plant Cell * | 15 | 9 (60%) | 8 (53%)
* This journal was also included on the plant science list (Table S2). Values are n, or n (% of all articles). Screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in April 2018, or did not include eligible images.

Table S4: Scale information in papers
Column groups: No scale information in any figure; Some scale information; Complete scale information.
Field | No scale information in any figure | Some figures, magnification in legend | All figures, magnification in legend | Some figures, scale bar with dimensions in legend | Some figures, scale bar with dimensions | All figures, scale bar with dimensions in legend | All figures, scale bar with dimensions
Physiology | 24.4 | 5.2 | 1.7 | 10.5 | 9.3 | 26.7 | 22.1
Cell biology | 10.1 | 0.0 | 1.3 | 22.0 | 11.9 | 40.9 | 13.8
Plant science | 29.2 | 0.4 | 0.4 | 31.5 | 10.5 | 23.3 | 4.7
Values are % of papers.

Figure S1: Flow chart of study screening and selection process

References
1. Lee P, West JD and Howe B. Viziometrics: Analyzing Visual Information in the Scientific Literature. IEEE Transactions on Big Data. 2018;4:117-129. 2. Cromey DW. Digital images are data: and should be treated as such. Methods Mol Biol. 2013;931:1-27. 3. Bik EM, Casadevall A and Fang FC.
The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications. mBio. 2016;7. 4. Laissue PP, Alghamdi RA, Tomancak P, Reynaud EG and Shroff H. Assessing phototoxicity in live fluorescence imaging. Nat Methods. 2017;14:657-661. 5. Marques G, Pengo T and Sanders MA. Imaging methods are vastly underreported in biomedical research. Elife. 2020;9. 6. Pain E. How to (seriously) read a scientific paper. Science. 2016. 7. Rolandi M, Cheng K and Perez-Kriz S. A brief guide to designing effective figures for the scientific paper. Advanced Materials. 2011;23:4343-4346. 8. Canese K. PubMed® Display Enhanced with Images from the New NCBI Images Database. NLM Technical Bulletin. 2010;376:e14. 9. Liechti R, George N, Gotz L, El-Gebali S, Chasapi A, Crespo I, Xenarios I and Lemberger T. SourceData: a semantic platform for curating and searching figures. Nat Methods. 2017;14:1021-1022. 10. Lindquist M. Neuroimaging results altered by varying analysis pipelines. Nature. 2020;582:36-37. 11. 
Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, Kirchler M, Iwanir R, Mumford JA, Adcock RA, Avesani P, Baczkowski BM, Bajracharya A, Bakst L, Ball S, Barilari M, Bault N, Beaton D, Beitner J, Benoit RG, Berkers R, Bhanji JP, Biswal BB, Bobadilla-Suarez S, Bortolini T, Bottenhorn KL, Bowring A, Braem S, Brooks HR, Brudner EG, Calderon CB, Camilleri JA, Castrellon JJ, Cecchetti L, Cieslik EC, Cole ZJ, Collignon O, Cox RW, Cunningham WA, Czoschke S, Dadi K, Davis CP, Luca A, Delgado MR, Demetriou L, Dennison JB, Di X, Dickie EW, Dobryakova E, Donnat CL, Dukart J, Duncan NW, Durnez J, Eed A, Eickhoff SB, Erhart A, Fontanesi L, Fricke GM, Fu S, Galvan A, Gau R, Genon S, Glatard T, Glerean E, Goeman JJ, Golowin SAE, Gonzalez-Garcia C, Gorgolewski KJ, Grady CL, Green MA, Guassi Moreira JF, Guest O, Hakimi S, Hamilton JP, Hancock R, Handjaras G, Harry BB, Hawco C, Herholz P, Herman G, Heunis S, Hoffstaedter F, Hogeveen J, Holmes S, Hu CP, Huettel SA, Hughes ME, Iacovella V, Iordan AD, Isager PM, Isik AI, Jahn A, Johnson MR, Johnstone T, Joseph MJE, Juliano AC, Kable JW, Kassinopoulos M, Koba C, Kong XZ, Koscik TR, Kucukboyaci NE, Kuhl BA, Kupek S, Laird AR, Lamm C, Langner R, Lauharatanahirun N, Lee H, Lee S, Leemans A, Leo A, Lesage E, Li F, Li MYC, Lim PC, Lintz EN, Liphardt SW, Losecaat Vermeer AB, Love BC, Mack ML, Malpica N, Marins T, Maumet C, McDonald K, McGuire JT, Melero H, Mendez Leal AS, Meyer B, Meyer KN, Mihai G, Mitsis GD, Moll J, Nielson DM, Nilsonne G, Notter MP, Olivetti E, Onicas AI, Papale P, Patil KR, Peelle JE, Perez A, Pischedda D, Poline JB, Prystauka Y, Ray S, Reuter-Lorenz PA, Reynolds RC, Ricciardi E, Rieck JR, Rodriguez-Thompson AM, Romyn A, Salo T, Samanez-Larkin GR, Sanz-Morales E, Schlichting ML, Schultz DH, Shen Q, Sheridan MA, Silvers JA, Skagerlund K, Smith A, Smith DV, Sokol-Hessner P, Steinkamp SR, Tashjian SM, Thirion B, Thorp JN, Tinghog G, Tisdall L, Tompson SH, Toro-Serey C, Torre Tresols JJ, Tozzi L, Truong V, Turella L, van 't Veer AE, Verguts T, Vettel JM, Vijayarajah S, Vo K, Wall MB, Weeda WD, Weis S, White DJ, Wisniewski D, Xifra-Porxas A, Yearling EA, Yoon S, Yuan R, Yuen KSL, Zhang L, Zhang X, Zosky JE, Nichols TE, Poldrack RA and Schonberg T. Variability in the analysis of a single neuroimaging dataset by many teams. Nature. 2020;582:84-88. 12. National Eye Institute. Facts about color blindness. 2015. https://nei.nih.gov/health/color_blindness/facts_about. Accessed March 13, 2019. 13. Weissgerber TL. Training early career researchers to use meta-research to improve science: A participant guided, “learn by doing” approach. PLoS Biology. 2021. 14. Antonietti A, Jambor H, Alicea B, Audisio TL, Auer S, Bhardwaj V, Burgess S, Ferling I, Gazda MA, Hoeppner L, Ilangovan V, Lo H, Olson M, Mohamed SY, Sarabipour S, Varma A, Walavalkar K, Wissink EM and Weissgerber TL. Meta-research: Creating clear and informative image-based figures for scientific publications. 2020. https://osf.io/b5296/. Accessed January 29, 2021. 15. Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, Pietzsch T, Preibisch S, Rueden C, Saalfeld S, Schmid B, Tinevez JY, White DJ, Hartenstein V, Eliceiri K, Tomancak P and Cardona A. Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012;9:676-82. 16. Rueden CT, Schindelin J, Hiner MC, DeZonia BE, Walter AE, Arena ET and Eliceiri KW. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinformatics. 2017;18:529. 17. Schmied C and Jambor HK.
Effective image visualization for publications - a workflow using open access tools and concepts. F1000Research. 2020;9:1373. 18. Jambor H, Surendranath V, Kalinka AT, Mejstrik P, Saalfeld S and Tomancak P. Systematic imaging reveals features and changing localization of mRNAs in Drosophila development. Elife. 2015;4. 19. Nyakatura JA, Melo K, Horvat T, Karakasiliotis K, Allen VR, Andikfar A, Andrada E, Arnold P, Laustroer J, Hutchinson JR, Fischer MS and Ijspeert AJ. Reverse-engineering the locomotion of a stem amniote. Nature. 2019;565:351-355. 20. Weissgerber TL, Winham SJ, Heinzen EP, Milin-Lazovic JS, Garcia-Valencia O, Bukumiric Z, Savic MD, Garovic VD and Milic NM. Reveal, Don't Conceal: Transforming Data Visualization to Improve Transparency. Circulation. 2019;140:1506-1518. 21. Jambor H. Better figures for the life sciences. ecrLife. August 29, 2018. https://ecrlife420999811.wordpress.com/2018/08/29/better-figures-for-life-sciences/. Accessed September 15, 2020. 22. Saladi S. JetFighter: Towards figure accuracy and accessibility. eLife. 2019. 23. Bandrowski A, Brush M, Grethe JS, Haendel MA, Kennedy DN, Hill S, Hof PR, Martone ME, Pols M, Tan S, Washington N, Zudilova-Seinstra E, Vasilevsky N and Resource Identification Initiative Members. The Resource Identification Initiative: A cultural shift in publishing. F1000Res. 2015;4:134. 24. the NPQIP Collaborative Group. Did a change in Nature journals’ editorial policy for life sciences research improve reporting? BMJ Open Science. 2019;3:e000035. 25. Hair K, Macleod MR, Sena ES and Collaboration II. A randomised controlled trial of an Intervention to Improve Compliance with the ARRIVE guidelines (IICARus). Res Integr Peer Rev. 2019;4:12. 26. Giofre D, Cumming G, Fresc L, Boedker I and Tressoldi P. The influence of journal submission guidelines on authors' reporting of statistics and use of open research practices. PLoS One. 2017;12:e0175583. 27. Diong J, Butler AA, Gandevia SC and Heroux ME.
Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice. PLoS One. 2018;13:e0202121. 28. Piwowar HA and Vision TJ. Data reuse and the open data citation advantage. PeerJ. 2013;1:e175. 29. Markowetz F. Five selfish reasons to work reproducibly. Genome Biol. 2015;16:274. 30. Colavizza G, Hrynaszkiewicz I, I S, K W and B. M. The citation advantage of linking publications to research data. arXiv. 2020. 31. Cimini BA, Norrelykke SF, Louveaux M, Sladoje N, Paul-Gilloteaux P, Colombelli J and Miura K. The NEUBIAS Gateway: a hub for bioimage analysis methods and materials. F1000Res. 2020;9:613. 32. Ellenberg J, Swedlow JR, Barlow M, Cook CE, Sarkans U, Patwardhan A, Brazma A and Birney E. A call for public archives for biological image data. Nat Methods. 2018;15:849-854. 33. Williams E, Moore J, Li SW, Rustici G, Tarkowska A, Chessel A, Leo S, Antal B, Ferguson RK, Sarkans U, Brazma A, Salas REC and Swedlow JR. The Image Data Resource: A Bioimage Data Integration and Publication Platform. Nat Methods. 2017;14:775-781. 34. Bandrowski AE and Martone ME. RRIDs: A Simple Step toward Improving Reproducibility through Rigor and Transparency of Experimental Methods. Neuron. 2016;90:434-6. 35. Moher D, Liberati A, Tetzlaff J and Altman DG.
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. J Clin Epidemiol. 2009;62:1006-12. 36. Jenny B and Kelso NV. Color Oracle. 2018. https://colororacle.org. Accessed March 3, 2020. 37. Kluyver T, Ragan-Kelley B, Pérez F and Granger B. Jupyter Notebooks - a publishing format for reproducible computational workflows. In: F. Loizides and B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. Netherlands: IOS Press; 2016. 38. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, Fernández del Río J, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C and Oliphant TE. Array programming with NumPy. Nature. 2020;585:357-362. 39. Hunter JD. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9:90-95.

10_1101-2020_11_17_386649 ---- Topology-based Sparsification of Graph Annotations
Daniel Danciu 1,2,* Mikhail Karasikov 1,2,3,* Harun Mustafa 1,2,3 André Kahles 1,2,3,† Gunnar Rätsch 1,2,3,4,†
1Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland 2Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland 3Swiss Institute of Bioinformatics, Zurich, Switzerland 4Department of Biology, ETH Zurich, Zurich, Switzerland
Abstract
Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices.
Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the most compact annotation representation previously known. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.
1 Introduction
The exponential increase in global sequencing capacity [1] and the resulting growth of public sequence repositories have created an urgent need for the development of compact representation schemes of biological sequences. Such schemes should not only maintain all relevant biological sequence variation but also provide fast access for sequence search and extraction. After initial attempts focused on the lossless compression of full sequences, e.g., using the Burrows-Wheeler transform [2], the field soon turned towards representing a proxy of the input sequences instead: the sets of all k-mers contained in them. For this, any recurrent occurrence of a substring of length k in the input is represented by a unique k-mer, forming a k-mer set. A query of a given sequence against the input text can then be replaced by exact k-mer matching against the set. Longer strings are queried as a succession of k-mers. Although it is a lossy representation of the input (as, e.g., repeats longer than k are collapsed), constructing k-mer sets has proved highly useful in practice [3, 4, 5, 6].
*Joint-first authors. †Joint corresponding authors; contact: andre.kahles@inf.ethz.ch and gunnar.ratsch@ratschlab.org.
bioRxiv preprint, this version posted February 10, 2021; doi: https://doi.org/10.1101/2020.11.17.386649. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
1.1 Representation of k-mer sets
Various representations have been developed to balance the trade-off between the space taken by the k-mer set and query time or representation accuracy. Conceptually, the k-mer set fully defines a vertex-centric de Bruijn graph, where each k-mer forms a vertex and arcs are represented implicitly, based on whether any two vertices share a k − 1 overlap. The simplest representations are bitmaps or (perfect) hash-tables that indicate the presence or absence of any possible k-mer over the input alphabet in the input text. While non-optimal in space, they offer constant-time query of k-mers. More compact representations use approximate membership query data structures to probabilistically represent a de Bruijn graph [7, 8] or utilize succinct de Bruijn graphs (a generalization of the Burrows-Wheeler transform) [9], which usually require less than one byte per input k-mer over the nucleotide alphabet {A,C,G,T}.
1.2 De Bruijn graph annotation
A major limitation of the above representations is that the identities of any sequence labels contained in the input text set are lost. To alleviate this, the concept of colored de Bruijn graphs emerged [10] (otherwise known as annotated or labeled de Bruijn graphs), allowing for the representation of additional annotations per k-mer. These annotations can either be stored in conjunction with the k-mers or be organized in a separate data structure, using the k-mer representation only as an index space.
Although the first option is used by a number of conceptually interesting methods, such as Mantis [11] that uses counting quotient filters to represent the k-mers linked to an annotation identifier, here we will only focus on the second option, as it allows for the connection of arbitrary annotations to the k-mer set, without re-processing the k-mer index. Conceptually, the set of annotations is a relation between k-mers and labels that can be represented as a binary matrix, where the k-mer set indexes the rows and each annotation label specifies a column. Any entry (i,j) in the matrix represents the relation of k-mer i and annotation j. Different methods have been suggested to compress this annotation matrix in a way that still allows for efficient query. VARI [12, 13] concatenates the rows of the annotation matrix and compresses the result using either an RRR [14] or Elias-Fano coding [15, 16]. Rainbowfish [17] takes advantage of high redundancy in matrix rows by computing a frequency code for the unique rows, compressing the unique rows in a matrix ordered by these codes, then representing the original matrix as a variable-length code vector. However, this method and other frequency coding-based approaches become less effective for data sets with greater levels of noise or inter-sample variability. Multi-BRWT [18] compresses the matrix in a hierarchical tree structure exploiting column similarity, but leaving the possible row redundancy unexploited. Alongside these methods, there is a rich literature of different compressors for graph annotations developed over the years, each improving on the compression performance of previous methods [19, 20, 21]. All of these methods share the common property that they act as general purpose binary matrix compressors, and thus, they do not take into account any particular domain knowledge in their construction.
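To make the annotation matrix of this section concrete, the following minimal Python sketch builds it for a toy input. It is not taken from any of the cited tools: the function names, the sample labels and the sequences are hypothetical, and the matrix is stored uncompressed as nested lists purely for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_annotation(labeled_seqs, k):
    """Build a binary annotation matrix from a {label: sequence} mapping.

    Returns (rows, labels, matrix): rows is the sorted k-mer list indexing the
    matrix rows, labels indexes the columns, and matrix[i][j] == 1 iff k-mer
    rows[i] occurs in the sequence labeled labels[j].
    """
    labels = sorted(labeled_seqs)
    kmer_sets = [kmers(labeled_seqs[label], k) for label in labels]
    rows = sorted(set().union(*kmer_sets))
    matrix = [[int(kmer in s) for s in kmer_sets] for kmer in rows]
    return rows, labels, matrix


# Toy input (hypothetical): two labeled sequences, k = 3.
rows, labels, matrix = build_annotation(
    {"sample_A": "ACGTA", "sample_B": "CGTAC"}, k=3
)
for kmer, row in zip(rows, matrix):
    print(kmer, row)

# Identical rows are the redundancy that row-frequency coding can exploit.
unique_rows = {tuple(row) for row in matrix}
print(len(unique_rows), "unique rows out of", len(matrix))
```

In this toy case two of the four k-mers carry the same label set, so a scheme that stores each unique row once, as described for Rainbowfish above, would keep three rows instead of four.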
1.3 Leveraging graph topology to improve annotation compression
While the methods mentioned above rely solely on similarities between annotation matrix elements to achieve their compression, a few have additionally leveraged graph topology to increase their compression potential. The Bloom filter correction method introduced by [22] encodes the columns of the annotation matrix in Bloom filters with a high false positive rate. Assuming that all vertices within a graph unitig (a path in which all vertices except for the first and last have in- and out-degree 1) share identical annotations, a row in the annotation matrix (corresponding to all vertices from the same unitig in the graph) is computed as the bit-wise AND of the rows stored for every vertex of that unitig. While achieving high accuracy in decoding row annotations, the corrected Bloom filters are not able to losslessly decode the rows of the encoded annotation matrix. In addition, the authors introduce a lossless approach based on wavelet tries which leverages graph backbone paths to improve compression performance. However, these paths must be provided by the user and cannot be computed automatically by the method. The more recently introduced Mantis-MST method [23] constructs an annotation graph with nodes representing the unique rows of the annotation matrix. In this annotation graph, a weighted edge between two nodes v1 and v2 is created if there exist adjacent vertices s1 and s2 in the underlying de Bruijn graph whose annotations are represented by v1 and v2, respectively.
The weight of this edge (v1,v2) is then set to the Hamming distance of the unique rows v1 and v2. Mantis MST computes the minimum spanning tree of the annotation graph and represents the annotation of a node as its bit-wise XOR with the annotation of its parent node in the spanning tree, while only the annotation of the root node is represented explicitly.

1.4 Our contribution

We present a new scheme for representing graph annotations, RowDiff, which takes advantage of similarities between the annotations of neighboring vertices to compress annotation matrices. RowDiff can be constructed using |G| + 3m + O(|c|) bits of memory, where m is the number of k-mers, |c| is the compressed size of the largest column in the annotation matrix, and |G| is the size of the memory representation of the graph, which in our case is less than 4m + o(m) bits [9]. This makes RowDiff suitable for annotating virtually arbitrarily large graphs. Since RowDiff is a transformation of the input annotation matrix that attempts to increase its sparsity, it can be naturally chained with any generic scheme for compressed binary matrix representation to achieve further improvements in compression performance. We demonstrate the compression performance of RowDiff relative to the state-of-the-art lossless Rainbow-MST and MultiBRWT methods on datasets representing different annotation matrix densities.

In the next sections, we define the underlying concepts (Sections 2.1 and 2.2) and detail our methods for construction (Sections 2.3 to 2.4) and querying (Section 2.5) of the RowDiff data structures. We then describe the test datasets (Section 3.1) and study the representation sizes (Sections 3.2, 3.3, and 3.4), construction time (Section 3.5) and query time (Section 3.6) of RowDiff-compressed annotations. Finally, we discuss limitations and directions for future work (Section 4).

2 Method

2.1 Notation

We will operate in the following setting. Let k be a positive integer.
The order-$k$ de Bruijn graph over a set of sequences $S$, denoted by $\mathrm{DBG}_k(S)$, is a directed graph $\mathrm{DBG}_k(S) := (V_k, E_k)$, whose vertices $V_k$ are the set of all distinct sub-strings of length $k$ of sequences in $S$ (k-mers), and an arc links $u \in V_k$ to $v \in V_k$ if $u_{2:k} = v_{1:k-1}$, where $s_{i:j}$ denotes the sub-sequence of $s$ from position $i$ up to and including position $j$. We denote with $\deg^-(v)$ and $\deg^+(v)$, $v \in V_k$, the in- and out-degree of a vertex, respectively. Vertices $v \in V_k$ with $\deg^-(v) = 0$ are called source vertices and vertices $v \in V_k$ with $\deg^+(v) = 0$ are called sink vertices. Given an arbitrary set of labels $L$, an annotation for a de Bruijn graph $\mathrm{DBG}_k(S)$ is a relation $A \subset V_k \times L$, which assigns to each vertex $v \in V_k$ a set of labels $l(v) \subset L$. We will trivially represent $A$ using a binary matrix $A \in \{0,1\}^{|V_k| \times |L|}$, denote with $A_i$ the $i$-th row of $A$, and with $A_i \oplus A_j$ the element-wise XOR of rows $i$ and $j$.

2.2 RowDiff transformation

RowDiff relies on the observation that adjacent vertices in the graph are likely similarly annotated, and thus, their respective rows in the annotation matrix $A$ are similar as well. This implies that if $(u,v) \in E_k$, storing the difference between $A_u$ and $A_v$ may be more space efficient than storing $A_u$, i.e. $\mathrm{popcount}(A_u \oplus A_v) < \mathrm{popcount}(A_u)$, where $\mathrm{popcount}(x)$ represents the number of set bits in row $x$.
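The observation above can be checked on a toy example. The following is a minimal sketch, not the MetaGraph implementation: it assumes the DNA alphabet ACGT, stores each annotation row as a Python integer bitmask (bit j set if the k-mer occurs in sequence j), and uses three made-up sequences. The helper names `de_bruijn_graph`, `annotate`, and `popcount` are hypothetical.

```python
def de_bruijn_graph(sequences, k):
    """Return the sorted k-mer list and the arc set of the order-k de Bruijn graph."""
    kmers = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    kmer_set = set(kmers)
    # An arc links u to v if the last k-1 characters of u equal the first k-1 of v.
    arcs = {(u, u[1:] + c) for u in kmers for c in "ACGT" if u[1:] + c in kmer_set}
    return kmers, arcs


def annotate(sequences, k):
    """Map each k-mer to its annotation row: bit j is set if it occurs in sequence j."""
    rows = {}
    for j, s in enumerate(sequences):
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            rows[kmer] = rows.get(kmer, 0) | (1 << j)
    return rows


def popcount(x):
    """Number of set bits in a row stored as an integer."""
    return bin(x).count("1")


seqs = ["ACGTAC", "ACGTTC", "CGTACG"]   # illustrative input, one label per sequence
kmers, arcs = de_bruijn_graph(seqs, 3)
rows = annotate(seqs, 3)
# Count the arcs (u, v) for which the XOR delta is sparser than the row of u.
sparser = sum(popcount(rows[u] ^ rows[v]) < popcount(rows[u]) for u, v in arcs)
print(f"{sparser} of {len(arcs)} arcs have a delta sparser than the source row")
```

On this input every arc yields a sparser delta, which is the situation RowDiff is designed to exploit; on noisier data only a fraction of the arcs would.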
RowDiff is defined as a transformation that converts an annotation matrix $A$ of a de Bruijn graph into a new, sparser annotation matrix $A^*$ of the same size and an additional anchor vector $a \in \{0,1\}^{|V_k|}$, that is, $\mathrm{rd}_{\mathrm{DBG}_k(S)} : \{0,1\}^{|V_k| \times |L|} \to \{0,1\}^{|V_k| \times (|L|+1)}$. The anchor vector $a$ stores which rows remain unchanged. We show that the original annotation matrix $A$ can be reconstructed from the RowDiff-transformed matrix $A^*$ and the anchor vector $a$. Empirically, the RowDiff-transformed matrix is significantly better compressible in the typical case where neighboring vertices have similar annotations. We develop an efficient algorithm for defining good anchors and for computing the RowDiff transform $\mathrm{rd}_{\mathrm{DBG}_k(S)}$ and its inverse.

For each vertex $u \in V_k$ we arbitrarily define its RowDiff successor as its lexicographically largest outgoing vertex $\mathrm{succ}(u)$, such that $(u, \mathrm{succ}(u)) \in E_k$ and $\mathrm{succ}(u) \ge v$ for all $(u,v) \in E_k$, if such a vertex exists. RowDiff replaces each row $A_u$ with the (likely sparser) delta relative to its RowDiff successor. For binary rows, the delta is simply the element-wise XOR, $A^*_u := A_u \oplus A_{\mathrm{succ}(u)}$, while for non-binary rows, the delta could store the difference between the row and its successor. In this work, we focus on binary matrices. The previous equation implies that $A_i = A^*_i \oplus A_{\mathrm{succ}(i)}$, which gives us a simple formula for recursively reconstructing the original row.

In order to be able to reconstruct the original annotation $A$ from $A^*$, some rows are left unchanged. A vertex $v \in V_k$ for which the annotation is stored unchanged is called an anchor and its corresponding value in the anchor bit vector is set to 1, $a_v = 1$. Sink vertices do not have a RowDiff successor, and must thus be anchors. Algorithm 1 shows the implementation of the inverse transformation $\mathrm{rd}^{-1}_{\mathrm{DBG}_k(S)}$, which reconstructs the original row $A_i$ from the RowDiff representation $A^*$.

Algorithm 1 Row annotation reconstruction

    function RECONSTRUCTANNOTATION(i)
        row ← A∗[i]
        while a[i] = 0 do        ▷ current vertex is not an anchor
            i ← succ(i)
            row ← row ⊕ A∗[i]
        end while
        return row
    end function

Starting from any vertex in the de Bruijn graph, Algorithm 1 defines a traversal leading to an anchor vertex, for which the annotation was not transformed. Since de Bruijn graphs may have cycles, additional anchor vertices might have to be assigned in order to break RowDiff cycles (those cycles where every vertex is a RowDiff successor relative to its predecessor in the cycle).

Proposition 1. Algorithm 1 finishes for every starting vertex, if and only if every sink vertex in the graph is an anchor and every RowDiff cycle contains at least one anchor vertex.

Proof. Assume the algorithm does not finish for a starting vertex $i$. This implies that $a_{\mathrm{succ}^k(i)} = 0$ for all $k \in \mathbb{N}$. Since the number of vertices in the graph is finite, there must exist $l, m \in \mathbb{N}$, $l \neq m$, s.t. $\mathrm{succ}^l(i) = \mathrm{succ}^m(i)$. Thus, $(\mathrm{succ}^l(i), \mathrm{succ}^{l+1}(i), \ldots, \mathrm{succ}^m(i))$ is a cycle and, hence, must contain at least one anchor vertex, which contradicts the initial assumption. Proof of necessity is equally trivial.

Proposition 2. Algorithm 1 correctly reconstructs the original annotation row $A_i$ for every vertex $i \in V_k$.

Proof. The algorithm computes $A^*_i \oplus A^*_{\mathrm{succ}(i)} \oplus \cdots \oplus A^*_{\mathrm{succ}^p(i)}$, where $a_{\mathrm{succ}^p(i)} = 1$, and thus, $A^*_{\mathrm{succ}^p(i)} = A_{\mathrm{succ}^p(i)}$. By repeatedly reducing the last two terms using $A^*_{\mathrm{succ}^{p-1}(i)} \oplus A_{\mathrm{succ}^p(i)} = A_{\mathrm{succ}^{p-1}(i)}$, the original expression is reduced to $A_i$, which is the desired value.
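To make the transform and its inverse concrete, the following is a minimal in-memory sketch of the RowDiff transform and of the reconstruction in Algorithm 1. It assumes rows are stored as Python integer bitmasks and that the successor and anchor vectors are given as plain lists; the toy graph, the row values, and the function names `rowdiff_transform` and `reconstruct` are illustrative assumptions, not taken from the paper's implementation.

```python
def rowdiff_transform(rows, succ, anchor):
    """Store A_u XOR A_succ(u) for non-anchor rows; anchor rows stay unchanged."""
    return [r if anchor[u] else r ^ rows[succ[u]] for u, r in enumerate(rows)]


def reconstruct(i, diff_rows, succ, anchor):
    """Algorithm 1: XOR the stored rows along the successor path up to an anchor."""
    row = diff_rows[i]
    while not anchor[i]:          # current vertex is not an anchor
        i = succ[i]
        row ^= diff_rows[i]
    return row


# Toy RowDiff path 0 -> 1 -> 2 -> 3 (sink) and a RowDiff cycle 4 -> 5 -> 4.
succ = [1, 2, 3, None, 5, 4]
# The sink (3) must be an anchor, and the cycle needs one anchor (here 4).
anchor = [False, False, False, True, True, False]
rows = [0b0111, 0b0111, 0b0101, 0b0101, 0b1000, 0b1001]

diff_rows = rowdiff_transform(rows, succ, anchor)
restored = [reconstruct(i, diff_rows, succ, anchor) for i in range(len(rows))]
print(diff_rows)          # transformed rows as integers
print(restored == rows)   # the inverse recovers the original matrix
```

In this example the transformed matrix holds 5 set bits instead of the original 13, and removing either anchor would make `reconstruct` loop forever or step past the sink, which mirrors the two conditions of Proposition 1.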
Once the set of anchor vertices $a$ satisfies Proposition 1, the RowDiff-transformed matrix $A^*$ together with the anchor indicator bitmap $a$ encode the original annotation matrix.

2.3 Anchor assignment

In addition to the small set of anchors described in Proposition 1, we seek to cap the maximum RowDiff path length (i.e. a path taken by Algorithm 1) to a certain value $M$ (typically between 10 and 100) by ensuring that at least every $M$-th vertex in a RowDiff path is an anchor, as described below. This guarantees that the number of iterations in Algorithm 1 is bounded by a constant, and thus the average time complexity of reconstructing a single row is $O(l \cdot M)$, where $l \ll |L|$ is the average number of set bits (labels) per row. At the same time, since anchor vertices require storing the original, less sparse annotation row, it is desirable to minimize the total number of anchor vertices in order to keep the popcount (and thus the compressed size) of the RowDiff annotation $A^*$ small.

The following anchor assignment algorithm allocates anchor vertices near-optimally in four steps (see Algorithm 2).

Algorithm 2 Anchor assignment

    function ASSIGNANCHORS(M)
        visited[] ← {False}        ▷ initialize mask of visited vertices
        anchor[] ← {False}         ▷ initialize mask of anchor vertices
        for all s ∈ Sinks() parallel do
            anchor ← TraverseBwd(s, visited, anchor, M)
        for all s ∈ Sources() parallel do
            anchor ← TraverseFwd(s, visited, anchor, M)
        for all s ∈ Forks() parallel do
            anchor ← TraverseFwd(s, visited, anchor, M)
        ▷ only vertices in simple cycles (no forks) left unvisited at this point
        for all s ∈ Nodes() parallel do
            anchor ← TraverseFwd(s, visited, anchor, M)
        return anchor
    end function

First, we traverse RowDiff paths backwards (in parallel) starting from sink vertices (see Algorithm S1). The backward traversal stops either when we reach a source vertex or when we reach a vertex $v \in V_k$ s.t. $\mathrm{succ}(v) \neq u$ for the previously traversed vertex $u$ (see Figure 1, top). Note that the traversal is not terminated when reaching a vertex with multiple incoming arcs, but explores each of them and continues to further traverse these RowDiff paths backwards. When the distance from the current vertex to the next assigned anchor in the current RowDiff path reaches $M$, the vertex is marked as an anchor. In practice, once the backward traversal is finished, the vast majority of the vertices have been traversed, and the anchor assignment is optimal, in the sense that no anchors are closer than $M$ to each other.

In the second step, we start at source vertices and traverse RowDiff paths forwards, i.e. paths of the form $v, \mathrm{succ}(v), \mathrm{succ}^2(v), \ldots$ (see Algorithm S2). The traversal stops when we reach an already visited vertex. In the third step, we start traversing forward at all forks with unvisited vertices. After the third step, the only vertices that were not traversed must belong to a simple cycle (a cycle where all vertices have $\deg^-(v) = \deg^+(v) = 1$). The fourth step traverses these cycles (in parallel). Each of these traversals sets an anchor every $M$ vertices during the traversal. Since we visit each vertex only once, the time complexity of anchor assignment is $O(|V_k|)$.

Proposition 3. The anchors assigned by Algorithm 2 guarantee successful termination of Algorithm 1 for any input vertex $v \in V_k$.

Proof. Step 1 of the algorithm trivially guarantees that all sink vertices are anchor vertices.
Steps 3 and 4 guarantee that all cycles in the graph are traversed and at least one anchor vertex is set in each cycle. The conditions in Proposition 1 are thereby satisfied and Algorithm 1 finishes and successfully reconstructs A from A∗.

One important detail in the forward traversal step is handling the situation when the traversal stops due to merging into a visited vertex. Not setting an anchor in such cases may result in arbitrarily long paths with no anchors (when such merges are chained). Always setting an anchor at a merge would introduce unnecessary anchors and increase the annotation density. We handle merges with the following simple heuristic: we use an additional bit vector, nearAnchor, to mark all vertices that are known to be at a distance smaller than M to an anchor vertex. During forward traversal, when hitting a visited merge vertex that is marked in nearAnchor, no anchor is set; if the merge vertex is not marked in nearAnchor, an anchor is created (Figure 1, bottom). A more precise algorithm for deciding whether a merge vertex should create an anchor would require labeling each vertex with the distance to its nearest anchor. In our implementation we preferred the heuristic algorithm due to its significantly reduced space complexity.

Figure 1: Top: RowDiff traversal. When traversing backward to assign anchor vertices, the traversal stops at vertex u, because succ(u) ≠ v. When traversing forward, the last outgoing vertex is selected. Bottom: Chained merge. Dark grey vertices are marked as nearAnchor. When traversing the light grey vertices, we merge into m1, marked as nearAnchor, thus, no anchor is set. When traversing the blue vertices, an anchor must be set at m2, as m2 is not marked as nearAnchor.
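The goal of anchor assignment (every sink is an anchor, every RowDiff cycle contains an anchor, and no reconstruction path exceeds the cap M) can be illustrated with a deliberately simplified sketch. Unlike Algorithm 2, the version below is sequential, keeps an explicit distance label per vertex (which the text above avoids for space reasons), and replaces the forward traversals and the nearAnchor heuristic by a second backward pass from one vertex of each remaining cycle. The function name `assign_anchors` and the toy successor vector are assumptions made for illustration.

```python
def assign_anchors(succ, max_len):
    """Return anchor flags so that every vertex reaches an anchor in < max_len steps."""
    n = len(succ)
    pred = [[] for _ in range(n)]            # RowDiff predecessors of each vertex
    for v, s in enumerate(succ):
        if s is not None:
            pred[s].append(v)
    anchor = [False] * n
    dist = [None] * n                        # steps to the next anchor on the path

    def traverse_backward(start):
        anchor[start], dist[start] = True, 0
        stack = [start]
        while stack:
            u = stack.pop()
            for p in pred[u]:
                if dist[p] is not None:      # already assigned (closes a cycle)
                    continue
                d = dist[u] + 1
                if d >= max_len:             # path would become too long
                    anchor[p], d = True, 0
                dist[p] = d
                stack.append(p)

    for v in range(n):                       # sinks are always anchors
        if succ[v] is None:
            traverse_backward(v)
    for v in range(n):                       # leftovers all lead into a cycle
        if dist[v] is None:
            c, seen = v, set()
            while c not in seen:             # walk forward until a vertex repeats
                seen.add(c)
                c = succ[c]
            traverse_backward(c)             # c lies on the cycle: anchor it
    return anchor


# Path 0..5 with sink 5, a branch 6 -> 2, a cycle 7 -> 8 -> 9 -> 7, and a tail 10 -> 7.
succ = [1, 2, 3, 4, 5, None, 2, 8, 9, 7, 7]
anchor = assign_anchors(succ, 3)
print([v for v, is_anchor in enumerate(anchor) if is_anchor])
```

The explicit `dist` array is exactly the per-vertex labeling that the heuristic in the text trades away; it is used here only because it makes the bound on the path length easy to verify on small inputs.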
2.3.1 Anchor optimization

To guarantee that none of the rows $A^*_v$ in $A^*$ have more set bits than the corresponding row $A_v$ in the original annotation, we perform the following anchor optimization procedure. For each $v \in V_k$ s.t. $\mathrm{popcount}(A^*_v) > \mathrm{popcount}(A_v)$, we make such vertex an anchor, $a_v := 1$, and replace $A^*_v$ with $A_v$. This ensures that all rows in the RowDiff-transformed annotation matrix are at least as sparse as the corresponding rows in the original annotation matrix.

Proposition 4. Each row in a RowDiff-transformed annotation matrix has the same or fewer set bits than its corresponding row in the original annotation matrix.

The anchor optimization procedure is implemented similarly to the initial construction of RowDiff (see Section 2.4). Thus, it has the same time and space complexity.

2.4 RowDiff construction

A naïve implementation of the RowDiff construction would be to load the matrix A in memory, and gradually replace its rows with their sparsified counterpart, while traversing the graph. Although fast and simple, this method requires keeping the entire annotation matrix A and the graph in memory. Unfortunately, this is often not realistic, as even the annotation matrix A alone can easily reach several terabytes in size. Thus, we developed a distributed parallel construction algorithm that only loads a few columns of A at a time and, hence, needs a limited amount of memory.
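For matrices small enough to fit in memory, the anchor optimization rule of Section 2.3.1 takes only a few lines on top of a naïve row-wise transform. The sketch below is an illustration under that in-memory assumption, with rows stored as integer bitmasks; `optimize_anchors` and the example rows are hypothetical names and values, and the distributed construction described next works column-wise instead.

```python
def popcount(x):
    """Number of set bits in a row stored as an integer."""
    return bin(x).count("1")


def optimize_anchors(rows, diff_rows, anchor):
    """Promote v to an anchor whenever its delta has more set bits than its row."""
    diff_rows, anchor = list(diff_rows), list(anchor)
    for v, (orig, diff) in enumerate(zip(rows, diff_rows)):
        if popcount(diff) > popcount(orig):
            # Storing the original row is cheaper than storing the delta.
            diff_rows[v], anchor[v] = orig, True
    return diff_rows, anchor


# RowDiff path 0 -> 1 -> 2 (sink); row 0 differs strongly from its successor.
rows = [0b0001, 0b1110, 0b1110]
anchor = [False, False, True]
diff_rows = [rows[0] ^ rows[1], rows[1] ^ rows[2], rows[2]]

opt_rows, opt_anchor = optimize_anchors(rows, diff_rows, anchor)
print(opt_rows, opt_anchor)
```

Promoting a vertex to an anchor does not invalidate the deltas of its predecessors, because those deltas are defined with respect to the original rows, so the reconstruction of Algorithm 1 simply stops one step earlier.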
In the first stage, we load the graph and for each vertex pre-compute the indices of the unique RowDiff successor and the (possibly multiple) RowDiff predecessors, stored in vectors succ and pred, respectively. The pred and succ vectors are used to build A∗ in the second stage without the need to query the graph itself and load it in memory. To make the algorithm scale to de Bruijn graphs with trillions of vertices, vectors pred and succ are built and traversed in a streaming manner. They are loaded in small blocks, as described in Algorithms 3 and S3, and never kept in memory in full. Thus, pre-computing the pred and succ vectors essentially makes it possible to query the graph topology during the second stage while only using O(1) additional space. After the RowDiff annotation A∗ has been generated, the pred and succ vectors are not required for querying and, thus, can be discarded.

Algorithm 3 RowDiff transform

    function SPARSIFY(columns)        ▷ sparsifies a batch loaded in memory
        for block ← 0, numRows, BlockSize do        ▷ process by blocks
            load pred[block..block+BlockSize]
            load succ[block..block+BlockSize]
            for all c ∈ columns parallel do
                for all i ∈ c[block..block+BlockSize] do        ▷ iterate only set bits
                    if not c[succ[i]] then
                        ▷ the bits at i and succ[i] are different, hence diff ≠ 0
                        c∗[i] ← True
                    end if
                    for all p ∈ pred[i] do
                        if not c[p] then
                            ▷ the bits at p and i are different, hence diff ≠ 0
                            c∗[p] ← True
                        end if
                    end for
                end for
            end for
        end for
    end function

Figure 2: RowDiff transform algorithm: schematic overview of sparsification on a single machine. Top: Columns are loaded into memory in batches (until memory is exhausted) and each batch is fully transformed to RowDiff. The result is serialized and the process moves on to the next batch. Bottom: Each batch is transformed to RowDiff as follows. The algorithm iteratively loads into memory blocks of the pre-computed vectors pred and succ. Then, all columns of the batch are processed in parallel. The algorithm iterates only through set bits of each column in the active block and computes the elements of the RowDiff-transformed matrix A∗ (see Algorithm 3 for a more detailed description).

The second stage of the construction algorithm (the sparsification workflow) is schematically described in Figure 2. The initial sparsification of A can be trivially distributed by dividing the columns of A into groups and processing each group on a different machine. Each machine processes its assigned columns in batches. The size of each batch is determined dynamically by loading columns into memory until a desired upper limit is reached. This upper limit must exceed the size of the largest column being processed in compressed bit-vector format, but is otherwise unrestricted. For each column in the batch, we iterate only over the set bits (only those rows corresponding to vertices annotated with the label represented by that column) and compare them with the bits at positions pred and succ in the same column to compute the RowDiff-transformed row, as shown in Figure 2.
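The column-wise logic of Algorithm 3 can be sketched for a single column as follows. This is a simplified illustration, not the streaming implementation: it keeps `succ` and `pred` fully in memory instead of loading them in blocks, represents a column as a Python set of row indices with a set bit, and adds an explicit `anchor` check (an assumption of this sketch, since Algorithm 3 as printed does not show how anchors are handled). The function name `sparsify_column` and the toy data are hypothetical.

```python
def sparsify_column(column, succ, pred, anchor):
    """Return the set bits of one RowDiff-transformed column.

    column: set of row indices whose bit is set in the original column.
    """
    out = set()
    for i in column:                          # iterate only over set bits
        if anchor[i] or succ[i] not in column:
            out.add(i)                        # anchor row kept, or bits differ
        for p in pred[i]:
            if not anchor[p] and p not in column:
                out.add(p)                    # bit at p is 0, bit at succ(p) = i is 1
    return out


# Toy RowDiff path 0 -> 1 -> 2 -> 3 (sink) and a RowDiff cycle 4 -> 5 -> 4.
succ = [1, 2, 3, None, 5, 4]
anchor = [False, False, False, True, True, False]
pred = [[] for _ in succ]
for v, s in enumerate(succ):
    if s is not None:
        pred[s].append(v)

# Four annotation columns, each given by the rows in which the label is present.
columns = [{0, 1, 2, 3, 5}, {0, 1}, {0, 1, 2, 3}, {4, 5}]
diff_columns = [sparsify_column(c, succ, pred, anchor) for c in columns]
print(diff_columns)
```

Because each column is processed independently and only its set bits are visited, the work is proportional to the number of set bits times the number of RowDiff predecessors, which is what makes the batch-parallel scheme of Figure 2 possible.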
2.4.1 Scalability and complexity

Algorithm 3 only traverses set bits in $A$, and for each set bit in row $i$ it performs $O(\deg^-(i) + 1)$ operations, hence the total time complexity is $O((1 + \alpha)\,\mathrm{popcount}(A))$, where $\alpha$ is the average in-degree of the graph. For de Bruijn graphs, $\alpha \le |\Sigma|$, where $\Sigma$ is the alphabet, and hence the time complexity is linear in the number of set bits of the original annotation matrix, i.e. $O(\mathrm{popcount}(A))$. Algorithm S3, for constructing pred and succ, traverses each vertex exactly once, hence its time complexity is $O(|V_k|)$. Since the buffer used by Algorithm S3 has a constant size, the space complexity is $|\mathrm{DBG}_k(S)| + O(1)$, where $|\mathrm{DBG}_k(S)|$ denotes the memory footprint of the graph, which, for instance, in the case of the BOSS representation [9], typically does not exceed $4m + o(m)$ bits, where $m = |V_k|$. After taking into account Algorithm 2 for anchor assignment, which requires 3 additional bits per vertex to indicate anchors and the traversal state, and putting it all together, we get that the RowDiff transform can be performed in $O(\mathrm{popcount}(A) + |V_k|)$ time and in $|\mathrm{DBG}_k(S)| + 3|V_k| + O(|c|)$ space, where $|c|$ is the memory footprint of the largest (densest) column of $A$ in a compressed bit-vector format. Note that the first term in the sum is usually dominant.

In conclusion, we mention again that RowDiff construction can be easily distributed on multiple machines with modest hardware requirements and run in parallel on each machine, which makes the method very attractive for practical use on very large data sets.
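As a rough worked example of the space bound, consider a hypothetical graph with $m = 10^{10}$ k-mers (an illustrative value, not one of the data sets studied here). Neglecting the lower-order $o(m)$ term and the $O(|c|)$ column buffer, the bound evaluates as follows.

```latex
\begin{align*}
|\mathrm{DBG}_k(S)| + 3|V_k| &\le 4m + 3m = 7m \ \text{bits}, \\
m = 10^{10}: \quad 7 \times 10^{10} \ \text{bits} &= 8.75 \times 10^{9} \ \text{bytes} \approx 8.75 \ \text{GB}.
\end{align*}
```

Of this, about 5 GB is the graph itself and 3.75 GB is the per-vertex state used during anchor assignment, which is consistent with the remark that the graph term usually dominates.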
2.5 Querying annotations for paths

We now note that, when querying annotations for paths in the graph, or sets of rows corresponding to vertices from a local neighborhood in the graph, Algorithm 1 leads to redundant reconstruction work, as many of the queried rows belong to the same RowDiff paths. To alleviate this, we perform the traversal first and pre-compute all RowDiff paths from the rows queried. Then, we query all diff rows in one batch and reconstruct annotations for each row from the query. This ensures that no arc in these paths is traversed more than once. Moreover, querying all rows in one batch often allows making the query of the underlying representation of the sparsified binary matrix faster by exploiting its potential intrinsic features (e.g., jointly querying n bits in m columns is more cache-efficient and faster than n queries of single bits in each of the m columns).

2.6 Implementation details

We implemented RowDiff as part of the MetaGraph framework [6]. The code for reproducing results of the experiments is available at https://github.com/ratschlab/row_diff. For storing original columns of the annotation matrix as well as the indicator bitmap with anchor vertices, we used the SD vectors from the sdsl-lite library [24] for compressed representation of bitmaps. For compression of the transformed annotation matrix, we used the Multi-BRWT representation scheme proposed in [18], with its improved and scaled up implementation from MetaGraph.

3 Results and Discussion

In this section, we evaluate the performance of the methods described above both in terms of their final representation sizes and their construction time. In addition, we also study the effect of the maximum RowDiff path length on the final RowDiff representation size of the compressed annotations. Finally, we evaluate the degree of size reduction that RowDiff provides on a per-column basis.
3.1 Data sets

We evaluated the compression performance of RowDiff on three data sets with different levels of sequence variability and thus graph density. Our first data set consists of all Fungi sequences from RefSeq release 97 [25], with annotations derived from the taxonomic IDs of the sequences’ respective organisms. Our second and third data sets are derived from the cohort of 10,000 publicly available human RNA-Seq experiments used in [23]. We constructed annotated de Bruijn graphs from the RNA-Seq data set in the same manner as in [23], using a k value of 23, albeit with two samples discarded due to their withdrawal from the Sequence Read Archive. We will refer to this data set as RNA-Seq (k=23). The third data set is constructed using the graph cleaning approach implemented in MetaGraph [6], using a k value of 31. We will refer to this data set as RNA-Seq (k=31). For evaluating construction time and representation size, we shuffled the samples in each data set and generated subsets of increasing size.

We evaluated RowDiff against MST [23], employed in Mantis [11], which, to the best of our knowledge, is the most compact annotation representation method to date. Similarly to Rainbowfish [17], MST reduces the original annotation matrix to a set of unique rows and consists of two components: a vector mapping indexes of rows of the annotation matrix to its unique rows (color classes), and the unique rows compressed in a minimum spanning tree.
In Mantis, this mapping vector is included in a hash table storing the k-mers of the de Bruijn graph, which is usually at least an order of magnitude larger than the compressed annotation. Thus, to make a fair comparison, we exclude the large contribution of Mantis’ graph representation, and only consider the mapping vector, using the same representation as in Rainbowfish [17]. We therefore refer to the MST annotation representation as Rainbow-MST. Note that Rainbow-MST forms a graph annotation representation which, similarly to RowDiff, can be used with any de Bruijn graph representation with indexed k-mers.

3.2 Representation size

We now compare the representation size for RowDiff and other state-of-the-art graph annotation compression methods. Figure 3 shows the representation size for the RNA-Seq (k=31) and RefSeq (Fungi) data sets. On the RNA-Seq (k=31) data set, RowDiff-MultiBRWT effectively takes advantage of the topology of the graph annotation and the similarity of rows of the annotation matrix and achieves a nearly 4-fold size reduction compared to Multi-BRWT applied on non-sparsified columns. Compared to the Rainbow-MST method, RowDiff-MultiBRWT achieves a 2-fold size reduction. Rainbow-MST could not be computed on the subsets with more than 4000 samples because Mantis did not complete within the 10 day limit of our compute cluster. For this reason, we also plotted the size of the Rainbow-MST mapping vector, which, being a subset of the MST annotation data, represents a lower bound for Rainbow-MST.
Figure 3: Representation size (in GB) as a function of the number of SRA samples for (a) the RNA-Seq (k=23) and (b) the RNA-Seq (k=31) data sets, and as a function of the number of taxonomy IDs for (c) the RefSeq (Fungi) data set, shown for Rainbow-MST, Rainbow-MST (mapping only), MultiBRWT, RowDiff-RowSparse, and RowDiff-MultiBRWT. The purple and red lines represent the size of the RowDiff annotation with and without MultiBRWT, respectively. The blue line indicates the size of the Rainbow-MST annotation. The orange line represents the size of the Rainbow-MST mapping vector and represents a lower bound on the Rainbow-MST representation size. Rainbow-MST computation on the RNA-Seq (k=31) data set with > 4000 samples did not complete within the 10 day limit of our compute cluster.

On the RefSeq (Fungi) data set, RowDiff takes advantage of the longer stretches of vertices with identical annotations and achieves a 26-fold size reduction relative to Rainbow-MST. Notably, this significant difference comes from the fact that virtually all of the space used by Rainbow-MST is taken by the mapping vector on this data set.
On the RNA-Seq (k=23) data set, RowDiff-MultiBRWT achieves a 2.5-fold size reduction relative to Multi-BRWT and a 1.5-fold reduction relative to Rainbow-MST. The RowSparse format stores the indices of set bits in each row in a compressed integer vector. This type of annotation is faster to construct and to query than the Multi-BRWT representation, but its footprint is significantly larger on denser datasets, such as RNA-Seq (k=31).

3.3 Effects of graph density on compression

In this section we analyze how the density of the annotated graph affects RowDiff compression. In a first experiment we take a random subset of 1500 entries from the RefSeq (Fungi) data set and build graphs and corresponding annotations for k-mer sizes ranging from 15 to 31. Table 1 shows that as the sparsity of the graph increases (with increasing k-mer length), the compression ratio |A|/|A∗| sharply increases.

Table 1: Compression ratio vs graph density. The sparser the graph, the higher the compression ratio.

    k-mer size    Average node degree    Compression ratio |A|/|A∗|
    15            1.98                   1.30
    17            1.10                   4.79
    19            1.01                   18.89
    23            1.003                  31.66
    31            1.0017                 34.53

In a second experiment, we test how the maximum path length M affects the annotation size for graphs of various densities. Table 2 shows the annotation size on the RNA-Seq (k=23), RNA-Seq (k=31), and RefSeq (Fungi) data sets for various values of M. While increasing the maximum path length has negligible effect on the denser RNA-Seq graphs (with average node degrees of 1.08 and 1.04, respectively), it reduces the annotation size by a factor of up to 5.75 on the much sparser RefSeq (Fungi) graph (with an average node degree of 1.003).

Table 2: Annotation size vs maximum path length M for RNA-Seq (k=23, 31) and RefSeq (Fungi) (k=31). A sharp decrease in annotation size can be observed for the sparse RefSeq (Fungi) graph. Sizes are shown in GB for RNA-Seq and in MB for RefSeq.

    Dataset           M=10    M=20    M=50    M=75    M=100
    RNA-Seq (k=23)    126     118     116     115     119
    RNA-Seq (k=31)    67      63      60      60      59
    RefSeq (Fungi)    1455    813     398     302     252

3.4 Compression of single columns

In this experiment, we measure how RowDiff compresses individual columns of the annotation matrix. Figure 4 shows the compression factor $1 - |A^*_{\cdot,i}|/|A_{\cdot,i}|$ achieved by RowDiff on two datasets representing two different extreme cases of sequence variability. The de Bruijn graph constructed from assembled genomes RefSeq (Fungi) contains significantly fewer branches and bubbles than the graph constructed from reads RNA-Seq (k=31), thus its annotation is significantly better compressed by RowDiff, with an average reduction factor of 42.

Figure 4: Histogram of the column reduction factor $1 - |A^*_{\cdot,i}|/|A_{\cdot,i}|$, i.e. (Original - RowDiff) / Original, for (a) the RNA-Seq (k=23) data set and (b) the RNA-Seq (k=31) data set. On the denser RNA-Seq (k=23) graph, the reduction factor peaks at 0.6, while on RNA-Seq (k=31) the reduction factor peaks at 0.97. The columns are stored as SD compressed vectors.

3.5 Construction time

In Figure 5, we compare the construction times for building RowDiff and MST [23].
The construction time for RowDiff-MultiBRWT includes the RowDiff transform from original columns to the RowDiff format (with M = 100) in addition to the time for conversion of the transformed columns to the Multi-BRWT binary matrix representation. For MST, the time does not include construction of the mapping vector and includes only the time for compression of the unique annotation rows, which is a lower bound on the total construction time for the MST method. Note that the construction time for RowDiff-MultiBRWT grows linearly in the number of columns of the annotation matrix, and superlinearly for MST.

Figure 5: Construction time (in minutes, as a function of the number of annotation columns) for the RowDiff (RowDiff-MultiBRWT) and MST (Rainbow-MST, MST only) annotation representations on the RNA-Seq (k=23) data set with 72 threads.

3.6 Query performance

In this experiment, we measured the time needed for querying the RNA-Seq (k=23) annotation for human transcripts. The query is performed with the algorithm optimized for long paths (see Section 2.5). First, we construct a list of annotation rows that have to be reconstructed from the RowDiff format and a list of all diff rows for querying in the RowDiff matrix. Then, all these rows are queried and the original annotation rows are reconstructed.
Table 3 shows the time taken for reconstruction of the original annotations for 100 and 1000 random human transcripts, which includes the time for querying the diff rows and reconstruction of the original annotations. Since RowDiff requires traversing the de Bruijn graph to get RowDiff paths, the query time for RowDiff depends on the traversal performance of the underlying representation of the de Bruijn graph. In this experiment, we used the succinct de Bruijn graph representation available in MetaGraph [6].

Table 3: Time for querying 100 and 1000 random human transcripts with RowDiff-RowSparse and RowDiff-MultiBRWT. The second column shows the total number of original annotation rows reconstructed for the query. All benchmarks were performed with a single thread on Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz.

Query data         # rows reconstructed   Query time (RowDiff-RowSparse)   Query time (RowDiff-MultiBRWT)
100 transcripts    44,995                 8.7 sec                          38 sec
1000 transcripts   553,280                95 sec                           342 sec

4 Conclusions

In this paper, we introduced RowDiff, a new technique for compacting graph labels by leveraging the likely similarities in annotations of nodes adjacent in the graph. We designed a parallel construction algorithm with linear time complexity in the number of node-label pairs and small memory footprint. In addition, the algorithm can efficiently be distributed and parallelized, making it applicable on arbitrarily large graphs. RowDiff reduced the size of graph annotations by 2- to 26-fold when used in combination with Multi-BRWT relative to Mantis-MST, the most efficient state-of-the-art representation. Although the row reconstruction method inevitably leads to an increase in ad hoc row query time due to the larger number of required annotation matrix queries, this limitation is alleviated in practice due to the tendency of real-world sequences to feature k-mers which co-occur on matching RowDiff paths.
The optimization of anchor assignment is a clear direction for future development of these methods. The anchor assignment method we have presented is designed to reduce the row reconstruction time by setting an upper bound on the traversal length. However, given that there is a trade-off between the size and the query time of the final representation, designing an objective function and a corresponding algorithm to best optimize these measures is a non-trivial task. Moving beyond the representation of binary relations, a simple extension of the RowDiff method can be used as an efficient way to represent genomic coordinates for indexes of reference genomes. By representing a coordinate at each anchor node, the coordinates of all other nodes in that anchor’s corresponding RowDiff path can be computed via their traversal distance to the anchor. Each improvement in the compression of sequence graphs and their associated annotations opens up further opportunities for their real-world applicability. When handling large annotations, even a 2-fold difference in the representation size can make a previously unapproachable annotation accessible to the available hardware. With RowDiff, we have demonstrated that there still is great potential for improving the representation of annotations on sequence graphs. 5 Acknowledgements Mikhail Karasikov and Harun Mustafa are funded by the Swiss National Science Foundation Grant No.
407540 167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”. A. K. and D. D. are funded from ETH core funding to Gunnar Rätsch.

References
[1] Stephens, Z. D. et al. Big data: astronomical or genomical? PLoS biology 13, e1002195 (2015).
[2] Cox, A. J., Bauer, M. J., Jakobi, T. & Rosone, G. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28, 1415–1419 (2012).
[3] Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using minhash. Genome biology 17, 132 (2016).
[4] Breitwieser, F., Baker, D. & Salzberg, S. L. Krakenuniq: confident and fast metagenomics classification using unique k-mer counts. Genome biology 19, 198 (2018).
[5] Bradley, P., den Bakker, H. C., Rocha, E. P., McVean, G. & Iqbal, Z. Ultrafast search of all deposited bacterial and viral genomic data. Nature biotechnology 37, 152 (2019).
[6] Karasikov, M. et al. Metagraph: Indexing and analysing nucleotide archives at petabase-scale. bioRxiv (2020).
[7] Chikhi, R. & Rizk, G. Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithms for Molecular Biology 8, 22 (2013).
[8] Benoit, G. et al. Reference-free compression of high throughput sequencing data with a probabilistic de bruijn graph. BMC bioinformatics 16, 288 (2015).
[9] Bowe, A., Onodera, T., Sadakane, K. & Shibuya, T. Succinct de Bruijn graphs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012).
[10] Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de bruijn graphs. Nature genetics 44, 226–232 (2012).
[11] Pandey, P. et al. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. Cell Systems (2018). URL http://dx.doi.org/10.1016/j.cels.2018.05.021.
[12] Muggli, M. D. et al. Succinct colored de Bruijn graphs. Bioinformatics (2017).
[13] Muggli, M. D., Alipanahi, B. & Boucher, C. Building large updatable colored de bruijn graphs via merging. Bioinformatics 35, i51–i60 (2019).
[14] Raman, R., Raman, V. & Satti, S. R. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG) 3, 43–es (2007).
[15] Elias, P. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM) 21, 246–260 (1974).
[16] Fano, R. M. On the number of bits required to implement an associative memory (Massachusetts Institute of Technology, Project MAC, 1971).
[17] Almodaresi, F., Pandey, P. & Patro, R. Rainbowfish: A Succinct Colored de Bruijn Graph Representation. In Schwartz, R. & Reinert, K. (eds.) 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), vol. 88 of Leibniz International Proceedings in Informatics (LIPIcs), 18:1–18:15 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2017). URL http://drops.dagstuhl.de/opus/volltexte/2017/7657.
[18] Karasikov, M. et al. Sparse binary relation representations for genome graph annotation. Journal of Computational Biology 27, 626–639 (2020).
[19] Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. Cobs: a compact bit-sliced signature index.
In International Symposium on String Processing and Information Retrieval, 285–303 (Springer, 2019).
[20] Harris, R. S. & Medvedev, P. Improved representation of sequence bloom trees. Bioinformatics 36, 721–727 (2020).
[21] Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research 31, 1–12 (2021).
[22] Mustafa, H. et al. Dynamic compression schemes for graph coloring. Bioinformatics 35, 407–414 (2019).
[23] Almodaresi, F., Pandey, P., Ferdman, M., Johnson, R. & Patro, R. An efficient, scalable and exact representation of high-dimensional color information enabled via de bruijn graph search. In International Conference on Research in Computational Molecular Biology, 1–18 (Springer, 2019).
[24] Gog, S., Beller, T., Moffat, A. & Petri, M. From theory to practice: Plug and play with succinct data structures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014).
[25] O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Research (2016).

10_1101-2020_12_24_424317 ---- Multi-class Cancer Classification and Biomarker Identification using Deep Learning Fariha Muazzam 1 Abstract Genetic data is important for analysing cellular functions whose disruption gives rise to various kinds of cancer.
The intricacies of gene interaction are captured in various kinds of data for cancer detection through sequencing technology, but diagnosis, prognosis and treatment remain difficult. The advent of machine learning has helped researchers with supervised and unsupervised learning tasks as well as gene identification, but the results have not been fully satisfactory. This research addresses multi-class cancer classification, feature extraction and relevant gene identification through deep learning methods for 12 different types of cancer using RNA-Seq data from The Cancer Genome Atlas. The work was constrained by hardware resource availability; within those constraints, the experiments performed have shown promising results. Stacked Denoising Autoencoders were used for feature extraction and biomarker identification, and 1D Convolutional Neural Networks for classification. Classification was performed with extracted features and with relevant genes, which gave average accuracies of around 94% and 95% respectively. We were able to identify generic cancer-related pathways and their associated genes through the weight matrix and features generated by the Stacked Denoising Autoencoders. The common pathways include the WNT Signalling Pathway and Angiogenesis. Moreover, across all pathways some recurrent genes were observed, namely PIK3C2G, PCDHB8 and WNT10A, and these genes were found in the literature to be involved in multiple types of cancer. The proposed approach shows superior performance and promise against traditional techniques used by the bioinformatics community, in terms of accuracy and relevant gene identification. Keywords: Cancer Detection, Cancer Prevention, Targeted Therapy, Precision Medicine 1 Department of Computer Science, National University of Computer And Emerging Sciences Correspondence: Fariha Muazzam (l165018@lhr.nu.edu.pk) INTRODUCTION Genes play an important role in the normal functioning of humans’ bodily processes and physiology (1).
However, there is a degree of uncertainty associated with the molecular events that occur, which can alter routine processes. Such changes in mechanism can lead to mutations or chromosomal rearrangements, which can be harmful or benign but are heavily associated with the causation of cancer (1). Identifying the genes or groups of genes that propagate cancerous cell formation provides a meaningful opportunity to detect cancer at an early stage or to halt its progression at a later stage (1). Cancer is currently one of the leading diseases, causing 8.2 million deaths each year (2). Cancer diagnosis and treatment remain the center of attention for medical professionals and researchers everywhere. The development of high-throughput DNA sequencing technology has led to varied discoveries in the field of genomics, as mutation profiles, RNA expression or micro-RNA profiles can now be detected easily (1). The importance of such genetic data can be seen from the fact that cancer diagnosis, progression and prognosis can be statistically analyzed through machine learning algorithms. Furthermore, sub-networks of genes and individual biomarkers responsible for cancer can be singled out for precision medicine (1) (3). Machine learning and deep learning techniques have been used extensively in domains such as image processing, natural language processing and audio recognition, and have shown great promise. In bioinformatics, however, the focus has mostly been on recognizing subtypes or biomarkers through clustering algorithms. In the recent past, the focus has shifted towards classification of RNA-seq expression data through supervised learning algorithms. With somatic mutations, only very basic methods have been used for classification. Also, multi-class classification has not really been explored, even though cross-cancer biomarker identification has been attempted.
Machine learning algorithms ease two challenges associated with the study of genetic data: extraction of meaningful genes and classification of cancer. Techniques like Principal Component Analysis, K-Means Clustering and Independent Component Analysis have been used to reduce dimensions, while K-Nearest Neighbors, Random Forest and Support Vector Machines have been used for classification (4) (5). Due to the availability of large datasets and computational resources, researchers have moved towards using deep learning algorithms in classification problems like object detection or image classification (1). More recently, deep learning has been applied to genetic data in bioinformatics for drug discovery, gene regulation or protein classification, as huge sets of data are accessible (6). Hence, deep learning architectures have been applied to cancer detection based on gene expression or mutation profiles to improve classification accuracy and identification of biomarkers. For cancer detection through gene expression, Generative Adversarial Networks (GAN) (7), Stacked Denoising Autoencoders (SDA) (5), Artificial Neural Networks (ANN) (5), Discriminative Deep Belief Networks (DDBN) (8) and One-Dimensional Convolutional Neural Networks (1DCNN) (9) have been used. Deep Neural Networks (DNN) have been used for heterogeneous classification of different types of cancer using somatic mutation profiles (1). This research picks up from the detection of different types of cancer from RNA-Seq expression data using deep neural networks with dimensionality reduction (1). RNA-seq expression data for breast cancer has been reduced using Kernel Principal Component Analysis (KPCA) and Principal Component Analysis (PCA) and classified using SVM with linear and Radial Basis Function kernels and ANN (5). Heterogeneous RNA-seq expression data has been analyzed with SDA for feature extraction and biomarker identification, and with DNN and 1DCNN for multi-class classification.
The purpose was to achieve high classification performance and extract meaningful genes for targeted therapy by exploring deep learning architectures that have not yet been tried on RNA-seq expression-based cancer classification. This section is followed by a description of materials and methods, the results acquired and the final conclusion of the study. RELATED WORK Cancer detection from genetic data has been a challenging but important task for bioinformatics researchers. Due to cheaper DNA sequencing technology, larger datasets are available for diagnosis, treatment or prognosis. Hence, various feature extraction and machine learning algorithms have been used for dimensionality reduction and classification over the years. Moreover, the field has moved from studying the effects of individual gene functions to gene networks, and the same networks can cause various diseases as well. Over the years, with the advancement of sequencing technology, scientists have incorporated various forms of gene expression data in their studies, ranging from microarray expression to DNA sequencing (10). In recent years there has been a shift from microarrays to RNA-seq datasets for gene expression-based cancer research. However, regardless of the data type, most of the techniques used for cancer detection or relevant gene identification have been the same. Clustering analysis has been used to group significant genes together and aid accurate classification of samples. K-Nearest Neighbours has been used for quantifying correlation between gene expressions for prostate cancer (11) and with varied distance measures for classification of breast cancer (12). K-means clustering has also been used for classification based on driver genes, identified using wavelet transforms, for colon and leukemia samples (13). Hierarchical clustering has been utilized to classify subtypes of breast cancer data (14) and cancer data with reduced dimensionality (15).
Apart from clustering, SVMs have been used extensively for classification of gene expression profiles for different kinds of cancers. Multi-category SVMs aided the subtype classification of a leukaemia dataset to a great extent (4). Network algorithms have also been used to identify networks of genes contributing to the propagation of multiple types of cancer (16). As numerous machine learning algorithms have been developed, researchers have explored their usefulness with respect to cancer diagnosis and biomarker identification. With the advent of deep learning methods, there has been an obvious inclination towards using them for dimensionality reduction as well as classification. Gupta et al. (1) used SDA to learn a meaningful representation of gene expression data from the yeast cell cycle. Clusters of genes evaluated from the raw input were already labeled and were compared with the clustering of the output of SDA. Moreover, PCA was also tested on the gene expression profiles and evaluated with the same clustering algorithms. The results reveal that SDA captures gene co-expression better than PCA. Danaee et al. (5) focused their research on extracting deeply connected genes from RNA-seq expression data of breast cancer using SDA. PCA and KPCA were used as comparative techniques to measure SDA’s efficacy. Apart from reducing dimensionality, they analyzed the weight matrix of SDA to identify contributor genes. These genes have been tagged as Deeply Connected Genes (DCGs). Panther pathways were used to analyze the functions corresponding to different genes and tumor suppressor genes. Bhat et al. (7) experimented with Generative Adversarial Deep Convolution Networks to accurately classify gene expression-based datasets of two types of cancer: breast cancer and prostate cancer. Karabulut et al. (8) demonstrated the efficiency of DDBN for classification of cancer as compared to traditional models like SVM.
Experiments were performed on three different types of cancer individually: laryngeal, colorectal and bladder. For comparison, SVM, Random Forest and K-NN were applied on all datasets. Results revealed that DDBN outperformed all the aforementioned classifiers. Liu et al. (9) focused their research on discriminating tumor samples from normal ones. They proposed a sample expansion method inspired by SAE and SDA to enlarge the set of training samples. A 1DCNN was proposed in that paper for tumor classification. It takes input as a one-dimensional vector instead of the traditional two dimensions used for image classification. The performance of 1DCNN was better than that of SAE on each dataset. Teixeira et al. (17) worked on singling out the most informative genes using SDA for classification of thyroid cancer using ANN. They used traditional methods like PCA and Kernel PCA for comparison with the deep learning method for feature extraction. The output of SDA was analyzed by extracting the weight matrix and using the Connected Weights Method, and three groups of genes with inter-related functions were discovered. Hence, the effectiveness of deep learning models for feature extraction and relevant gene identification is prominent, especially as the field moves towards precision medicine. Multi-class classification and biomarker identification are the current focus, and for that reason researchers have been experimenting with deep learning. Deep learning has become popular for classification problems involving larger datasets and for feature extraction in a wide variety of fields, and more recently in bioinformatics too. MATERIAL AND METHOD Acquisition of data Gene expression datasets have been the most widely used for anomaly classification, as mentioned before. The dataset for this study has been assembled from portals supported by The Cancer Genome Atlas (TCGA).
• RNA-seq Expressions: The TCGA portal provides gene expression data in the form of read counts as well as normalized expression values for 33 different types of cancer. For multi-class classification, each dataset must contain the same genes; this is ensured by the fact that they are sequenced with the same technology and preprocessed with the same techniques. The Broad Institute GDAC portal provides RNA-seq expression datasets in raw form as well as in RSEM-normalized form. For this research, the Illumina HiSeq RSEM-normalized dataset has been used, as seen in Table 1.

Table 1: TCGA multi-class cancer dataset

Cancer Type                                      No. of Cancerous Samples
Breast invasive carcinoma (BRCA)                 1100
Adrenocortical carcinoma (ACC)                   70
Cervical and endocervical cancer (CESC)          304
Head and neck squamous carcinoma (HNSC)          520
Kidney renal papillary cell carcinoma (KIRP)     290
Brain lower grade glioma (LGG)                   516
Lung adenocarcinoma (LUAD)                       501
Pancreatic adenocarcinoma (PAAD)                 178
Prostate adenocarcinoma (PRAD)                   497
Stomach adenocarcinoma (STAD)                    416
Uterine carcinosarcoma (UCS)                     57
Bladder urothelial carcinoma (BLCA)              408

Dataset Split The datasets for the 12 types were combined into one dataset, with each sample given a label for its type of cancer. The labels were numbers from 0 to 11, where each number corresponds to a specific cancer type. The dataset contained around 4967 samples for 12 types of cancer and was split into training, validation and test sets in proportions of 70%, 15% and 15% respectively. As there was an apparent class imbalance among the different types, the split was kept proportionate per class; that is, each type was divided into three sets with the aforementioned percentage split.

Preprocessing The genes have been normalized, and those with zero values across all samples have been removed, as they would not contribute to the results.

SDA The experiments used the output of SDA as input to a 1DCNN for classification of cancer types. SDA was trained through greedy layer-wise training, where each layer is trained for a specific number of iterations and the output of the preceding layer is used as input to the succeeding layer. The number of hidden units per layer was decreased gradually because this is known to incorporate the features better. Five experiments were performed, producing five high-ranked gene sets and reduced feature sets. The output of SDA was used in two ways:
• Using Reduced Features: The output of the final layer of SDA was the reduced features of the dataset. These features were stored for each sample of the training dataset after the desired iterations were performed. The final weights of each layer were also stored so that they could be used to reduce features for the test and validation datasets.
• Using High Ranked Genes: Analysis of the final weight matrix shows that the weights of the genes were normally distributed. A small portion of genes had high weights and were regarded as high-weight genes. These genes were selected in the training and testing datasets so as to reduce the number of features, as done in (18). The weight matrices of the layers were multiplied to generate a (number of genes) × (number of features) matrix:
$$M_{weight} = \prod_{i=0}^{n} W_i \qquad (1)$$
For each node, the mean and standard deviation of the weights were calculated, and genes were ranked by selecting those lying outside a specific number of standard deviations from the mean:
$$G = \{\, g : w_g < \mathrm{mean} - n_{std} \cdot \mathrm{std} \ \text{ or } \ w_g > \mathrm{mean} + n_{std} \cdot \mathrm{std} \,\} \qquad (2)$$
where $w_g$ denotes the weight of gene $g$ for the node.

1DCNN The reduced features extracted using SDA were fed to the 1DCNN for classification. The overall accuracy of the system determines whether the extracted features were of any significance.

Biomarker Identification For biomarker identification, high-ranked gene sets were generated for different SDA architectures and their relevant pathways were identified from the Panther database.
Overlapping pathways and genes were analyzed amongst all sets, and quite a few overlapped. The overlapping genes were checked against the literature to confirm whether the identified genes are cross-cancer ones, and they are identified as biomarkers. Figure 1: Workflow RESULTS This study explored only the RNA-Seq expression dataset due to its availability; however, the pipeline built is expected to be useful for other types of datasets as well. SDA As mentioned in the previous section, the output of SDA was used in two forms: reduced features and high-ranked genes based on the weight matrix. SDAs with different numbers of layers and different hidden units were trained. As per the literature, if the number of hidden units is decreased gradually, SDA better incorporates the features for reconstruction. The original number of genes was 20531; removing the genes with zero value across all samples left 20313 genes. The hidden units ranged between 15000 and 200 for the whole architecture, but the first two layers contained a fixed number of units, 15000 and 10000 respectively. Only the third-last and last layers were changed across experiments. Most experiments were conducted with 3 and 4 layers, as these gave higher accuracy. For reduced features, the best results were obtained when the reconstruction layer contained a higher number of units. The features were tested by using 1DCNNs for classification. However, the accuracy plateaued at around 96.5% with 4000 features. The graph in Figure 2 shows the accuracy achieved with 1DCNNs for varied numbers of layers and reduced features. The experiments included in this graph are with 3 and 4 layers. The first and second layers contained a fixed 15000 and 10000 units respectively.
[Workflow diagram stages (Figure 1): RNA-Seq Expressions, Normalization, Dimensionality Reduction/Feature Extraction, Classification, High-ranked Genes, Biomarker Identification]

Figure 2: Accuracy with linear combination of reduced features

High-Ranked Genes The weight matrix of each layer of SDA was used to rank genes based on the combination of their weights. As per the literature, genes with higher weights tend to act as contributing genes towards cancer. As per (18), the weight matrix of SDA follows an approximately normal distribution, and the genes with highly negative or highly positive weights are the significant genes. Genes far from the mean weight are therefore categorized as high-ranked, and we used the number of standard deviations from the mean to identify the relevant genes. Due to limited resources, the experiments could only be performed within a restricted range; nevertheless, they show strong performance in terms of relevant gene identification. It was observed that the genes lying far from the mean were indeed the relevant ones. Also, the genes that overlapped amongst different SDA architectures were considered to be cross-cancer relevant genes. Since the aim of this research has been to achieve maximum performance with minimal genes, architectures within the range of 200-1000 features give better performance within 4-5 standard deviations. Four genes were found to be common amongst all pathways for all sets across all standard deviations, so evidence of their involvement in multiple cancer types was sought in the literature. The study shows the promise and relevance of the identified genes, as seen in Table 2.
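The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes the layer weight matrices are multiplied in layer order to give a genes × features matrix, and that a gene is selected if its weight for any output node lies more than `n_std` standard deviations from that node's mean. The function names, the toy matrices and the any-node selection rule are all hypothetical.

```python
from statistics import mean, pstdev


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combined_weights(layers):
    """Multiply per-layer weight matrices into one genes x features matrix."""
    m = layers[0]
    for w in layers[1:]:
        m = matmul(m, w)
    return m


def high_ranked_genes(m_weight, n_std):
    """Indices of genes whose weight is an outlier for at least one output node."""
    selected = set()
    for col in zip(*m_weight):            # one column per output node
        mu, sd = mean(col), pstdev(col)
        for gene, w in enumerate(col):
            if abs(w - mu) > n_std * sd:  # outside mean +/- n_std * std
                selected.add(gene)
    return sorted(selected)


# Toy example: 5 genes, one hidden layer of 2 units, 1 output feature.
W0 = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0], [3.0, 2.0]]
W1 = [[1.0], [1.0]]
M = combined_weights([W0, W1])
top = high_ranked_genes(M, n_std=1.5)
```

With only five toy genes the threshold has to be small; on a matrix with thousands of genes, thresholds such as the 4-5 standard deviations mentioned in the text become meaningful.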
Table 2: Relevance of Identified Genes In Literature

Gene      Cancer Types (reference)
WNT10A    BRCA (19), LUAD (20), BLCA (21), PRAD (22), PAAD (23)
PIK3C2G   BRCA (24), BLCA (25), HNSC (26)

Apart from that, 4 pathways were found to be common among the overlapping genes for different standard deviations; two of them, the WNT Pathway and Angiogenesis, are the same as those found in all sets of experiment-generated genes for all standard deviations. Also, the genes associated with these pathways are similar to those found in the experiment-generated gene sets. Figure 3 shows how standard deviations between 4 and 5 relate pathways and overlapping genes, and the scope for meaningful analysis.

Figure 3: Pathway hits against different standard deviations

The following tables summarize the results for reduced features and high-ranked genes in comparison with other similar studies.

Table 3: Summarized Results for Reduced Features

Paper               Classification                                    Mean Per Class Accuracy
Danaee et al. (5)   Breast Cancer                                     98.26
Proposed            Multi-class (12 types including breast cancer)    94.25

Table 4: Summarized Results for High-ranked genes

Paper               Classification                                    High-ranked genes   Mean Per Class Accuracy   Pathway Hits
Danaee et al. (5)   Breast Cancer                                     500                 94.78                     1
Proposed            Multi-class (12 types including breast cancer)    956                 95.32                     7

CONCLUSION This study aimed at classifying 12 types of cancer and identifying relevant genes, and the results show that the proposed approach is promising for this task. Using SDA with 1DCNN gave an average accuracy of 94% for reduced features and 95% for high-ranked genes. This shows that relevant gene sets can help with the cancer classification task as well as with cross-cancer gene and pathway identification. We were able to identify, from the Panther Database, cancer-relevant pathways and genes for the sets that the different experiments generated. The common genes amongst all experiments were verified in the literature to be involved in multiple cancers.
This shows that our method can be used for multi-class or single-class cancer classification and for recognizing relevant genes as biomarkers. This gives hope of identifying genes that have not yet been explored in the literature. The Panther Database is used by the bioinformatics community to study the origin, families and relevance of genes with respect to a single type or varied types of cancer. That involves a lot of manual analysis, but deep learning decreases the load by pointing to relevant genes and pathways, or by identifying newer pathways and genes. Hardware resources constrained the study, but the reliability and significance of automating classification and identification with deep learning were still demonstrated. More experiments would reveal more avenues that could be explored for cancer study through deep learning. Furthermore, using more types of cancer would also aid in identifying larger sets of cross-cancer biomarkers and pathways. This study is a step towards showing the relevance of automated gene identification techniques that are reliable and can handle a large amount of variation, unknowns and ambiguity. In contrast, traditional statistical techniques for genes involve thresholding that depends on the samples and the genes involved. Even though the study was limited in terms of GPU hours, it still provided good results. ADDITIONAL INFORMATION Ethics Approval This is an original study performed using the open-source TCGA dataset, and there is no violation of rights and obligations in the usage of the dataset. Data Availability The data was downloaded from the Broad Institute Firehose database (https://gdac.broadinstitute.org/). Conflict of Interest There is no conflict of interest with regard to the authors’ contributions. Funding The project was completed using the first author’s own funds. No external funding was involved. Author’s Contributions The project was implemented and the paper was written by the first author.
Second author provided guidance for forming the workflow and methodology of the study. Acknowledgements This study could not have been without the guidance and support of my supervisor Dr. Saira Karim. REFERENCES 1. Gupta A, Wang H, Ganapathiraju M. Learning structure in gene expression data using deep architectures, with an application to gene clustering. In: Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. 2015. p. 1328–35. 2. Yuan Y, Shi Y, Li C, Kim J, Cai W, Han Z, et al. DeepGene : an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics [Internet]. 2016;17(Suppl 17). Available from: http://dx.doi.org/10.1186/s12859-016-1334-9 3. Fawzy H, Kamel M, Al-amodi HSAB. Exploitation of Gene Expression and Cancer Biomarkers in Paving the Path to Era of Personalized Medicine. Genomics Proteomics Bioinformatics [Internet]. 2017;15(4):220–35. Available from: http://dx.doi.org/10.1016/j.gpb.2016.11.005 4. Lee Y, Lee C-K. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 2003;19(9):1132–9. 5. Danaee P, Ghaeini R, Hendrix DA. A deep learning approach for cancer detection and relevant gene identification. In: PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017. 2017. p. 219–29. 6. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform. 2017;18(5):851–69. 7. Bhat RR, Viswanath V, Li X. DeepCancer : Detecting Cancer through Gene Expressions via Deep Generative Learning. (Ml). 8. Karabulut EM. Discriminative deep belief networks for microarray based cancer classification . 2017;28(3):1016–24. 9. Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion- based deep learning. Oncotarget. 2017;8(65):109646. 10. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57. 11. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. 
Gene expression correlates of clinical prostate cancer behavior. 2002;1(March):203–9. 12. Rules C, Medjahed SA. Breast Cancer Diagnosis by using k-Nearest Neighbor with Different Distances and Classification Rules. 2013;(January). 13. Mishra P, Bhoi N, Meher J. Effective clustering of microarray gene expression data using signal processing and soft computing methods. In: Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015 International Conference on. 2015. p. 1–4. 14. Woodward WA, Krishnamurthy S, Yamauchi H, El-Zein R, Ogura D, Kitadai E, et al. Genomic and expression analysis of microdissected inflammatory breast cancer. Breast Cancer Res Treat. 2013;138(3):761–72. 15. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7(6):673. 16. Martinez-Ledesma E, Verhaak RGW, Treviño V. Identification of a multi-cancer gene expression biomarker for cancer clinical outcomes using a network-based algorithm. Sci Rep. 2015;5:11966. 17. Teixeira V, Camacho R, Ferreira PG. Learning influential genes on cancer gene expression data with stacked denoising autoencoders. In: Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on. 2017. p. 1201–5. 18. Microbe-host AI. ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Interactions. 1(1):1–17. 19. Braune E-B, Seshire A, Lendahl U. Notch and Wnt Dysregulation and Its Relevance for Breast Cancer and Tumor Initiation. Biomedicines. 2018;6(4):101. doi:10.3390/biomedicines6040101 20. Tammela T, Sanchez-Rivera FJ, Cetinbas NM, et al. A Wnt-producing niche drives proliferative potential and progression in lung adenocarcinoma. Nature. 2017;545(7654):355-359. doi:10.1038/nature22334 21. Zhang M, Li H, Zou D, Gao J.
Ruguo key genes and tumor driving factors identification of bladder cancer based on the RNA-seq profile. Onco Targets Ther. 2016;9:2717-2723. doi:10.2147/OTT.S92529 22. Ahmad I, Sansom OJ. Role of Wnt signalling in advanced prostate cancer. J Pathol. 2018;245(1):3-5. doi:10.1002/path.5029 23. Fakhar M, Najumuddin, Gul M, Rashid S. Antagonistic role of Klotho-derived peptides dynamics in the pancreatic cancer treatment through obstructing WNT-1 and Frizzled binding. Biophys Chem. 2018;240(June):107-117. doi:10.1016/j.bpc.2018.07.002 24. Fidalgo F, Rodrigues TC, Pinilla M, et al. Lymphovascular invasion and histologic grade are associated with specific genomic profiles in invasive carcinomas of the breast. Tumor Biol. 2015;36(3):1835-1848. doi:10.1007/s13277-014-2786-z 25. Tilley SK, Kim WY, Fry RC. Analysis of bladder cancer tumor CpG methylation and gene expression within The Cancer Genome Atlas identifies GRIA1 as a prognostic biomarker for basal-like bladder cancer. Am J Cancer Res. 2017;7(9):1850-1862. 26. Simpson DR, Mell LK, Cohen EEW. Targeting the PI3K/AKT/mTOR pathway in squamous cell carcinoma of the head and neck. Oral Oncol. 2015;51(4):291-298. doi:10.1016/j.oraloncology.2014.11.012 Multi-class Cancer Classification and Biomarker Identification using Deep Learning Abstract INTRODUCTION RELATED WORK MATERIAL AND METHOD Preprocessing SDA 1DCNN RESULTS High-Ranked Genes 10_1101-2021_02_01_429246 ---- Sequence-specific minimizers via polar sets Sequence-specific minimizers via polar sets Hongyu Zheng1, Carl Kingsford1, and Guillaume Marçais∗1 1Computational Biology Department, Carnegie Mellon University, Pittsburgh, USA February 10, 2021 Abstract Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. 
Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. 1 Introduction The minimizer (Roberts et al., 2004a,b) methods, also known as winnowing (Schleimer et al., 2003), are methods to sample positions or k-mers (substrings of length k) from a long string. Thanks to its versatility, this method is used in many bioinformatics programs to reduce memory requirements and computational resources. Read mappers (Li and Birol, 2018; Jain et al., 2020b,a), k-mer counters (Erbert et al., 2017; Deorowicz et al., 2015), genome assemblers (Ye et al., 2012; Chikhi et al., 2015) and many more (see Marçais et al. (2019) for a review) use minimizers.
In most cases, sampling the smallest number of positions, as long as a string is roughly uniformly sampled, is desirable as it leads to sparser data structures or less computation as fewer k-mers need to be processed. Minimizers have such a guarantee of approximate uniform sampling: given the parameters w and k, a minimizer is guaranteed to select at least one k-mer in every window of w consecutive k-mers. It achieves this goal by selecting the smallest k-mer (the “minimizer”) in every w-long window, where smallest is defined by a choice of an order O on the k-mers. Even though every minimizer scheme satisfies the constraint above, depending on the choice of the order O the total number of selected k-mers may vary significantly. Consequently, research on minimizers has focused on finding orders O that obtain the lowest possible density, where the density is defined as the number of selected k-mers over the length of the sequence. In particular, most research concentrates on the average case: what is the lowest expected density given a long random input sequence? (Marçais et al., 2017, 2018; Ekim et al., 2020; Orenstein et al., 2016). In practice, many tools use a “random minimizer” where the order is defined by choosing at random a permutation of all the k-mers (e.g., by using a hash function on the k-mers).
∗To whom correspondence should be addressed. gmarcais@cs.cmu.edu
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.01.429246; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
This choice has the advantage of being simple to implement and providing good performance on the average case. Here we investigate a different setup that is common in bioinformatics applications. Instead of the average density over a random input we try to optimize the density for one particular string or sequence. When applying minimizers in computational genomics, in many scenarios the sequence is known well in advance and it does not change very often. For example, a read aligner may align reads repeatedly against the same reference genome (e.g., the human reference genome). In such cases, optimizing the density on this specific sequence is more meaningful than on a random sequence. Moreover, the human genome has markedly different properties than a random sequence and optimization for the average case may not carry over to this specific sequence. In the read aligner example, a minimizer with lower density leads to a smaller index to save on disk and fewer seeds to consider in the seed-and-extend alignment algorithm while preserving the same sensitivity thanks to the approximate uniform sampling property. The idea of constructing sequence sketches tailored to a specific sequence has been explored before (Chikhi et al., 2015; DeBlasio et al., 2019; Jain et al., 2020b), but it remains less understood than the average case. Random sequences have nice properties that allow for simplified probabilistic analysis. Consequently, different analytic tools are needed to analyze sequence-specific minimizers. In fact, minimizers designed to have low density in the average case often offer only modest improvements on sequences of interest such as reference genomes (Zheng et al., 2020a). The current theory for minimizers with low density in average is tightly linked to the theory of universal hitting sets (UHS) (Orenstein et al., 2016; Marçais et al., 2018; Kempa and Kociumaka, 2019). 
As the name suggests, a UHS is a set of k-mers that “hits” every w-long window of every possible sequence (hence the universality; it is an unavoidable set of k-mers). Universal hitting sets of small size generate minimizers with a provable upper-bound on their density. Universal hitting sets are less useful in the sequence-specific case as the requirement to hit every window of every sequence is too strong, and UHSs are too large to provide a meaningful upper-bound on the density in the sequence-specific case. New theoretical tools are needed to analyze the sequence-specific case. Frequency-based orders are examples of sequence-specific minimizers (Chikhi et al., 2015; Jain et al., 2020b). In these constructions, k-mers that occur less frequently in the sequence compare less than k-mers that occur more frequently. The intuition is to select rare k-mers as they should be spread apart in the sequence, hence giving a sparse sampling. This intuition is only partially correct. First, there is no theoretical guarantee that a frequency-based order gives low density minimizers, and there are many theoretical counter-examples. Second, in practice, frequency-based orders often give minimizers with lower density, but not always. For example, Winnowmap (Jain et al., 2020b) uses a two-tier classification (very frequent vs. less frequent k-mers) as it performs better than an order strictly following frequency of occurrence. Another approach to sequence-specific minimizers is to start from a UHS U and to remove as many k-mers from U as possible, as long as it still hits every w-long window of the sequence of interest (DeBlasio et al., 2019). Because this procedure starts with a UHS that is not related to the sequence, the amount of possible improvement in density is limited. Additionally, given the exponential growth in size of the UHS with k, current methods are computationally limited to k ≤ 15, which is limiting in many applications.
The construction proposed here takes a different approach and introduces polar sets. The polar sets concept can be seen as complementary to the universal hitting sets: while a UHS is a set of k-mers that intersects with every w-long window at least once, a polar set is a set of k-mers that intersects with any window at most once. The name “polar set” is an analogy to a set of polar opposite magnets that cannot be too close to one another. That is, our construction builds upon sets of k-mers that are sparse in the sequence of interest, and consequently the minimizers derived from these polar sets have provably tight bounds on their density. Our main contribution is Theorem 1, which gives an upper bound and a lower bound on the density obtained by a minimizer created from a polar set. These bounds are expressed in terms of the “total link energy” of the polar set on the given sequence. The link energy is a new concept that measures how well spread apart the elements of the polar sets are in the sequence: the higher the energy, the more spread apart the k-mers are. Then we show that the link energy is almost exactly the improvement in density one gains from using a minimizer created from the polar set compared to a random minimizer. In the following sections we also show that the problem of finding a polar set with maximum total link energy is, unsurprisingly, NP-hard, and we describe a heuristic to create polar sets with high total link energy. Finally, we show that our implementation of this heuristic generates minimizers that have specific density on the human reference genome much lower than any other previous methods, and, for some parameter choices, relatively close to the theoretical minimum. 2 Methods 2.1 Overview We set the stage by defining important terms and concepts, then giving an overview of the main results, which are then proved formally in the following sections. The sequence S is a string on the alphabet Σ of size σ = |Σ|. The parameters k and w define respectively the length of the k-mers and the window size. We assume that S is relatively long compared to these parameters: $|S| \gg w + k$. Definition 1 (Minimizer and Windows). A minimizer is characterized by (w,k,O) where O is a complete order of $\Sigma^k$. A window is a sequence of length (w + k − 1) consisting of exactly w k-mers. Given a window as input, the minimizer outputs the location of the smallest k-mer according to O, breaking ties by preferring the leftmost k-mer. The minimizer (w,k,O) is applied to the sequence S by finding the position of the smallest k-mer in every window of S. Because two consecutive windows in S have a large overlap, the same k-mer is often selected in these two windows, hence the minimizer returns a sampling of positions in the sequence S. The specific density of the minimizer on S is defined as the number of selected positions over the length |S|. The density is between 1/w, because at least one k-mer in every window must be picked, and 1, because it is a sampling of the positions of S. Therefore the goal is to find orders O that have a density as close to 1/w as possible. A minimizer with density 1/w is a perfect minimizer.
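As an illustration of Definition 1, the following minimal Python sketch applies a (w, k, O) minimizer to a string and reports its specific density. The function names (`kmers`, `hash_order`, `minimizer_positions`, `specific_density`) and the use of SHA-256 as a stand-in for a random order O are assumptions made for this example; they are not taken from the reference implementation.

```python
import hashlib


def kmers(seq, k):
    """All k-mers of seq, indexed by their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hash_order(kmer):
    """Illustrative pseudo-random order O: compare k-mers by their hash."""
    return hashlib.sha256(kmer.encode()).digest()


def minimizer_positions(seq, w, k, order=hash_order):
    """Start positions selected by the (w, k, O) minimizer on seq."""
    km = kmers(seq, k)
    selected = set()
    for start in range(len(km) - w + 1):  # one window = w consecutive k-mers
        # min() keeps the first minimal element, i.e. the leftmost on ties
        selected.add(min(range(start, start + w), key=lambda i: order(km[i])))
    return sorted(selected)


def specific_density(seq, w, k, order=hash_order):
    """Number of selected positions divided by |S|."""
    return len(minimizer_positions(seq, w, k, order)) / len(seq)
```

Passing `order=lambda m: m` gives the lexicographic order, which is convenient for checking small cases by hand; any other key function defining a total order on k-mers can be substituted.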
For simplicity, when stating the density of a minimizer we ignore any additive term that is o(1/w) (i.e., asymptotically negligible compared to 1/w). A random minimizer is defined by choosing at random one of the permutations of all k-mers. The expected density of a random minimizer is 2/(w + 1) (Schleimer et al., 2003; Roberts et al., 2004b; Zheng et al., 2020a). Equivalently, the expected distance between adjacent selected k-mers is (w + 1)/2. The random minimizers will serve as a baseline to compare to. Defining orders. For practical reasons, we define orders by defining a set U and considering orders that are compatible with U: an order O is compatible with U if for O every element of U compares less than any element not in U. That is, only the smallest elements for O are specified (the elements of U) and a minimizer using an order compatible with U will preferentially select the elements of U. There exist many orders that are compatible with U as the relative order between the elements within U is not specified. Universal Hitting Sets. A set U is a universal hitting set if every one of the $\sigma^{w+k-1}$ possible windows (recall σ is the size of the alphabet) contains a k-mer from U. In the average case, minimizers compatible with U have densities upper bounded by $|U|/\sigma^{k}$, because only k-mers from the universal hitting set can be selected. Supplementary Section S2 provides a more detailed discussion of why this bound provided by universal hitting sets does not always apply for sequence-specific minimizer analysis, and why universal hitting sets do not specialize well. Short sequences. On a short random sequence (in a sense made precise by Lemma 1) most k-mers are unique (i.e., they occur only once in the sequence S). Therefore, it is likely that there is a set U of unique k-mers of S that are exactly w bases apart in S, and a minimizer compatible with U is perfect.
Unfortunately most sequences of interest (e.g., reference genomes) are too long, too repetitive and in general do not satisfy the hypothesis of Lemma 1. For most sequences it is not possible to find a set of “perfect seeds” of k-mers spaced exactly w apart. Polar sets. A polar set is a relaxed version of a perfect set: any pair of k-mers m1 and m2 from a polar set A are always more than w/2 bases apart in S (see the more general Definition 2). The intuition behind this definition is that for a minimizer compatible with A, any k-mer from A selected by the minimizer is at distance ≥ (w + 1)/2 from the previous and the next selected k-mer. Hence, k-mers selected from A are at least as sparse, and usually more sparse, than k-mers selected using a random minimizer in expectation. Section 2.2 gives a formal definition of the link energy of a polar set and Theorem 1 gives upper and lower bounds using this link energy for the density of a minimizer compatible with a polar set. This theorem shows that the link energy of the polar set A is a measure of how much reduction in density is obtained by using a minimizer compatible with A rather than a random minimizer. Hence, designing a polar set with high link energy is a method to find minimizers with provably low density. Section 2.3 introduces layered polar sets, which are an extension to polar sets, and builds a heuristic method to create such sets.
2.2 Polar sets and link energy 2.2.1 Key Definitions
Figure 1: Comparing universal hitting sets, perfect seeds (compatible minimizers become perfect minimizers) and polar sets. Each block indicates a k-mer, and each segment indicates a window of length 5 (w = 5). To provide a better contrast with universal hitting sets, we show polar sets with slackness s = 0 (see Definition 2). Panel labels: k-mers in UHS (all windows contain at least one UHS k-mer; some windows have multiple); polar k-mers with slackness s = 0 (all windows contain at most one polar k-mer; some windows have none); perfect seeds (all windows contain exactly one seed k-mer).
We now define polar sets, the key component for our proposed methods. Definition 2 (Polar set). Given sequence S and parameters (w,k,s) with 0 ≤ s < 1/2, a polar set A of slackness s is a set of k-mers such that every two k-mers in A appear at least (1 − s)w bases apart in S. This can be viewed as a complementary idea to the universal hitting sets or a relaxed form of perfect sets. As discussed in the introduction, a universal hitting set requires the set to hit every w consecutive k-mers at least once, while a polar set with s = 0 requires the set to hit every w consecutive k-mers at most once. A set of perfect seeds, if it exists, is both a polar set with zero slackness and a universal hitting set. See Figure 1 for a more concrete example. The condition s < 1/2 is critical for our analysis. Specifically, this condition is required to obtain a lower bound on the specific density of compatible minimizers, not just an upper bound. Definition 3 (Link energy). Given sequence S, parameters (w,k) and a polar set A, if two k-mers on S are l ≤ w bases apart and are both in A, the link energy of the pair is defined as 2l/(w + 1) − 1 ≥ 0. The total link energy of A is the sum of link energy across all eligible pairs.
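Definitions 2 and 3 can be checked directly on a small example. The sketch below is only an illustration under stated assumptions: the helper names (`occurrences`, `is_polar_set`, `total_link_energy`) are hypothetical, occurrences of the same k-mer are treated like any other pair of occurrences, and links are formed only between consecutive occurrences of k-mers from A (which covers all eligible pairs when A is a valid polar set with s < 1/2, since any two occurrences are then more than w/2 apart).

```python
def occurrences(seq, k, A):
    """Sorted start positions in seq of every k-mer that belongs to A."""
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in A]


def is_polar_set(seq, w, k, A, s=0.0):
    """Definition 2: all occurrences of k-mers in A are >= (1 - s) * w apart."""
    pos = occurrences(seq, k, A)
    return all(b - a >= (1 - s) * w for a, b in zip(pos, pos[1:]))


def total_link_energy(seq, w, k, A):
    """Definition 3: sum of 2l/(w + 1) - 1 over linked pairs (l <= w apart).

    Assumes A is a valid polar set on seq, so only consecutive
    occurrences can be at most w bases apart.
    """
    pos = occurrences(seq, k, A)
    return sum(2 * (b - a) / (w + 1) - 1
               for a, b in zip(pos, pos[1:]) if b - a <= w)
```

For instance, with S = "ACTTGATTTT", w = 4 and k = 2, the set {"AC", "GA"} occurs at positions 0 and 4, so it is polar with s = 0 and forms one link of energy 2·4/5 − 1 = 0.6, whereas {"AC", "TT"} is not polar because "AC" and "TT" occur 2 bases apart.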
Any two k-mers from A in S must be more than w/2 bases apart, so two k-mers cannot form a link if there is a third k-mer from A between them. With s = 0, the link energy is fixed to be 2w/(w + 1) − 1 = 1 − 2/(w + 1) ≈ 1 for each eligible pair, and the total link energy is approximately the number of pairs that form a link, which in turn is the number of k-mer pairs in the polar set that are exactly w bases away on S. In the following sections, we introduce and discuss the backbone of the polar set framework, which revolves around closer inspection of how a random minimizer works on a specific sequence, and drawing contrast between sequence-specific minimizers and non-sequence-specific minimizers. We use the term “non-sequence-specific minimizers” to refer to constructions of minimizers that do not specifically target a certain sequence, but rather aim to minimize density, the expected specific density on a random string. 2.2.2 Perfect Minimizer for short sequences A perfect minimizer is a minimizer that achieves density of exactly 1/w. While the only known examples of perfect minimizers are in the asymptotic case where $w \gg k$ (Marçais et al., 2018), perfect sequence-specific minimizers exist with high probability for short sequences. Lemma 1. If $|S| < (1 - \varepsilon)\sqrt{w}\,\sigma^{k/2}$, then with probability at least $\varepsilon$ a random sequence of length |S| has a perfect minimizer. Proof. The optimal minimizer is constructed with fixed interval sampling.
More specifically, we take every w-th k-mer in S and denote the resulting k-mer set U, then construct a minimizer compatible with U. The resulting minimizer is perfect if and only if the k-mers in U only appear in the selected locations. There are |S|/w selected locations and (1 − 1/w)|S| locations not selected, and for each pair of selected and not selected locations, the k-mers at these two locations are identical with probability $\sigma^{-k}$ (see Supplementary Section S1). By union bound, the probability that the sequence violates the polar set condition is at most $|S|^2\sigma^{-k}/w < (1-\varepsilon)^2$, and the sequence has a perfect minimizer with probability at least $1-(1-\varepsilon)^2 > \varepsilon$. 2.2.3 Context Energy and Energy Savers Contexts provide an alternative way to measure the density of a minimizer (Zheng et al., 2020a). These play a central role in the analysis of polar sets. Definition 4 (Charged Contexts). A context of S is a substring of length (w + k), or equivalently (w + 1) consecutive k-mers, or equivalently 2 consecutive windows. A context is charged if the minimizer selects a different k-mer in the first window than in the second window. See top left of Figure 2 for examples of charged contexts. Intuitively, a charged context corresponds to the event that a new k-mer is picked, and counting picked k-mers is equivalent to counting charged contexts. Lemma 2 (Specific Density by Charged Contexts). For a given sequence S and a minimizer, the number of selected locations by the minimizer equals the number of charged contexts plus 1. Given a context c, define E(c) as the probability that c will be charged with a random minimizer (one with a random ordering of k-mers), which we call the energy of c. Lemma 3. The expected number of picked k-mers in S under a random minimizer is $1 + E_0(S)$, where $E_0(S) = \sum_c E(c)$ is called the initial energy of S and the summation is over every context of S. This is proved by combining the linearity of expectation and Lemma 2.
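Lemma 2 can be verified empirically on any string. The following sketch, with hypothetical helper names and a SHA-256 based order standing in for a random order, records the position picked in each window, counts the contexts (pairs of consecutive windows) whose picks differ, and compares that count with the number of distinct selected positions.

```python
import hashlib


def hash_order(kmer):
    """Illustrative pseudo-random order on k-mers."""
    return hashlib.sha256(kmer.encode()).digest()


def window_picks(seq, w, k, order=hash_order):
    """Position picked in each window, leftmost on ties."""
    km = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [min(range(s, s + w), key=lambda i: order(km[i]))
            for s in range(len(km) - w + 1)]


def count_charged_contexts(seq, w, k, order=hash_order):
    """A context is charged when its two windows pick different positions."""
    picks = window_picks(seq, w, k, order)
    return sum(1 for a, b in zip(picks, picks[1:]) if a != b)


def count_selected(seq, w, k, order=hash_order):
    """Number of distinct positions selected over the whole sequence."""
    return len(set(window_picks(seq, w, k, order)))
```

On "ACGTACGTAC" with w = 3, k = 2 and the lexicographic order (`order=lambda m: m`), the windows pick positions 0, 1, 4, 4, 4, 5, 8, giving 4 charged contexts and 5 selected positions, in line with Lemma 2.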
This implies that the total energy of a sequence is directly related to the specific density of random minimizers, which is the number of picked locations in S divided by the number of k-mers in S. E(c) admits a simple formula: Lemma 4. E(c) = 2/u(c) if the last k-mer in the context is unique, 1/u(c) otherwise, where u(c) denotes the number of unique k-mers in c. Proof. Consider an imaginary minimizer with w′ = w + 1 and identical k. The context of a (w,k)-minimizer is a window of the imaginary minimizer, and it is charged if and only if the imaginary minimizer picks either the first or the last k-mer. If the imaginary minimizer does not pick either end, the two constituent windows of the context share the same minimal k-mer, and the context is not charged. With a random minimizer, the probability that the first k-mer is picked in the imaginary window is 1/u(c). The probability that the last k-mer is picked is 1/u(c) if the last k-mer is unique, 0 otherwise, because the minimizer breaks ties by preferring the leftmost k-mer. The two events are mutually exclusive, so E(c) is the sum of these two terms. If all k-mers in a context are unique, E(c) = 2/(w + 1) is guaranteed, which we call the baseline. If this holds for all contexts, a random minimizer will have specific density of 2/(w + 1), similar to applying a random minimizer to a random sequence. As lower u(c) only increases E(c), E(c) < 2/(w + 1) only if the last k-mer in c is not unique and there are over (w + 1)/2 unique k-mers in the context. Definition 5.
A context c is called an energy saver if E(c) < 2/(w + 1), and its energy deficit is defined as 2/(w + 1) − E(c). The energy deficit of S, denoted D(S), is the total energy deficit across all energy savers: $D(S) = \sum_c \max(0, 2/(w+1) - E(c))$. In general, the value of D(S) is very small due to the fact that energy saver contexts (those with E(c) < 2/(w + 1)) are rare. Lemma 5. For a random context, the probability that it is an energy saver is at most $w\sigma^{-k}$. Proof. We bound the probability that the last k-mer in a context is not unique. The probability that the last k-mer equals a specific k-mer in another location is $\sigma^{-k}$ (see Supplementary Section S1). Applying union bound over w other k-mers (as each context has (w + 1) k-mers) we get the desired result. There are examples of sequences where energy saver contexts are abundant. An extreme scenario is when the sequence S has a period of w, and has w distinct k-mers. In this case, all contexts become energy saver contexts. These scenarios are rare in practice. Similarly, we can define energy spenders and energy surplus as follows: Definition 6. A context c is called an energy spender if E(c) > 2/(w + 1), and its surplus is defined as E(c) − 2/(w + 1). The energy surplus of S, denoted X(S), is the total energy surplus across all energy spenders: $X(S) = \sum_c \max(0, E(c) - 2/(w+1))$. Contexts with energy surpluses are more common than energy savers, but still fairly rare in a random sequence with suitable choice of w and k: Lemma 6. For a random context, the probability that it is an energy spender is at most $w(w+1)\sigma^{-k}/2$. Proof. A context becomes an energy spender if the last k-mer is unique, and some k-mer appears twice. We bound the probability that some k-mer in the context appears twice. Following previous arguments, any two k-mers in a given context are identical to each other with probability $\sigma^{-k}$, and we apply a union bound of size w(w + 1)/2 (enumerating over pairs of k-mers) to obtain the desired result.
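The quantities from Lemma 3, Lemma 4 and Definitions 5 and 6 are straightforward to compute for a given string. The sketch below is a minimal illustration with hypothetical function names; it reads "the last k-mer is unique" as "occurs exactly once within the context" and u(c) as the number of distinct k-mers in the context, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction


def context_energy(context_kmers):
    """E(c) from Lemma 4 for one context, given as a list of w + 1 k-mers."""
    u = len(set(context_kmers))                       # distinct k-mers in c
    last_unique = context_kmers.count(context_kmers[-1]) == 1
    return Fraction(2 if last_unique else 1, u)


def energy_profile(seq, w, k):
    """Return (E0(S), D(S), X(S)): initial energy, deficit and surplus."""
    km = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    baseline = Fraction(2, w + 1)
    e0 = deficit = surplus = Fraction(0)
    for j in range(len(km) - w):                      # contexts of w + 1 k-mers
        e = context_energy(km[j:j + w + 1])
        e0 += e
        deficit += max(Fraction(0), baseline - e)     # energy savers
        surplus += max(Fraction(0), e - baseline)     # energy spenders
    return e0, deficit, surplus
```

The periodic example from the text can be reproduced with S = "ACGACGACG", w = 3 and k = 2: every context holds 3 distinct k-mers with a repeated last k-mer, so E(c) = 1/3 < 1/2 and all five contexts are energy savers. By construction, E0(S) equals the number of contexts times 2/(w + 1), minus D(S), plus X(S).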
2.2.4 Density Bounds with Polar Sets With these tools in place, we now state the main theorem on polar sets. Theorem 1. Given a sequence S and a polar set A on S, let $E_0(S)$ be the initial energy of S, D(S) be the total energy deficit, X(S) be the total energy surplus, and L(S,A) be the total link energy from the polar set. The number of selected k-mers over S for a random minimizer compatible with A is at most $1 + E_0(S) + D(S) - L(S,A)$, and at least $1 + E_0(S) - X(S) - L(S,A)$. Proof. We first prove the upper bound. We start by elevating the energy of every energy saver context to the baseline 2/(w + 1). By definition, this increases the total energy of S by D(S), so the number of selected k-mers is now upper bounded by $1 + E_0(S) + D(S)$. Formally, $\sum E(x) \le 1 + E_0(S) + D(S)$.
[Figure 2 (caption not recovered): a context has two windows; it is charged if a different k-mer is selected in these windows and not charged otherwise. Panels: contexts without polar k-mers; contexts with polar k-mers, treated as a random minimizer. Examples shown: singleton (2 contexts charged, 6 covered, L = 0); linked polar k-mers (2 + 2 = 4 contexts charged, 13 covered, L = 1/3).]
[...] D(S). For a ballpark estimate, we assume S is a random sequence, and assume the slackness parameter s = 0 in construction of the polar set. In this setup, each link has exactly 1 − 2/(w + 1) ≈ 1 energy. As seen in Lemma 5, a context is an energy saver with probability $w\sigma^{-k}$, and its deficit is at most 2/(w + 1) − 1/w ≈ 1/w, meaning $D(S) \approx \sigma^{-k}|S|$.
This further means we need the number of links to be at least σ^{−k}|S| to provably beat a random minimizer. On the other hand, ignoring the effect of D(S), in order to beat the specific density of a random minimizer by ε/(w + 1), a total link energy of ε|S|/(w + 1) is needed. Assuming no slackness, this means the number of links needs to be at least ε|S|/(w − 1). Intuitively, an ε portion of the sequence needs to be covered by links between close enough k-mers in the polar set. A proper polar set requires s > 1/2 for the main theorem to hold. When s ≤ 1/2, only the upper bound part of the theorem holds with an alternative definition of link energy. We will discuss the alternative definition in Section 2.3.4, and further discuss generalization of polar sets in Supplementary Section S2.4.

2.2.5 Hardness of Optimizing Polar Sets

The link energy formulation of polar sets allows us to cast the problem in a graph-theoretical framework. Consider an undirected, weighted graph where every unique k-mer is a vertex. An edge connects two k-mers with the following weight: if these two k-mers ever appear within fewer than (1−s)w bases of each other in S, the weight is −∞. Otherwise, the weight of this edge is the total link energy obtained by selecting only these two k-mers, which might establish several links given that each k-mer may appear in S multiple times. There can also be self-loops with weights, given that a k-mer may appear close to itself on the reference sequence. The problem of finding optimal polar sets becomes the problem of finding an induced subgraph with maximum weight. The general maximum induced subgraph problem is well known to be NP-hard via reduction from max-clique. In Supplementary Section S3, we provide an explicit proof that optimization of polar sets, even with an alphabet of three, is NP-hard.

2.3 Constructing Polar Sets

In this section, we propose a practical extension to polar sets, and formally introduce our heuristics.
[Figure 3 graphic: rows labeled Layer 1, Layer 2, Layer 3 and Final Coverage (Not-Covered k-mers); legend: selected and not-covered locations, covered locations.]

Figure 3: Examples of layered polar sets, with three layers. Without layered polar sets, the k-mers from layers 2 and 3 could not be selected into the polar set because of self-collision. The whole sequence is covered in this case (every window contains a polar k-mer from one layer). Layer 1 is the one with highest priority and our layered heuristics construct it first.

2.3.1 Layered Polar Sets

Assume we have already constructed a polar set A that covers some segments of the reference sequence. Here, covered means that every window contains a k-mer from the polar set, or equivalently, A acts as a universal hitting set on these segments. Now, to cover the rest of the reference, we shall extend A so more k-mers become polar k-mers. It is natural to consider generating a polar set over the uncovered portion of the reference sequence, then merging this set with A. This, however, leads to problems. Let A′ be a polar set over the uncovered portion of the reference sequence. A∪A′ might not always be a valid polar set, because a k-mer m′ ∈ A′ may appear in the already-covered part of the reference sequence, and appear close to another k-mer m ∈ A, thus violating the polar set condition for A∪A′. On the other hand, the reason we set up the constraint for polar sets is to ensure that k-mers in the polar set will always be selected by any compatible minimizer. In other words, we want to ensure we know exactly the set of k-mers that will be selected.
The issue was that m′ ∈ A′ might not always be selected by a compatible minimizer. However, from the perspective of constructing efficient minimizers, we do not need m′ to be selected everywhere, as in some places the reference sequence is already covered with k-mers in A. By forcing m < m′ for any m ∈ A, we ensure that m′ will only be selected outside the segments covered by A. Applying this argument to all k-mers in A′, we can essentially ignore the sequence segments already covered by A when constructing A′, as long as the ordering is satisfied. This gives a way to progressively construct the layers of polar sets: at each layer we only need to consider regions of the reference sequence that are not yet covered by previous layers. Formally:

Definition 7. A layered polar set is a list of sets of k-mers {Ai}, for 1 ≤ i ≤ m. With slackness s, the layered polar set condition is satisfied if for any k-mer in Aj, for each of its appearances at a location t in the reference sequence, either of the following holds:
• It is at least (1 − s)w bases apart from any k-mer in {A1, A2, · · · , Aj}.
• It is covered: there are two k-mers in {A1, A2, · · · , Aj−1} (importantly, this does not include Aj), appearing at locations l and h, satisfying l < t < h and h − l ≤ w.

Similarly, a compatible order for {Ai} is an order that places all k-mers from A1 first in arbitrary order, then those in A2, ..., then those in Am, and finally those not in any of {Ai} in a random order. The link energy L({Ai},S) is similarly defined over the pairs of close k-mer appearances that are not covered. More formally:

Definition 8. For a layered polar set, if two k-mers in the layered polar sets, not necessarily from the same layer, appear l ≤ w bases apart in S, and neither is covered, the link energy between them is 2l/(w + 1) − 1 > 0. L({Ai},S) is the total link energy across all pairs.

These definitions of layered polar sets and link energy have two important properties. First, the link energy is non-decreasing as more layers are added to the set. Second, an almost identical argument proves the same bounds for layered polar sets as for polar sets in Theorem 1. See Figure 3 for a concrete example of layered polar sets.

2.3.2 Polar Set Heuristic

We consider a simple heuristic to generate a polar set. The core idea is to select as many k-mers as possible from the set of k-mers that appear exactly w bases away from each other. We cannot select all of them, as doing so may violate the polar set condition due to some k-mers appearing multiple times. Because reference sequences are long strings (in the range of billions of bases for mammalian genomes), we consider algorithms that scale well with the length of the reference sequence, preferably close to linear.

Fix an offset o ∈ [0, w − 1]. We start by listing all locations t such that t ≡ o (mod w) in the reference sequence S. We then randomly shuffle the locations, and for each location t in this random order, add the k-mer at location t to the polar set. When we add a k-mer m to the polar set, we also locate and remove all k-mers in the polar set that appear fewer than (1 − s)w bases away from m. Additionally, if a k-mer appears multiple times in the list, it is considered only once at the first encounter.
This is to prioritize k-mers that appear less often; frequent k-mers are expected to be processed early given their multiple occurrences, and are more likely to be absent from the final polar set as they have more chances to be removed due to conflicts. Jain et al. (2020b) have explored a similar idea in building tiered random minimizers using a biased hash function.

Our algorithm also has a variant, which we call "monotonic". In this variant, we require that adding a new k-mer m and removing the k-mers conflicting with m actually increases the link energy. Otherwise, the k-mer is skipped and no conflicting k-mers are removed. This variant is slower but results in more efficient polar sets.

We filter k-mers before they are considered for addition to polar sets. A k-mer that collides with itself (appears fewer than (1 − s)w bases away from its own copy) cannot be in the polar set. We also filter out k-mers by their frequency in the reference sequence (see Section 2.3.3 for the threshold value).

Algorithm 1 Pseudocode for Polar Set Heuristics
function PolarSet(S, w, k)
    Start with an empty set A ← {} and a random offset o
    Shuffle the list of locations t ≡ o (mod w) for 0 ≤ t < |S|
    for each t in the list and the k-mer mt at location t do
        Skip if mt is filtered, or has been processed previously
        Obtain list l of occurrences of mt via suffix array
        Obtain list of conflicting k-mers via linked blocks
        Remove all conflicting k-mers and add mt to the polar set A
    end for
    return A
end function

Algorithm 1 shows the pseudocode for the non-monotonic variant of the heuristic. The monotonic variant is similar. We describe the data structures in Section 2.3.4, and analyze the time complexity in Section 2.3.5.

2.3.3 Layered Heuristics and Hyperparameters

We construct layered polar sets with a similar algorithm. The properties of layered polar sets guarantee that new layers cannot decrease the final link energy of the polar set.
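As a concrete illustration of the single-layer, non-monotonic heuristic of Algorithm 1, the following Python sketch may be helpful. It is not the authors' implementation and makes several simplifying assumptions: a dictionary of occurrences stands in for the suffix array, a position-to-k-mer map stands in for the linked blocks (so the running time is roughly O(nw) rather than linear), the frequency filter is omitted, and the function name and `seed` parameter are ours.

```python
import math
import random
from collections import defaultdict


def polar_set_heuristic(S, w, k, s=0.0, seed=0):
    """Illustrative single-layer, non-monotonic polar set heuristic."""
    rng = random.Random(seed)
    min_gap = (1 - s) * w              # polar occurrences must be >= min_gap apart
    reach = math.ceil(min_gap)         # scan radius when searching for conflicts
    n_kmers = len(S) - k + 1

    # Stand-in for the suffix array: k-mer -> sorted list of start positions.
    occurrences = defaultdict(list)
    for t in range(n_kmers):
        occurrences[S[t:t + k]].append(t)

    def collides_with_itself(m):
        pos = occurrences[m]
        return any(b - a < min_gap for a, b in zip(pos, pos[1:]))

    polar = set()
    selected_at = {}                   # stand-in for linked blocks: position -> polar k-mer
    processed = set()

    offset = rng.randrange(w)
    locations = list(range(offset, n_kmers, w))
    rng.shuffle(locations)

    for t in locations:
        m = S[t:t + k]
        if m in processed:             # each k-mer is considered once, at first encounter
            continue
        processed.add(m)
        if collides_with_itself(m):    # filtered: can never be a polar k-mer
            continue
        # Collect polar k-mers with an occurrence closer than min_gap to any occurrence of m.
        conflicts = set()
        for p in occurrences[m]:
            for q in range(max(0, p - reach), min(n_kmers, p + reach + 1)):
                if q in selected_at and abs(q - p) < min_gap:
                    conflicts.add(selected_at[q])
        # Remove the conflicting k-mers, then add m with all of its occurrences.
        for c in conflicts:
            polar.discard(c)
            for q in occurrences[c]:
                del selected_at[q]
        polar.add(m)
        for p in occurrences[m]:
            selected_at[p] = m
    return polar
```

By construction, after every iteration all occurrences of k-mers in the returned set are at least (1 − s)w bases apart, which is the polar set condition this sketch is meant to demonstrate.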
We rerun the polar set heuristic multiple times, each time with a new random offset o. Each round is run with the current layers of polar sets, and the resulting polar set is added as a new layer. The algorithm for each layer is mostly identical to the single-layer version, with a few changes.
• When processing a k-mer, we skip all of its occurrences that are covered by existing layers of the polar set.
• We skip k-mers at non-covered locations t that are fewer than (1 − s)w bases away from a k-mer in a previous layer. These k-mers cannot be in the layer without violating the layered polar set condition.
• At the end of each round, we remove all k-mers selected in the current layer that do not form a link with any k-mers.

We also gradually increase the threshold of k-mer frequency at each round to prioritize low-frequency k-mers. In our experiments, we use a total of 7 rounds, with the last two rounds being monotonic. The frequency threshold is set at the value that includes 85% of the locations of the reference in the first round, gradually increasing to 95% in the last round.

The slackness s is also a tunable parameter, which determines when a pair of k-mers is considered in collision. A lower value of s ensures that the distance between adjacent polar k-mers is large, giving higher link energy for every pair of linked k-mers, but results in a smaller number of selected k-mers, implying fewer links.
A higher value of s means larger polar sets covering more of the reference sequence and more links formed, but adjacent polar k-mers may be closer to each other, resulting in lower energy per link. In our experiments, we use a fixed slackness s = 0.4 after parameter search. This results in approximately 20% less efficient links (average link energy compared to the theoretical maximum), but higher total link energy due to the inclusion of more links. A more thorough parameter tuning might suggest a gradually increasing value of s between rounds.

2.3.4 Supporting Data Structures

Our heuristics require some data structures to operate efficiently both in theory and in practice.

Suffix Array. In order to quickly index k-mers and obtain the list of occurrences of a k-mer, we precompute the suffix array, the inverse suffix array and the heights table (also known as the LCP array) of the reference sequence. All can be computed in linear time. This allows us to find the list of T locations that share the same k-mer as location t in O(T) time.

Linked Blocks. The layered polar set property ensures that in any stretch of w/2 bases, at most one k-mer at one location is selected into any layer of the layered polar sets, excluding covered locations. We use a data structure called linked blocks to represent the set of these selected locations of k-mers. Let h = ⌊w/2⌋. We divide the locations in the reference sequence into h-long blocks, and use an array of length |S|/h to represent these blocks. Each value in the array C[b] is either −1, meaning there is no selected location within the block spanning locations [bh, (b + 1)h), or a nonnegative integer j, indicating that the k-mer at location bh + j is selected. With linked blocks we can do the following operation quickly:

Definition 9. PeekL(x) returns the closest selected location to the left of x, up to w bases.

The operation is fast because we only need to query up to three blocks. Adding a location and removing a location also only involve a single block. Similarly, we can define PeekR(x).

With this data structure, we can implement many critical operations in the aforementioned heuristics. The step of filtering k-mers, more specifically determining whether a k-mer collides with itself, is done using this data structure, in similar fashion to bucket sorting. By maintaining two linked blocks, one for the current layer and one for all previous layers, we can determine whether a location is covered by the previous layers, and list collisions on the current layer.

Calculating Link Energy. In the monotonic variant of our heuristics, we need to calculate the total link energy before and after adding a k-mer. In our implementation, we update the link energy of the polar set as we add and remove locations to the linked blocks, using the following alternative formula for link energy: L({Ai},S) = 2Acov/(w + 1) − Aele − Aseg. Here, Acov is the number of contexts that contain a k-mer from the polar set, Aele is the number of non-covered locations of selected k-mers, and Aseg is the number of continuous segments of windows that contain a k-mer from the polar set. When adding and removing a location to the linked blocks, the changes to these three values are calculated using linked block primitives in constant time, so we can update the link energy with constant overhead.
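The linked blocks structure and the alternative link energy formula described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' code: the class and method names are hypothetical, "to the left of x" is taken to mean locations in [x − w, x), "to the right" means (x, x + w], and the caller is assumed to respect the invariant that each block holds at most one selected location.

```python
from fractions import Fraction


class LinkedBlocks:
    """Illustrative linked-blocks array: at most one selected location per block."""

    def __init__(self, n, w):
        self.w = w
        self.h = max(1, w // 2)              # block length h = floor(w / 2)
        self.C = [-1] * (n // self.h + 1)    # C[b] = offset j of the selected location, or -1

    def _location(self, b):
        j = self.C[b]
        return None if j == -1 else b * self.h + j

    def add(self, x):
        b, j = divmod(x, self.h)
        if self.C[b] != -1:
            raise ValueError("block already holds a selected location")
        self.C[b] = j

    def remove(self, x):
        b, j = divmod(x, self.h)
        if self.C[b] == j:
            self.C[b] = -1

    def peek_left(self, x):
        """Closest selected location in [x - w, x), or None."""
        lo = max(0, x - self.w)
        # Scan blocks right to left; the first hit is the closest one.
        for b in range((x - 1) // self.h, lo // self.h - 1, -1):
            loc = self._location(b)
            if loc is not None and lo <= loc < x:
                return loc
        return None

    def peek_right(self, x):
        """Closest selected location in (x, x + w], or None."""
        hi = min(x + self.w, len(self.C) * self.h - 1)
        for b in range((x + 1) // self.h, hi // self.h + 1):
            loc = self._location(b)
            if loc is not None and x < loc <= hi:
                return loc
        return None


def link_energy(a_cov, a_ele, a_seg, w):
    """Alternative formula L = 2 * Acov / (w + 1) - Aele - Aseg, as an exact fraction."""
    return Fraction(2 * a_cov, w + 1) - a_ele - a_seg
```

For an isolated k-mer, `link_energy(w + 1, 1, 1, w)` is zero for any w. Exact fractions are used here only to avoid floating-point noise when comparing energies; an implementation updating the three counters incrementally would work equally well with integers.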
As a sanity check, we see that when adding an isolated k-mer, Acov increases by (w + 1) and the other two values increase by 1, resulting in a net link energy gain of zero, consistent with the original definition. We can also compute the link energy of the polar k-mers in the bottom part of Figure 3 using this formula, where Acov = 13, Aele = 3 and Aseg = 1, resulting in a total link energy of 1/3.

2.3.5 Time Complexity Analysis

We now analyze the time complexity of the layered polar sets heuristic, assuming no monotonic rounds for now. Let n be the length of the reference sequence, and assume a constant-sized alphabet. We assume a word of constant size can hold an integer in [0, n], and that accessing an element in an array of length n takes constant time. These conditions hold for genomes and 64-bit machines. This means the primitive operations on linked blocks take constant time, and operations involving the suffix array also take constant time.

Consider a worst-case scenario: by iterating over k-mers that appear exactly w bases away from each other, we iterate over all k-mers in the reference sequence. Assume a k-mer m occurs T times in the reference sequence. In the filtering phase, we first fetch the list of T locations in O(T) time using the suffix array, and we want to determine if there are two elements whose difference is less than (1 − s)w. This can be done using the linked blocks in O(T) time. In the case of layered polar sets, we also want to determine if each of the locations is covered by previous layers, and if it is fewer than (1 − s)w bases away from a location in a previous layer. As we use one linked block for all previous layers, this can be done in O(T) time. The filtering phase thus finishes in O(T) time.

The main algorithm is split into three parts: detecting k-mers that are close to m in the reference sequence, removing those k-mers from the polar set, and adding m to the polar set.
Detecting and listing k-mers that are close to m takes O(T) time, as each location reports at most four collisions, two to the left and two to the right. Removing a k-mer that occurs T′ times takes O(T′) time, but since each k-mer is only added and removed once in one round, this amortizes to O(T) time. Adding m to the polar set also takes O(T) time. The singleton detection step (removing k-mers forming no links) also takes O(T) time for checking if m is a singleton. As each k-mer is only visited once in the main algorithm, and in the worst-case scenario every k-mer in S is visited, we conclude that the layered polar set heuristic runs in ∑ O(T) = O(n) time for each layer, and as a special case the (non-layered) polar set heuristic runs in O(n). The monotonic variant of the heuristic can in theory run in O(n^2) time, but it is not significantly slower in practice.

3 Results

All the experiments are run using the human reference genome hg38. To facilitate the performance comparison across a range of parameter values of w and k, we report the density factor (Marçais et al., 2017) instead of the density. The density factor is the density multiplied by (w + 1). Regardless of the value of w, the random minimizer has an expected density factor of 2 and a perfect minimizer has a density factor of ≈ 1.

3.1 Energy Deficit and Energy Surplus

First, we calculate the average energy deficit D(S)/|S| and average energy surplus X(S)/|S|. The results are in Figure 4A.
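The density factor defined at the start of this section can be checked numerically on a small example. The Python sketch below selects minimizers under a random k-mer order on a synthetic random DNA sequence and reports the density factor. It is an illustration only: the function names are ours, the random order is simulated by assigning each distinct k-mer a pseudo-random rank, a window is taken to contain w consecutive k-mers, and a quadratic-time window scan is used for clarity rather than speed.

```python
import random


def random_minimizer_positions(S, w, k, seed=0):
    """Positions selected by a minimizer with a pseudo-random k-mer order."""
    rng = random.Random(seed)
    n_kmers = len(S) - k + 1
    rank = {}                          # distinct k-mer -> random rank
    ranks = []                         # rank of the k-mer at each position
    for t in range(n_kmers):
        m = S[t:t + k]
        if m not in rank:
            rank[m] = rng.random()
        ranks.append(rank[m])
    selected = set()
    for start in range(n_kmers - w + 1):
        # Smallest rank in the window; ties are broken by the leftmost position.
        best = min(range(start, start + w), key=lambda t: (ranks[t], t))
        selected.add(best)
    return selected


def density_factor(num_selected, num_kmers, w):
    """Density (selected / total k-mers) multiplied by (w + 1)."""
    return num_selected * (w + 1) / num_kmers


if __name__ == "__main__":
    gen = random.Random(42)
    S = "".join(gen.choice("ACGT") for _ in range(20000))
    w, k = 10, 12
    sel = random_minimizer_positions(S, w, k)
    print(density_factor(len(sel), len(S) - k + 1, w))
```

On a random sequence of this kind the printed value should be close to 2, in line with the expected density factor of a random minimizer stated above; the exact figure depends on the seeds.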
[Figure 4 graphic. Panel A: energy surplus and energy deficit (with a zero reference line), measured in density factor, against the value of k from 15 to 25, for w = 10 and w = 100. Panel B: density factor against the value of k from 15 to 25, for w = 10 and w = 100; series: Random Minimizers, Lower Bound, Fixed Interval Sampling, Miniception, Layered Polar Sets.]

Figure 4: Left: Energy surplus and deficit for short (w = 10) and for long (w = 100) windows, computed on the human reference sequence hg38. The difference between the two lines is the difference between the upper and lower bound of Theorem 1. It is very small and the bounds are very good estimates in practice. Right: Density factor for the proposed methods, for short and long windows, computed on hg38. The bottom orange dashed line is the theoretical minimum density (perfect minimizers).

The reference genome is more repetitive than a purely random sequence. However, empirically the energy surplus and deficit are still small, well below 0.01 measured in density factor, implying a relative error of at most 1% when estimating specific density with link energy. Thus, when constructing efficient minimizers by (layered) polar sets, using link energy to estimate specific density is efficient and accurate. For reference, on a random sequence the average energy surplus and deficit are below 10^{−7} in absolute value, for the parameter range we are interested in.

3.2 Evaluating Polar Set Heuristics

We next evaluate our proposed algorithms for layered polar sets. We implemented the algorithm with Python3.
Experiments are run in parallel and the longest ones finish within a day. The peak memory usage stands at 100 GB, which occurs at the start, when loading the precomputed suffix array using Python pickle. We compare our results against some other candidates:
• Random Minimizers. Achieves a density factor of 2 in theory and in practice, as indicated in the last section.
• Lower Bound. This corresponds to the density factor for perfect minimizers. While our theory predicts the existence of perfect minimizers matching the lower bound with large values of k, this rarely happens with practical parameter values.
• Fixed Interval Sampling. This method uses every w-th k-mer from S as the set U to define a compatible minimizer.
• The Miniception (Zheng et al., 2020a), a practical algorithm that provably achieves lower density in many scenarios. The hyperparameter k0 is set to max(5, k − w) for our experiments.

We do not include existing algorithms for constructing compact universal hitting sets because these methods do not scale to values of k > 14. Our heuristics work best when k-mers do not appear too frequently, or roughly speaking, when σ^k > n, where n is the length of the reference sequence. This choice of parameter is common in bioinformatics analysis. With a sequence the size of the human reference genome, our heuristics work well starting at k = 15. Additionally, the Miniception achieves comparable performance with leading UHS-based heuristics, so its performance also serves as a viable proxy.

We consider two scenarios, first with short windows (w = 10) and second with long windows (w = 100). The results are shown in Figure 4B. Our experiments indicate that our simple heuristics yield efficient minimizers, greatly outperforming random minimizers and the Miniception, while maintaining a consistent edge over fixed interval sampling methods, in both the short-window and long-window settings. The improvement is more pronounced when the windows are long. Given that our layered polar set heuristics consist of multiple rounds, in Supplementary Section S5.1 we show the progression of density factors through rounds, demonstrating that the layered heuristics are particularly effective at low values of k.

We next show that in building sequence-specific minimizers using layered polar sets, we do not sacrifice their performance in the general case measured by (expected) density. In Supplementary Section S5.2, we sketch a random sequence using the sequence-specific minimizers we built for hg38. As expected, the performance closely matches that of a random minimizer.

4 Discussion

4.1 Limits and Future of Polar Sets

While the concept of polar sets is interesting and leads to improvements in state-of-the-art sequence-specific minimizer design, we should acknowledge its limitations. First, it cannot be used in designing non-sequence-specific minimizers when w > k. Arguably, this means the method is more tailored for sequence-specific minimizers. See Supplementary Section S4 for proof and more discussion on non-sequence-specific polar sets.

Our experimental results show that the performance of minimizers based on polar sets greatly improves as k grows. When each k-mer appears many times in the reference sequence, it becomes hard to select many k-mers without violating the polar set condition.
For comparison, in Supplementary Section S5.3 we show the results when we apply the heuristics to the human chromosome 1 sequence only, which is about 1/10 as long as the whole human reference genome. Improvements across the board are observed for the heuristic algorithms and the fixed interval sampling methods. The repetitiveness of the human reference genome also makes optimization of specific density much more difficult. In Supplementary Section S5.4, we show the results when we apply the heuristics to build sequence-specific minimizers on a random sequence that is as long as the chr1 sequence. It is significantly easier to reach the theoretical minimum specific density of 1/w in this setup compared to the previous one.

With better computing power and more efficient algorithms, it is desirable to compute an optimal polar set. Thanks to our link energy formulation, the problem of finding an optimal polar set can be formulated as an integer linear program (ILP), with each k-mer being a binary variable. For moderately-sized reference sequences, an optimal polar set can be found. However, no such convenient formulation exists for layered polar sets, and it is an interesting question whether there is a tractable optimization problem for minimizers in general.

4.2 Practicality of Sketches-by-Optimization

Polar sets can be used wherever universal hitting sets are used, in most cases. Given that our heuristics for layered polar sets only produce a small number of layers, implementation of a compatible minimizer with layered polar sets is not fundamentally different from that with a universal hitting set.

The fixed interval sampling method is very similar to previously proposed methods (Khiste and Ilie, 2015; Almutairy and Torng, 2018; Frith et al., 2020), where the sketch of a reference sequence is simply the set of k-mers appearing at locations divisible by w.
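Fixed interval sampling, and a minimizer compatible with the sampled set, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the evaluated implementation: the function names are ours, a window is taken to contain w consecutive k-mers, and k-mers outside the sampled set U are ordered lexicographically here for determinism, where the text uses a random order.

```python
def fixed_interval_kmers(S, w, k):
    """The set U of k-mers starting at locations divisible by w."""
    return {S[t:t + k] for t in range(0, len(S) - k + 1, w)}


def compatible_minimizer_positions(S, w, k, U):
    """Positions selected by a minimizer that ranks k-mers in U before all others."""
    n_kmers = len(S) - k + 1
    # Sort key per position: U membership first, then the k-mer, then the position.
    keys = [(0 if S[t:t + k] in U else 1, S[t:t + k], t) for t in range(n_kmers)]
    selected = set()
    for start in range(n_kmers - w + 1):
        selected.add(min(range(start, start + w), key=keys.__getitem__))
    return selected
```

When every k-mer of S is unique, each window contains exactly one sampled location, so the compatible minimizer selects exactly the locations divisible by w. Repeated k-mers break this one-to-one correspondence, because a sampled k-mer is then also prioritized at its other occurrences.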
Polar sets might not be able to directly replace fixed interval sampling; however, they can be readily expanded into a set of seeds that covers the whole reference sequence.

These approaches are currently relatively underused compared to more traditional minimizers, such as lexicographic or random minimizers or slight variants of either. A significant reason for their unpopularity is that using these methods requires looking up a table of k-mers, be it a set of polar k-mers or universal hitting k-mers, for every k-mer in the query sequence. In contrast, for a random minimizer implemented using a hash function, no lookup is required during the sequence sketch generation process. Since these lookup tables are usually the result of sequence-specific optimization, we say these methods fall into the category of "sketches-by-optimization".

This contrast leads to interesting tradeoffs in efficiency. For example, using a polar-set-compatible minimizer generates a more compact sequence sketch, but might take more time at query compared to using a random minimizer, due to the time spent in loading and querying the set of polar k-mers. We believe better implementation of k-mer lookup tables and better optimization of sequence sketches, possibly in a joint manner, will popularize sketches-by-optimization. Existing methods already take steps towards this goal. Jain et al. (2020b) use a compact lookup table to index frequent k-mers, and Liu et al. (2019) use a Bloom filter to perform approximate queries over fixed interval samples. Techniques like k-mer Bloom filters (Pellow et al., 2017) might also further help the performance.

4.3 Alternative Measurements of Efficiency

Throughout this manuscript our goal has been the optimization of specific density. Low density results in smaller sequence sketches, and for many applications this is desirable. However, depending on the way one uses the sequence sketch, alternative measurements of efficiency may be desirable (also see the discussion in Edgar (2020)). For example, in k-mer counting, minimizers are used to place k-mers into buckets. In this case, the specific density is less relevant, and we are more concerned about the number of buckets and the load balance between different buckets (Marçais et al., 2017; Nyström-Persson et al., 2020). For read mapping, smaller sequence sketches have their own advantages, while some may prefer reducing the number of matches, or reducing the false positive seed matches in general. We believe many of these objectives are correlated with each other, and we are interested in both further exploring the benefits of a small sequence sketch and optimization techniques for alternative measurements of efficiency.

5 Conclusion

Inspired by deficiencies with current theory and practice around sequence-specific minimizers, we propose the concept of polar sets, a new approach to construct sequence-specific minimizers with the ability to directly optimize the specific density of the resulting sequence sketch. We also propose simple and efficient heuristics for constructing (layered) polar sets, and demonstrate via experiments on the human reference genome the superior performance of minimizers constructed by our proposed heuristics. While there are still concerns around the practical utility, we believe the polar set framework will be a valuable asset in the design and analysis of efficient sequence sketches.
Funding

This work has been supported in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Institutes of Health (R01GM122935), and the US National Science Foundation (DBI-1937540). This work was partially funded by The Shurl and Kay Curci Foundation. This project is funded, in part, under a grant (#4100070287) with the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.

Conflict of interests: C.K. is a co-founder of Ocean Genomics, Inc. G.M. is V.P. of software development at Ocean Genomics, Inc.

References

Almutairy, M. and Torng, E. (2018). Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLOS ONE, 13(2), e0189960. Blackburn, S. R. (2015). Non-overlapping codes. IEEE Transactions on Information Theory, 61(9), 4890–4894. Chikhi, R., Limasset, A., and Medvedev, P. (2015). Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12), i201–i208. DeBlasio, D., Gbosibo, F., Kingsford, C., and Marçais, G. (2019). Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB '19, pages 167–176, New York, NY, USA. ACM. Deorowicz, S., Kokot, M., Grabowski, S., and Debudaj-Grabysz, A. (2015). KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10), 1569–1576.
Edgar, R. C. (2020). Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. bioRxiv.
Ekim, B., Berger, B., and Orenstein, Y. (2020). A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. bioRxiv: 2020.01.17.910513.
Erbert, M., Rechner, S., and Müller-Hannemann, M. (2017). Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12(1), 9.
Frith, M. C., Noé, L., and Kucherov, G. (2020). Minimally-overlapping words for sequence similarity search. bioRxiv.
Jain, C., Rhie, A., Hansen, N., Koren, S., and Phillippy, A. M. (2020a). A long read mapping method for highly repetitive reference sequences. bioRxiv, page 2020.11.01.363887.
Jain, C., Rhie, A., Zhang, H., Chu, C., Walenz, B. P., Koren, S., and Phillippy, A. M. (2020b). Weighted minimizer sampling improves long read mapping. Bioinformatics, 36(Supplement 1), i111–i118.
Kempa, D. and Kociumaka, T. (2019). String synchronizing sets: sublinear-time bwt construction and optimal lce data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 756–767.
Khiste, N. and Ilie, L. (2015). E-mem: efficient computation of maximal exact matches for very large genomes. Bioinformatics, 31(4), 509–514.
Levenshtein, V. I. (1970). Maximum number of words in codes without overlaps. Problemy Peredachi Informatsii, 6(4), 88–90.
Li, H. and Birol, I. (2018). Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100.
Liu, Y., Zhang, L. Y., and Li, J. (2019). Fast detection of maximal exact matches via fixed sampling of query k-mers and bloom filtering of index k-mers. Bioinformatics, 35(22), 4560–4567.
Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., and Kingsford, C. (2017). Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33(14), i110–i117.
Marçais, G., DeBlasio, D., and Kingsford, C.
(2018). Asymptotically optimal minimizers schemes. Bioinformatics, 34(13), i13–i22.
Marçais, G., Solomon, B., Patro, R., and Kingsford, C. (2019). Sketching and sublinear data structures in genomics. Annual Review of Biomedical Data Science, 2(1), 93–118.
Mykkeltveit, J. (1972). A proof of Golomb’s conjecture for the de Bruijn graph. Journal of Combinatorial Theory, Series B, 13(1), 40–45.
Nyström-Persson, J. T., Keeble-Gagnère, G., and Zawad, N. (2020). Compact and evenly distributed k-mer binning for genomic sequences. bioRxiv.
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., and Kingsford, C. (2016). Compact universal k-mer hitting sets. In Algorithms in Bioinformatics, Lecture Notes in Computer Science, pages 257–268. Springer, Cham.
Pellow, D., Filippova, D., and Kingsford, C. (2017). Improving bloom filter performance on sequence data using k-mer bloom filters. Journal of Computational Biology, 24(6), 547–557.
Roberts, M., Hunt, B. R., Yorke, J. A., Bolanos, R. A., and Delcher, A. L. (2004a). A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4), 734–752.
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004b). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18), 3363–3369.
Schleimer, S., Wilkerson, D. S., and Aiken, A. (2003). Winnowing: Local Algorithms for Document Fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 76–85. ACM.
Ye, C., Ma, Z. S., Cannon, C.
H., Pop, M., and Yu, D. W. (2012). Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, 13, S1.
Zheng, H., Kingsford, C., and Marçais, G. (2020a). Improved design and analysis of practical minimizers. Bioinformatics, 36(Supplement 1), i119–i127.
Zheng, H., Kingsford, C., and Marçais, G. (2020b). Lower density selection schemes via small universal hitting sets with short remaining path length. arXiv preprint arXiv:2001.06550.
Supplementary Materials
S1 A Technical Lemma on k-mer Repetition
Here we prove a technical lemma on repetitive occurrence of k-mers. Similar versions of this can be found in (Chikhi et al., 2015). Recall that $\sigma$ is the size of the alphabet.
Lemma S7. Given a random sequence and a pair of locations $i < j$, the probability that the k-mer starting at $i$ equals the k-mer starting at $j$ is exactly $\sigma^{-k}$.
Proof. If $j - i \ge k$, the two k-mers do not share bases, so given they are both random k-mers independent of each other, the probability is $\sigma^{-k}$. Otherwise, the two k-mers intersect. We let $d = j - i$, and use $m_i$ to denote the k-mer starting at location $i$. We use $s$ to denote the substring from the start of $m_i$ to the end of $m_j$ with length $k + d$ (or equivalently, the union of $m_i$ and $m_j$). If $m_i = m_j$, the $p$th character of $m_i$ is equal to the $p$th character of $m_j$, meaning $s_p = s_{p+d}$ for all $0 \le p < k$. This further means $s$ is a repeating sequence of period $d$, so $s$ is uniquely determined by its first $d$ characters and there are $\sigma^d$ possible configurations of $s$. The probability that a random $s$ satisfies $m_i = m_j$ is then $\sigma^d / \sigma^{k+d} = \sigma^{-k}$.
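The counting step in the proof of Lemma S7 can be checked by brute force for small parameters. The following Python sketch (the function names are illustrative and not part of the original work) enumerates every string $s$ of length $k + d$ and counts those whose k-mer at offset 0 equals the k-mer at offset $d$. If the lemma holds, the count should be $\sigma^d$ and the probability $\sigma^{-k}$, for overlapping and non-overlapping pairs alike.

```python
from fractions import Fraction
from itertools import product


def count_matching_strings(sigma: int, k: int, d: int) -> int:
    """Count strings s of length k + d, over an alphabet of size sigma,
    whose k-mer at offset 0 equals the k-mer at offset d."""
    matches = 0
    for s in product(range(sigma), repeat=k + d):
        if s[:k] == s[d:d + k]:
            matches += 1
    return matches


def collision_probability(sigma: int, k: int, d: int) -> Fraction:
    """Exact probability that the two k-mers are equal in a uniform random string."""
    return Fraction(count_matching_strings(sigma, k, d), sigma ** (k + d))


if __name__ == "__main__":
    # Illustrative parameters: DNA alphabet, 3-mers, offsets 1 (overlapping) to 4 (disjoint).
    for d in range(1, 5):
        print(d, count_matching_strings(4, 3, d), collision_probability(4, 3, d))
```

Exhaustive enumeration costs $\sigma^{k+d}$ steps, so this is only a sanity check for small $k$ and $d$, not a substitute for the proof.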
S2 Universal Hitting Sets and Related Analyses
Universal hitting sets have been an important component in constructing practical minimizers. In this section, we provide a more formal and technical discussion of universal hitting sets (UHS). In Section S2.1, we formally define UHS and discuss why existing heuristics to construct UHS are not adequate for sequence-specific minimizers. In Section S2.2 and Section S2.3, we discuss the two existing methods to analyze compatible minimizers of UHSes, and show that both approaches have issues that make them unfit for our goal. In Section S2.4 we discuss how UHSes can in fact be treated as special cases of polar sets, which may inspire new developments in this line of research.
S2.1 Definitions and Inelasticity of UHS
Definition S10 (Universal Hitting Sets). Let $U$ be a set of k-mers. If $U$ intersects every run of $w$ consecutive k-mers, it is a UHS over k-mers with path length $w$ and relative size $|U|/\sigma^k$.
A decycling set is a set of k-mers that intersects every sufficiently long string. Any universal hitting set must be a decycling set, so the lower bound on the size of decycling sets applies to all universal hitting sets.
Lemma S8 (Minimal Decycling Sets). Any UHS over k-mers with finite path length has relative size $\Omega(1/k)$.
With a universal hitting set, it is guaranteed that any compatible minimizer will only select k-mers within the UHS on any sequence. Currently, the most popular approach for constructing efficient minimizers is via construction of a compact universal hitting set, followed by constructing a compatible minimizer. These universal hitting sets are usually constructed by expanding from a minimal decycling set (MDS). As we have shown before (Zheng et al., 2020b), the Mykkeltveit MDS (Mykkeltveit, 1972), the MDS that is predominantly used as the starting point, already covers all windows of length $O(k^3)$. Empirically, with larger values of $w$ only a few k-mers need to be added to satisfy the universal hitting condition.
As a result, UHSes constructed for different references look alike, and the compatible minimizers do not specialize well. A related concern about using UHSes on specific sequences is the handling of repetitive k-mers. As we have discussed, repetitive k-mers are prevalent in the human reference genome. Any universal hitting set always contains homomers like AAA···A, as it is required to cover a sequence of all As. This argument also extends to other repetitive k-mers. Such homomers, or repetitive k-mers, would then be preferred when using compatible minimizers for sequence sketching. This problem of prioritizing repetitive k-mers is also present in fixed interval sampling. Meanwhile, existing literature (Li and Birol, 2018; Jain et al., 2020b) suggests it is in fact beneficial not to select these k-mers for read mapping, while proposing different remedies to this issue. Our proposed methods also have the effect of avoiding repetitive k-mers, as these k-mers are unlikely to pass the filtering step.
S2.2 Analysis via Density Upper Bound
There are two existing ways to analyze the density of compatible minimizers. The first is via the following lemma, as we have mentioned in the main text:
Lemma S9. If $U$ is a UHS over k-mers, any compatible minimizer has density at most $|U|/\sigma^k$.
This lemma is universally applicable and it does not depend on the ordering within $U$. However, this is an upper bound which becomes non-informative with $w > 2k$ and sufficiently large $k$.
Because any universal hitting set is at least as large as a minimal decycling set (Lemma S8), and a random $(w,k)$-minimizer achieves density of approximately $2/(w + 1)$, Lemma S9 at best tells us the compatible minimizer is no worse than a random one.
S2.3 Analysis via Probability of Single UHS Contexts
There is a second approach to the analysis of compatible minimizers from universal hitting sets (Marçais et al., 2017). The key lemma reads as follows (slightly rephrased):
Lemma. If $U$ is a UHS over k-mers, let $SP(U)$ be the probability that a context contains only one element in $U$. Under certain assumptions, the expected density of a random minimizer compatible with $U$ is $2(1 - SP(U))/(w + 1)$.
We now show that this lemma depends on assumptions that in turn depend heavily on the structure of $U$. We start with some notation, slightly different from the original paper. Fix a context, and let $m_i$ denote the $i$th k-mer in the context. We also let $z_i = \mathbf{1}(m_i \in U)$, let $H$ denote the event that the context is charged, and let $Z = \sum_{i=0}^{w} z_i$. Let $C(n,k)$ be the binomial coefficients. The proof involves the following equation (we only list the first term; there are four analogous terms):
$$P(H \mid Z = j) = \frac{C(w-1,\, j-1)}{C(w+1,\, j)}\, P(H \mid Z = j,\, z_0 = 1,\, z_w = 0) + \cdots$$
which involves a counting argument: given $Z = \sum_{i=0}^{w} z_i = j$, there are $C(w+1, j)$ different configurations of $z$, and $C(w-1, j-1)$ of them satisfy $z_0 = 1$ and $z_w = 0$. However, by invoking this counting argument, it is implicitly assumed that every configuration satisfying $\sum_i z_i = j$ happens with the same probability, as stated (again, we only keep the terms with $z_0 = 1$ and $z_w = 0$ and hide the rest of the terms):
Assumption. Let $P(z)$ be the probability of generating a random context and observing $z_i = \mathbf{1}(m_i \in U)$. If $\sum_i z_i = \sum_i z'_i$, then $P(z) = P(z')$.
If this is true, we also have $P(z \mid Z = j) = 1/C(w + 1, j)$.
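The counting argument above can be verified by enumeration for small windows. This Python sketch (the helper name is hypothetical) lists all indicator vectors $z = (z_0, \ldots, z_w)$ with $\sum_i z_i = j$ and counts how many satisfy $z_0 = 1$ and $z_w = 0$. It checks only the combinatorial identity; it says nothing about whether the configurations are equally likely, which is precisely the assumption under discussion.

```python
from itertools import product
from math import comb


def count_configurations(w: int, j: int) -> tuple:
    """Return (total, boundary): the number of 0/1 vectors z of length w + 1
    with exactly j ones, and how many of those have z[0] == 1 and z[w] == 0."""
    total = 0
    boundary = 0
    for z in product((0, 1), repeat=w + 1):
        if sum(z) != j:
            continue
        total += 1
        if z[0] == 1 and z[w] == 0:
            boundary += 1
    return total, boundary


if __name__ == "__main__":
    w, j = 6, 3  # illustrative values only
    total, boundary = count_configurations(w, j)
    # Compare against the closed forms C(w + 1, j) and C(w - 1, j - 1).
    print(total == comb(w + 1, j), boundary == comb(w - 1, j - 1))
```

Fixing $z_0 = 1$ and $z_w = 0$ leaves $w - 1$ free positions that must hold the remaining $j - 1$ ones, which is where $C(w-1, j-1)$ comes from.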
We now recover the statement as follows:
$$P(H \mid Z = j,\, z_0 = 1,\, z_w = 0) = \sum_{Z=j,\, z_0=1,\, z_w=0} P(H \mid z)\, P(z \mid Z = j,\, z_0 = 1,\, z_w = 0) = \sum_{Z=j,\, z_0=1,\, z_w=0} \frac{P(H \mid z)}{C(w-1,\, j-1)}$$
$$P(H \mid Z = j) = \sum_{Z=j} P(H \mid z)\, P(z \mid Z = j) = \sum_{Z=j,\, z_0=1,\, z_w=0} P(H \mid z)\, P(z \mid Z = j) + \cdots = \sum_{Z=j,\, z_0=1,\, z_w=0} \frac{P(H \mid z)}{C(w+1,\, j)} + \cdots = \frac{C(w-1,\, j-1)}{C(w+1,\, j)}\, P(H \mid Z = j,\, z_0 = 1,\, z_w = 0) + \cdots$$
The assumption is true in expectation if the UHS itself is a random subset of $\Sigma^k$, which is not the case as that set also has to satisfy the UHS condition. For a general set $U$, the probability that a k-mer is in $U$ is highly dependent on whether the preceding intersecting k-mers are in $U$, and the assumption is likely not valid in most scenarios. Finally, universal hitting sets may be constructed in a specific way to enable better analysis of compatible minimizers, as seen in (Zheng et al., 2020a). We do not discuss these, as they do not apply to other universal hitting sets.
S2.4 UHS as Improper Polar Sets
The alternative formula for link energy, as described in Section 2.3.4, allows us to define the link energy of any subset of k-mers, not just those satisfying the polar set condition. The main theorem for polar sets still holds, but only the upper bound part. Interestingly, if we plug in a universal hitting set, we get $A_{cov} = n$, $A_{ele} = |U|$, $A_{seg} = 1$ and a link energy of $2n/(w + 1) - |U| - 1$, where $n$ is the number of k-mers in the reference sequence and $|U|$ is the total number of times a k-mer in the UHS appears in the reference sequence.
Plugging this into the main polar set theorem, we recover the specific density upper bound $|U|/n$ for universal hitting sets, up to an error of $D(S)/n$. In this sense, universal hitting sets can be seen as a specific and extreme case of an improper polar set.
S3 NP-Completeness of Optimal Polar Set
In this section, we show a reduction from the problem of maximal independent set to the problem of optimal polar set, with an alphabet of size 3. Let $G = (V,E)$ be the instance for maximal independent set, and without loss of generality, let $|V| = 2^d$. We use $\Sigma = \{X, 0, 1\}$ as the alphabet, and for the polar set instance, we let $w = 2d + 1$, $k = d$ and $s = 0$. This means we want to find a subset of d-mers that form many links exactly $2d + 1$ bases apart, while no two d-mers in the polar set can be fewer than $2d + 1$ bases from each other. With $s = 0$, link energy is equivalent to the number of links up to a scaling factor, so we are optimizing the number of links that can be formed. We now construct the query string for the polar set instance, which we divide into three sections.
Disqualification Gadget. Given an arbitrary d-mer $z \in \Sigma^d$, we let the disqualification gadget be the following string:
$$DQ(z) = X^{2d+1}\, z\, X^{d}\, z\, X^{2d+1}$$
In the presence of $DQ(z)$, $z$ cannot appear in the polar set, because it appears twice exactly $2d$ bases apart in the disqualification gadget. The $X^{2d+1}$ section on both ends of the gadget prevents d-mers within the gadget from forming links with adjacent gadgets or sections, as $X^d$ is not in the polar set.
Disqualification Section. We append a disqualification gadget to the query string for every d-mer (there are at most $3^k = n^{1.5}$ of them), except all d-mers containing only 0 and 1.
Vertex Section. For each vertex $v$ in $G$, let $a$ be its binary representation. We add $X^{2d+1}\, a\, X^{d+1}\, a\, X^{2d+1}$ to the query string.
Edge Section. For each edge $(u,v)$ in $G$, let $a, b$ be the binary representations of the two ends. We add $X^{2d+1}\, a\, X^{d}\, b\, X^{2d+1}$ to the query string.
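To make the construction concrete, the three gadget types can be written out as plain string builders. This is a minimal Python sketch under stated assumptions: all function names are hypothetical, vertices are numbered 0 to $2^d - 1$ and encoded as $d$-bit binary strings, and the sections are concatenated in the order in which they are described. It only builds the query string; it does not solve the polar set problem.

```python
from itertools import product


def pad(d: int) -> str:
    """The X^(2d+1) separator placed at both ends of every gadget."""
    return "X" * (2 * d + 1)


def disqualification_gadget(z: str, d: int) -> str:
    """DQ(z): two copies of z whose starts are 2d bases apart (less than w = 2d + 1)."""
    return pad(d) + z + "X" * d + z + pad(d)


def vertex_gadget(a: str, d: int) -> str:
    """Two copies of a whose starts are exactly w = 2d + 1 bases apart (one link)."""
    return pad(d) + a + "X" * (d + 1) + a + pad(d)


def edge_gadget(a: str, b: str, d: int) -> str:
    """a and b placed with starts 2d bases apart, so both cannot be selected."""
    return pad(d) + a + "X" * d + b + pad(d)


def binary_label(v: int, d: int) -> str:
    """d-bit binary representation of vertex v (an assumed encoding)."""
    return format(v, "0{}b".format(d))


def build_query(d: int, edges: list) -> str:
    """Concatenate the disqualification, vertex, and edge sections."""
    parts = []
    for letters in product("X01", repeat=d):
        z = "".join(letters)
        if "X" in z:  # d-mers made only of 0 and 1 are not disqualified
            parts.append(disqualification_gadget(z, d))
    for v in range(2 ** d):
        parts.append(vertex_gadget(binary_label(v, d), d))
    for u, v in edges:
        parts.append(edge_gadget(binary_label(u, d), binary_label(v, d), d))
    return "".join(parts)


if __name__ == "__main__":
    # Illustrative instance: d = 2, four vertices, and the path 0-1-2.
    query = build_query(2, [(0, 1), (1, 2)])
    print(len(query))
```

The spacer lengths carry the whole argument: $d + 1$ separating Xs put the two copies in a vertex gadget exactly $w$ apart, while $d$ separating Xs put the two d-mers in a disqualification or edge gadget one base too close.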
The final query string is formed by the concatenation of the three sections.
Theorem S2. The maximal independent set problem can be solved by solving the optimal polar set problem on the aforementioned query string.
Proof. We claim any polar set of the query string corresponds to an independent set $V'$ of $G$, with $|V'|$ links. All d-mers in the polar set are those representing vertices in $G$, as other d-mers (those containing X) cannot appear due to the disqualification section. For each d-mer in the polar set, we get one link from the vertex section of the query string. If $(u,v) \in E$, the two d-mers representing $u$ and $v$ cannot be selected into the polar set at the same time, because in the edge section these two d-mers are exactly $2d$ bases apart, violating the polar set condition. On the other hand, every independent set of $G$ can be represented by a polar set with $|V'|$ total links, using the same argument. We conclude that the optimal polar set of the query string is a representation of a maximal independent set of $G$, which proves the statement.
This reduction also implies hardness of approximately solving optimal polar sets.
S4 On Non-Sequence-Specific Polar Sets
For the sake of simplicity, in this section we only discuss polar sets with $s = 0$. The discussion for $s > 0$ is highly similar. As we have discussed, the density of a minimizer is the expected specific density over a random sequence. Equivalently, it equals the specific density on the de Bruijn sequence of order at least $w + k$.
Therefore, one may construct polar sets on the de Bruijn sequence of sufficient order, to build non-sequence-specific minimizers. However, this is impossible with long windows:
Lemma S10. No non-trivial polar set exists when $w > k$ and $S$ is the de Bruijn sequence of order $w + k$.
Proof. We simply show no k-mer can be in the set. For every k-mer $m$, the sequence $mm$ exists within $S$, because $S$ is a de Bruijn sequence of order at least $2k$. Picking $m$ violates the condition for a polar set because it appears twice, $k < w$ bases apart, in $S$.
Polar sets exist on de Bruijn sequences of order $w + k$ when $w \le k$. With $w = k$, these polar sets become non-overlapping k-mers (Levenshtein, 1970), that is, sets of k-mers where no proper prefix of a k-mer equals a proper suffix of another k-mer. The problem of finding a large set of non-overlapping k-mers is hard in general, although constructive algorithms exist (Blackburn, 2015) for constant factor approximation. With $w < k$ we obtain minimally-overlapping k-mers, a concept that has also been studied in other contexts (Frith et al., 2020). We believe the concept of non-sequence-specific polar sets is of both practical and theoretical interest.
S5 Supplementary Experiments and Figures
S5.1 Density Factor of Layered Polar Sets by Round
To show that our proposed layered anchor set heuristics are useful, in Figure 5 we plot the density factor after each round of optimization on the human reference genome hg38. All algorithms are run for a total of 7 rounds, with the last two being monotonic rounds. We select 7 to ensure the resulting sets are not too complicated and can be computed in a reasonable amount of time. With more rounds, many of the results can be further improved.
S5.2 Viability of Sequence-Specific Minimizers on Non-target Sequences
To validate that optimization of sequence-specific density does not come at the cost of higher (non-sequence-specific) density, we generate the sequence-specific minimizers for the hg38 reference genome, then apply these minimizers on a random sequence. Figure 6 shows the results. We expect these to perform close to random minimizers when $\sigma^k \gg N$, where $N$ is the length of the reference sequence. In these cases, most k-mers in a random sequence are not seen in the reference sequence, and optimized sequence-specific minimizers behave just like random minimizers in most cases. The performance of the Miniception is almost identical to that on hg38, and is not shown in this plot. The layered polar sets are also arguably more robust at lower values of $k$, as their density stays close to that of a random minimizer.
[Figure 5 plot: estimated density factor versus round number (1 to 7) on hg38, with panels for w = 10 and w = 100; series: Start, Lower Bound, and k = 15 through k = 25.]
Figure 5: Density factor of layered anchor sets after each round of the optimization, corresponding to the experiments shown in Figure 4B.
[Figure 6 plot: density factor versus value of k (15 to 25), with panels for w = 10 and w = 100; series: Random Minimizers, Lower Bound, Fixed Interval Sampling, Layered Polar Sets.]
Figure 6: Performance of sequence-specific minimizers on random sequences (optimized on hg38) with w = 10 (left) and w = 100 (right). This is different from Figure 8: here the specific density is measured on an unrelated random sequence.
S5.3 Experiments on Human Chromosome 1
To show the effect of reference sequence length on the performance of sequence-specific minimizers, in Figure 7 we show the performance plot when we build sequence-specific minimizers for chr1 only. The human chromosome 1 sequence is around 10% of the whole hg38 sequence, and consistent with our theory, the time and memory spent to run these experiments on chr1 are around 10% of those for hg38.
[Figure 7 plot: density factor versus value of k (15 to 25), with panels for w = 10 and w = 100; series: Random Minimizers, Lower Bound, Fixed Interval Sampling, Miniception, Layered Polar Sets.]
Figure 7: Performance of sequence-specific minimizers, optimized and tested on human chromosome 1 with w = 10 (left) and w = 100 (right).
S5.4 Building Sequence-Specific Minimizers on Random Sequences
To further show that the human reference genome is highly repetitive and that construction of efficient sequence-specific minimizers is hard in such a setup, we run the algorithms to generate sequence-specific minimizers on a random sequence of length 230 000 000, similar to that of chromosome 1. Figure 8 shows the performance of layered polar sets and the fixed interval sampling method. Compared with Figure 7, we observe it is much easier to build efficient minimizers on a random sequence, and to match the theoretical lower bound, even though the reference sequences have similar lengths.
[Figure 8 plot: density factor versus value of k (15 to 25), with panels for w = 10 and w = 100; series: Random Minimizers, Lower Bound, Fixed Interval Sampling, Miniception, Layered Polar Sets.]
Figure 8: Performance of sequence-specific minimizers, optimized and tested on a random sequence of length 230 000 000 with w = 10 (left) and w = 100 (right). This is different from Figure 6: here the specific density is measured on the same sequence the minimizers are optimized on.
10_1101-2021_02_08_428881 ---- ACE: Explaining cluster from an adversarial perspective
Yang Young Lu 1, Timothy C. Yu 2, Giancarlo Bonora 1, William Stafford Noble 1 3
Abstract
A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the discovered clusters. A primary drawback to this three-step procedure is that each step is carried out independently, thereby neglecting the effects of the nonlinear embedding and inter-gene dependencies on the selection of marker genes.
Here we propose an integrated deep learning framework, Adversarial Clustering Explanation (ACE), that bundles all three steps into a single workflow. The method thus moves away from the notion of “marker genes” to instead identify a panel of explanatory genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit differences between closely related cell types. Empirically, we demonstrate that ACE is able to identify gene panels that are both highly discriminative and nonredundant, and we demonstrate the applicability of ACE to an image recognition task.
1Department of Genome Sciences, University of Washington, Seattle, WA 2Graduate Program in Molecular and Cellular Biology, University of Washington, Seattle, WA 3Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA. Correspondence to: William Stafford Noble. Preliminary work. Under review.
1. Introduction
Single-cell sequencing technology has enabled the high-throughput interrogation of many aspects of genome biology, including gene expression, DNA methylation, histone modification, chromatin accessibility and genome 3D architecture (Stuart & Satija, 2019). In each of these cases, the resulting high-dimensional data can be represented as a sparse matrix in which rows correspond to cells and columns correspond to features of those cells (gene expression values, methylation events, etc.). Empirical evidence suggests that this data resides on a low-dimensional manifold with latent semantic structure (Welch et al., 2017). Accordingly, identifying groups of cells in terms of their inherent latent semantics and thereafter reasoning about the differences between these groups is an important area of research (Plumb et al., 2020).
In this study, we focus on the analysis of single cell RNA-seq (scRNA-seq) data.
This is the most widely available type of single-cell sequencing data, and its analysis is challenging not only because of the data’s high dimensionality but also due to noise, batch effects, and sparsity (Amodio et al., 2019). The scRNA-seq data itself is represented as a sparse, cell-by-gene matrix, typically with tens to hundreds of thousands of cells and tens of thousands of genes. A common workflow in scRNA-seq analysis (Pliner et al., 2019) consists of three steps: (1) learn a compact representation of the data by projecting the cells to a lower-dimensional space; (2) identify groups of cells that are similar to each other in the low-dimensional representation, typically via clustering; and (3) characterize the differences in gene expression among the groups, with the goal of understanding what biological processes are relevant to each group. Optionally, known “marker genes” may be used to assign cell type labels to the identified cell groups.
A primary drawback to the above three-step procedure is that each step is carried out independently. Here, we propose an integrated, deep learning framework for scRNA-seq analysis, Adversarial Clustering Explanation (ACE), that projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters (Figure 1). At a high level, ACE first “neuralizes” the clustering procedure by reformulating it as a functionally equivalent multi-layer neural network (Kauffmann et al., 2019). In this way, in concatenation with a deep autoencoder that generates the low-dimensional representation, ACE is able to attribute the cell’s group assignments all the way back to the input genes by leveraging gradient-based neural network explanation methods. Next, for each sample, ACE seeks small perturbations of its input gene expression profile that lead the neuralized clustering model to alter the group assignments.
These adversarial perturbations allow ACE to define a concise gene set signature for each cluster or pair of clusters. In particular, ACE attempts to answer the question, “For a given cell cluster, can we identify a subset of genes whose expression profiles are sufficient to identify members of this cluster?” We frame this problem as a ranking task, where thresholding the ranked list yields a set of explanatory genes.
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.08.428881; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
ACE’s joint modeling approach offers several benefits relative to the existing state of the art. First, most existing methods for the third step of the analysis pipeline, identifying genes associated with a given group of cells, treat each gene independently (Love et al., 2014). These approaches ignore the dependencies among genes that are induced by gene networks, and often yield lists of genes that are highly redundant. ACE, in contrast, aims to find a small set of genes that jointly explain a given cluster or pair of clusters. Second, most current methods identify genes associated with a group of cells without considering the nonlinear embedding model which maps the gene expression to the low-dimensional representation where the groups are defined in the first place. To our knowledge, the only exception is the global counterfactual explanation (GCE) algorithm (Plumb et al., 2020), but that algorithm is limited to using a linear transformation. A third advantage of ACE’s integrated approach is its ability to take into account batch effects during the assignment of genes to clusters.
Standard nonlinear embedding methods, such as t-SNE (Van der Maaten & Hinton, 2008) and UMAP (McInnes & Healy, 2018; Becht et al., 2019), cannot take such structure into account and hence may lead to incorrect interpretation of the data (Amodio et al., 2019; Li et al., 2020). To address this problem, deep autoencoders with integrated denoising and batch correction can be used for scRNA-seq analysis (Lopez et al., 2018; Amodio et al., 2019; Li et al., 2020). We demonstrate below that batch effect structure can be usefully incorporated into the ACE model.

A notable feature of ACE’s approach is that, by identifying genes jointly, the method moves away from the notion of a “marker gene” to instead identify a “gene panel”. As such, genes in the panel may not be solely enriched in a single cluster, but may together be predictive of the cluster. In particular, in addition to a ranking of genes, ACE assigns a Boolean to each gene indicating whether its inclusion in the panel is positive or negative, i.e., whether the gene’s expression is enriched or depleted relative to cluster membership.

We have applied ACE to both simulated and real datasets to demonstrate its empirical utility. Our experiments demonstrate that ACE identifies gene panels that are highly discriminative and exhibit low redundancy. We further provide results suggesting that ACE is useful in domains beyond biology, such as image recognition. The Apache licensed source code of ACE (see submitted file) will be made publicly available upon acceptance.

2. Related work

ACE falls into the paradigm of deep neural network interpretation methods, which have been developed primarily in the context of classification problems. These methods can be loosely categorized into three types: feature attribution methods, counterfactual-based methods, and model-agnostic approximation methods.
Feature attribution methods assign an importance score to individual features so that higher scores indicate higher importance to the output prediction (Simonyan et al., 2013; Shrikumar et al., 2017; Lundberg & Lee, 2017). Counterfactual-based methods typically identify the important subregions within an input sample by perturbing the subregions (by adding noise, rescaling (Sundararajan et al., 2017), blurring (Fong & Vedaldi, 2017), or inpainting (Chang et al., 2018)) and measuring the resulting changes in the predictions. Lastly, model-agnostic approximation methods approximate the model being explained by using a simpler surrogate function that is self-explainable, e.g., a sparse linear model (Ribeiro et al., 2016). Recently, some interpretation methods have emerged to understand models beyond classification tasks (Samek et al., 2020; Kauffmann et al., 2020; 2019), including the one we present in this paper for the purpose of cluster explanation.

ACE’s perturbation approach draws inspiration from adversarial machine learning (Xu et al., 2020), where imperceptible perturbations are maliciously crafted to mislead a machine learning model into predicting incorrect outputs. In particular, ACE’s approach is closest to the setting of a “white-box attack,” which assumes complete knowledge of the model, including its parameters, architecture, and gradients (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017; Carlini & Wagner, 2017). In contrast to these methods, ACE re-purposes the malicious adversarial attack for a constructive purpose, identifying sets of genes that explain clusters in scRNA-seq data.

ACE operates in concatenation with a deep autoencoder that generates the low-dimensional representation. In this paper, ACE uses SAUCIE (Amodio et al., 2019), a commonly used scRNA-seq embedding method that incorporates batch correction.
In principle, ACE is generalizable to any off-the-shelf scRNA-seq embedding method, including SLICER (Welch et al., 2016), scVI (Way & Greene, 2018), scANVI (Xu et al., 2021), DESC (Li et al., 2020), and ItClust (Hu et al., 2020).

3. Approach

3.1. Problem setup

We aim to carry out three analysis steps for a given scRNA-seq dataset, producing a low-dimensional representation of each cell’s expression profile, a cluster assignment for each cell, and a concise set of “explanatory genes” for each cluster or pair of clusters. Let $X = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^{n \times p}$ be the normalized gene expression matrix obtained from a scRNA-seq experiment, where rows correspond to $n$ cells and columns correspond to $p$ genes. ACE relies on the following three components: (1) an autoencoder to learn a low-dimensional representation of the scRNA-seq data, (2) a neuralized clustering algorithm to identify groups of cells in the low-dimensional representation, and (3) an adversarial perturbation scheme to explain differences between groups by identifying explanatory gene sets.

[Figure 1 panels: Input: gene expression matrix; Deep autoencoder learns low-dimensional representation; Embedding clustering; Clustering is neuralized and concatenated with the encoder; Differentiation analysis by ACE (perturbation from source group assignment to target group assignment); Output: gene relevance (ranked gene scores).]

Figure 1. ACE workflow. ACE takes as input a single-cell gene expression matrix and learns a low-dimensional representation for each cell. Next, a neuralized version of the k-means algorithm is applied to the learned representation to identify cell groups. Finally, for pairs of groups of interest (either each group compared to its complement, or all pairs of groups), ACE seeks small perturbations of its input gene expression profile that lead the neuralized clustering model to alter the assignment from one group to the other. The workflow employs a combined objective function to induce the nonlinear embedding and clustering jointly. ACE produces as output the learned embedding, the cell group assignments, and a ranked list of explanatory genes for each cell group.

3.2. Learning the low-dimensional representation

Embedding scRNA-seq expression data into a low-dimensional space aims to capture the underlying structure of the data, based upon the assumption that the biological manifold on which cellular expression profiles lie is inherently low-dimensional. Specifically, ACE aims to learn a mapping $f(\cdot) : \mathbb{R}^p \to \mathbb{R}^d$ that transforms the cells from the high-dimensional input space $\mathbb{R}^p$ to a lower-dimensional embedding space $\mathbb{R}^d$, where $d \ll p$. To accurately represent the data in $\mathbb{R}^d$, we use an autoencoder consisting of two components, an encoder $f(\cdot) : \mathbb{R}^p \to \mathbb{R}^d$ and a decoder $g(\cdot) : \mathbb{R}^d \to \mathbb{R}^p$. This autoencoder optimizes the generic loss

\[ \min_{\theta} \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|_2^2 \tag{1} \]

Finally, we denote $Z = (z_1, z_2, \ldots, z_n)^T \in \mathbb{R}^{n \times d}$ as the low-dimensional representation obtained from the encoder, where $z_i = f(x_i) \in \mathbb{R}^d$ is the embedded representation of cell $x_i$.

The autoencoder in ACE can be extended in several important ways. For example, in some settings, Equation 1 is augmented with a task-specific regularizer $\Omega(X)$:

\[ \min_{\theta} \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|_2^2 + \Omega(X). \tag{2} \]

As mentioned in Section 2, the scRNA-seq embedding method used by ACE, SAUCIE, encodes in $\Omega(X)$ a batch correction regularizer by using maximum mean discrepancy. In this paper, ACE uses SAUCIE coupled with a feature selection layer (Abid et al., 2019), with the aim of minimizing redundancy and facilitating selection of diverse explanatory gene sets.

3.3. Neuralizing the clustering step

To carry out clustering in the low-dimensional space learned by the autoencoder, ACE uses a neuralized version of the k-means algorithm. This clustering step aims to partition $Z \in \mathbb{R}^{n \times d}$ into $C$ groups, where each group potentially corresponds to a distinct cell type.

The standard k-means algorithm aims to minimize the following objective function by identifying a set of group centroids $\{ \mu_c \in \mathbb{R}^d : c = 1, 2, \ldots, C \}$:

\[ \min \sum_{i,c} \delta_{ic} \, o_c(z_i) \tag{3} \]

where $\delta_{ic}$ indicates whether cell $z_i$ belongs to group $c$ and the “outlierness” measure $o_c(z_i)$ of cell $z_i$ relative to group $c$ is defined as $o_c(z_i) = \| z_i - \mu_c \|^2$.

Following Kauffmann et al. (2019), we neuralize the k-means algorithm by creating a neural network containing $C$ modules, each with two layers. The architecture is motivated by a soft assignment function that quantifies, for a particular cell $z_i$ and a specified group $c$, the group assignment probability score

\[ p_c(z_i) = \frac{\exp(-\beta \, o_c(z_i))}{\sum_{k} \exp(-\beta \, o_k(z_i))} \tag{4} \]

where the hyperparameter $\beta$ controls the clustering fuzziness. As $\beta$ approaches infinity, Equation 4 approaches the indicator function for the closest centroid and thus reduces to hard clustering.
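As an illustration of Equation 4, the following minimal NumPy sketch computes the soft assignment for a single embedded cell and shows how a large β pushes it toward a hard assignment. The centroid coordinates, the cell, and the function name `soft_assignment` are made-up values for illustration and are not taken from the ACE source code.

```python
import numpy as np

def soft_assignment(z, centroids, beta):
    """Group assignment probabilities p_c(z) of Equation 4 for one embedded cell z."""
    outlierness = np.sum((centroids - z) ** 2, axis=1)   # o_c(z) = ||z - mu_c||^2
    scores = -beta * outlierness
    scores -= scores.max()                               # stabilise the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()

# Toy example (hypothetical numbers): three centroids in a 2-D embedding.
centroids = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
z = np.array([0.5, 0.2])

p_soft = soft_assignment(z, centroids, beta=1.0)     # fuzzy assignment
p_hard = soft_assignment(z, centroids, beta=100.0)   # close to one-hot on the nearest centroid
```

With the small β the probability mass is spread over the groups, whereas with the large β nearly all of it falls on the closest centroid, which is the hard-clustering limit described above.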
To measure the confidence of group assignment, we use a logit function written as

\[ m_c(z_i) = \log\left( \frac{p_c(z_i)}{1 - p_c(z_i)} \right) = \beta \cdot \min^{\beta}_{k \neq c} \left\{ \| z_i - \mu_k \|^2 - \| z_i - \mu_c \|^2 \right\} \tag{5} \]

where $\min^{\beta}_{k \neq c}\{\cdot\} = -\frac{1}{\beta} \log \sum_{k \neq c} \exp(-\beta (\cdot))$ indicates a soft min-pooling layer. (See Kauffmann et al. (2019) for a detailed derivation.) The rationale for using the logit function is that if there is as much confidence supporting the group membership as against it, then the confidence score $m_c(z) = 0$. Additionally, Equation 5 has the following interpretation: the data point $z$ belongs to the group $c$ if and only if the distance to its centroid is smaller than the distance to the centroids of all other competing groups. Equation 5 further decomposes into a two-layer neural network module:

\[ h_{ck}(z_i) = w_{ck}^T z_i + b_{ck}, \qquad m_c(z_i) = \beta \cdot \min^{\beta}_{k \neq c} \left\{ h_{ck}(z_i) \right\} \tag{6} \]

where the first layer is a linear transformation layer with parameters $w_{ck} = 2 \cdot (\mu_c - \mu_k)$ and $b_{ck} = \| \mu_k \|^2 - \| \mu_c \|^2$, and the second layer is the soft min-pooling layer introduced in Equation 5. ACE constructs one such module for each of the $C$ clusters, as illustrated in Figure 1.

3.4. Explaining the groups

ACE’s final step aims to induce, for each cluster identified by the neuralized k-means algorithm, a ranking on genes such that highly ranked genes best explain that cluster. We consider two variants of this task: the one-vs-rest setting compares the group of interest $Z_s = f(X_s) \subseteq Z$ to its complement set $Z_t = f(X_t) \subseteq Z$, where $X_t = X \setminus X_s$; the one-vs-one setting compares one group of interest $Z_s = f(X_s) \subseteq Z$ to a second group of interest $Z_t = f(X_t) \subseteq Z$. In each setting, the goal is to identify the key differences between the source group $X_s \subseteq X$ and the target group $X_t \subseteq X$ in the input space, i.e., in terms of the genes.

We treat this as a neural network explanation problem by finding the minimal perturbation within the group of interest, $x \in X_s$, that alters the group assignment from the source group $s$ to the target group $t$.
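The logit that this perturbation search operates on is the output of the two-layer module in Equation 6. The sketch below, written under the assumption of made-up centroids and a hypothetical function name `neuralized_logit`, builds that module for one group and checks numerically that it agrees with the logit of the soft assignment probability in Equations 4 and 5. It is an illustration, not the authors' implementation.

```python
import numpy as np

def neuralized_logit(z, centroids, c, beta):
    """Confidence m_c(z) computed with the two-layer module of Equation 6."""
    mu_c = centroids[c]
    others = np.delete(centroids, c, axis=0)              # competing centroids, k != c
    # Layer 1: linear maps h_ck(z) = w_ck^T z + b_ck
    w = 2.0 * (mu_c - others)                             # w_ck = 2 (mu_c - mu_k)
    b = np.sum(others ** 2, axis=1) - np.sum(mu_c ** 2)   # b_ck = ||mu_k||^2 - ||mu_c||^2
    h = w @ z + b
    # Layer 2: soft min-pooling, -(1/beta) * log(sum_k exp(-beta * h_k)), computed stably
    shift = h.min()
    soft_min = shift - np.log(np.sum(np.exp(-beta * (h - shift)))) / beta
    return beta * soft_min

def direct_logit(z, centroids, c, beta):
    """log(p_c / (1 - p_c)) computed directly from the probabilities of Equation 4."""
    o = np.sum((centroids - z) ** 2, axis=1)
    p = np.exp(-beta * o) / np.sum(np.exp(-beta * o))
    return np.log(p[c] / (1.0 - p[c]))

# Toy example (hypothetical numbers).
centroids = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
z = np.array([0.5, 0.2])
m0 = neuralized_logit(z, centroids, c=0, beta=1.0)   # positive: z is closest to centroid 0
m1 = neuralized_logit(z, centroids, c=1, beta=1.0)   # negative: another centroid is closer
```

Because the first layer is linear in z and the second is a smooth pooling operation, the module is differentiable, which is what allows gradients to be propagated through the clustering step back to the input genes.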
Specifically, we optimize an objective function that is a mixture of two terms: the first term is the difference between the current sample $x$ and the perturbed sample $\hat{x} = x + \delta$, where $\delta \in \mathbb{R}^p$, and the second term quantifies the difference in group assignments induced by the perturbation. The objective function for the one-vs-one setting is

\[ \min_{\delta} \; \| \delta \|_1 + \lambda \max\bigl( 0, \; \alpha + m_s(x + \delta) - m_t(x + \delta) \bigr) \tag{7} \]

where $\lambda > 0$ is a tradeoff coefficient that encourages a small perturbation of $x$ when small and a stronger alteration toward the target group when large. The second term penalizes the situation where the group logit for the source group $s$ is still larger than that of the target group $t$, up to a pre-specified margin $\alpha > 0$. In this paper we fix $\alpha = 1.0$. The difference between the current sample $x$ and the perturbed sample $\hat{x}$ is measured by the L1 norm to encourage sparsity and non-redundancy. Note that Equation 7 assumes that the input expression matrix is normalized so that a perturbation added to one gene is equivalent to that same perturbation added to a different gene. Analogously, in the one-vs-rest case, the objective function for the optimization is

\[ \min_{\delta} \; \| \delta \|_1 + \lambda \max\bigl( 0, \; \alpha + m_s(x + \delta) - \max_{t \neq s} m_t(x + \delta) \bigr) \tag{8} \]

where the second term penalizes the situation in which the group logit for the source group $s$ is still larger than the logits of all non-source groups, up to the margin $\alpha$.

Finally, with the $\delta \in \mathbb{R}^p$ obtained by optimizing either Equation 7 or Equation 8, ACE quantifies the importance of the $i$th gene relative to a perturbation from source group $s$ to target group $t$ as the absolute value of $\delta_i$, thereby inducing a ranking in which highly ranked genes are more specific to the group of interest.

4. Baseline methods

We compare ACE against six methodologically distinct baseline methods, each of which induces a ranking on genes in terms of group-specific importance, analogous to ACE.
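For reference when reading the baselines, the kind of ranking that ACE itself induces (Equation 7 followed by sorting genes by |δ_i|) can be sketched as follows. Several simplifications here are assumptions made only for illustration: the encoder is replaced by a random linear map `W`, the data are synthetic, and the objective is minimized with a simple proximal subgradient loop. The optimizer used by ACE is not specified in this section, and none of the function names below come from the ACE source code.

```python
import numpy as np

def cluster_logits(z, mu, beta):
    """m_c(z) = log(p_c / (1 - p_c)) for every cluster c (Equation 5)."""
    a = -beta * np.sum((mu - z) ** 2, axis=1)
    m = np.empty(len(mu))
    for c in range(len(mu)):
        rest = np.delete(a, c)
        m[c] = a[c] - (rest.max() + np.log(np.sum(np.exp(rest - rest.max()))))
    return m

def logit_gradient(z, mu, beta, c):
    """Gradient of m_c with respect to z: 2*beta*(mu_c - sum_k pi_k mu_k), k != c."""
    a = np.delete(-beta * np.sum((mu - z) ** 2, axis=1), c)
    pi = np.exp(a - a.max())
    pi /= pi.sum()
    return 2.0 * beta * (mu[c] - pi @ np.delete(mu, c, axis=0))

def objective(delta, x, W, mu, s, t, beta=1.0, lam=10.0, alpha=1.0):
    """Value of Equation 7 with a linear stand-in encoder z = W x."""
    m = cluster_logits(W @ (x + delta), mu, beta)
    return np.abs(delta).sum() + lam * max(0.0, alpha + m[s] - m[t])

def explain_one_vs_one(x, W, mu, s, t, beta=1.0, lam=10.0, alpha=1.0, lr=0.01, steps=500):
    """Proximal subgradient descent on Equation 7; returns the best perturbation seen."""
    delta = np.zeros_like(x)
    best, best_val = delta.copy(), objective(delta, x, W, mu, s, t, beta, lam, alpha)
    for _ in range(steps):
        z = W @ (x + delta)
        m = cluster_logits(z, mu, beta)
        grad = np.zeros_like(x)
        if alpha + m[s] - m[t] > 0:                       # hinge term is active
            grad = lam * W.T @ (logit_gradient(z, mu, beta, s) - logit_gradient(z, mu, beta, t))
        delta = delta - lr * grad
        delta = np.sign(delta) * np.maximum(np.abs(delta) - lr, 0.0)  # soft-threshold (L1 prox)
        val = objective(delta, x, W, mu, s, t, beta, lam, alpha)
        if val < best_val:
            best, best_val = delta.copy(), val
    return best

# Synthetic toy problem: 6 "genes", a 2-D embedding, 3 clusters (all values hypothetical).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))
mu = rng.normal(size=(3, 2))
x = rng.normal(size=6)
s = int(np.argmax(cluster_logits(W @ x, mu, 1.0)))   # current (source) cluster of x
t = (s + 1) % 3                                       # an arbitrary target cluster
delta = explain_one_vs_one(x, W, mu, s, t)
ranking = np.argsort(-np.abs(delta))                  # genes ordered by |delta_i|
```

The soft-thresholding step is what drives many entries of δ to exactly zero, so the resulting ranking concentrates on a small number of genes, in line with the sparsity motivation given for the L1 norm.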
DESeq2 (Love et al., 2014) is a representative statistical hypothesis testing method that tests for differential gene expression based on a negative binomial model. The main caveat of DESeq2 is that it treats each gene as independent.

The Jensen-Shannon Distance (JSD) (Cabili et al., 2011) is a representative distribution distance-based method which quantifies the specificity of a gene to a cell group. Similar to DESeq2, JSD considers each gene independently.

Global counterfactual explanation (GCE) (Plumb et al., 2020) is a compressed sensing method that aims to identify consistent differences among all pairs of groups. Unlike ACE, GCE requires a linear embedding of the scRNA-seq data.

The gene relevance score (GRS) (Angerer et al., 2020) is a gradient-based explanation method that aims to attribute a low-dimensional embedding back to the genes. The main limitations of GRS are two-fold. First, the embedding used in GRS is constrained to be a diffusion map, which is chosen specifically to make the gradient easy to calculate. Second, taking the gradient with respect to the embedding only indirectly measures the group differentiation compared to taking the gradient with respect to the group difference directly, as in ACE.

SmoothGrad (Smilkov et al., 2017) and SHAP (Lundberg & Lee, 2017), which are designed primarily for classification problems, are two representative feature attribution methods. Each one computes an importance score that indicates each gene’s contribution to the clustering assignment. SmoothGrad relies on knowledge of the model, whereas SHAP does not.

5. Results

5.1. Performance on simulated data

To compare ACE to each of the baseline methods, we used a recently reported simulation method, SymSim (Zhang et al., 2019), to generate two synthetic scRNA-seq datasets: one “clean” dataset and one “complex” dataset. In both cases, we simulated many redundant genes, in order to adequately challenge methods that aim to detect a minimal set of informative genes.

The simulation of the clean dataset uses a protocol similar to that of Plumb et al. (2020). We first used SymSim to generate a background matrix containing simulated counts from 500 cells, 2000 genes, and five distinct clusters. We then used this background matrix to construct our simulated dataset of 500 cells by 220 genes. The simulated data comprises three sets of genes: 20 causal genes, 100 dependent genes, and 100 noise genes. To select the causal genes, we identified all genes that are differentially expressed by SymSim’s criteria (nDiff-EVFgene > 0 and |log2 fold-change| > 0.8) between at least one pair of clusters, and we selected the 20 genes that exhibit the largest average fold-change across all pairs of clusters in which the gene was differentially expressed. A UMAP embedding on these causal genes alone confirms that they are jointly capable of separating cells into their respective clusters (Fig. 2A). Next, we simulated 100 dependent genes, which are weighted sums of 1–10 randomly selected causal genes, with added Gaussian noise. As such, a dependent gene is highly correlated with a causal gene or with a linear combination of multiple causal genes. The weights were sampled from a continuous uniform distribution, U(0.01, 0.8), and the Gaussian noise was sampled from N(0, 1). As expected, the dependent genes are also jointly capable of separating cells into their respective clusters (Fig. 2A). Lastly, we found all genes that were not differentially expressed between any cluster pair in the ground truth, and we randomly sampled 100 noise genes.
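The construction of the dependent genes described above can be sketched in a few lines of NumPy. This is an illustrative reading of the protocol, not the authors' script: the causal matrix is a random placeholder rather than SymSim output, parents are drawn without replacement (the text says only "randomly selected"), and the function name and seeds are hypothetical.

```python
import numpy as np

def simulate_dependent_genes(causal, n_dependent=100, max_parents=10, seed=0):
    """Build dependent genes as noisy weighted sums of randomly chosen causal genes."""
    rng = np.random.default_rng(seed)
    n_cells, n_causal = causal.shape
    dependent = np.empty((n_cells, n_dependent))
    for j in range(n_dependent):
        # Choose between 1 and max_parents causal genes (inclusive).
        n_parents = rng.integers(1, min(max_parents, n_causal) + 1)
        parents = rng.choice(n_causal, size=n_parents, replace=False)
        weights = rng.uniform(0.01, 0.8, size=n_parents)      # U(0.01, 0.8)
        noise = rng.normal(0.0, 1.0, size=n_cells)            # N(0, 1)
        dependent[:, j] = causal[:, parents] @ weights + noise
    return dependent

# Placeholder standing in for the 500 cells x 20 causal genes selected from SymSim output.
rng = np.random.default_rng(1)
causal = rng.poisson(5.0, size=(500, 20)).astype(float)
dependent = simulate_dependent_genes(causal)
```

Because every dependent gene is a linear combination of causal genes plus noise, the resulting 100 columns carry largely redundant cluster information, which is what makes the dataset a test of whether a method can recover a small, non-redundant gene set.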
These genes provide no explanation of the clustering structure (Fig. 2A).

To simulate the complex dataset, we used SymSim to add dropout events and batch effects to the background matrix generated previously. We then selected the same exact causal and noise genes as in the clean dataset, and used the same exact random combinations and weights to generate the dependent genes. Thus, the clean and complex datasets contain the same 220 genes; however, the complex dataset enables us to gauge how robust ACE is to artifacts of technical noise observed in real single-cell RNA-seq datasets (Fig. 2B).

To compare the different gene ranking methods, we need to specify the ground truth cluster labels and a performance measure. We observe that the embedding representation learned by ACE exhibits clear cluster patterns even in the presence of dropout events and batch effects, and thus ACE’s k-means clustering is able to recover these clusters (Appendix Figure A.1). Accordingly, to compare different methods for inducing gene rankings, we provide ACE and each baseline method with the ground truth clustering labels from the original study (Zheng et al., 2017). ACE then calculates the group centroid used in Equation 3 by averaging the data points of the corresponding ground truth cluster. The embedding layer together with the group centroids are then used to build the neuralized clustering model (Equation 6). Each method produces gene rankings for every cluster in a one-vs-rest fashion.

To measure how well a gene ranking captures clustering structure, we use the Jaccard distance to compare a cell’s k nearest neighbors (k-NN) computed from a subset of top-ranked genes with the cell’s k-NN computed from all genes. To compute the k-NN, we use the Euclidean distance metric. The Jaccard distance is defined as

\[ JD(i) = 1 - \frac{| S_{\mathrm{full}} \cap S_{\mathrm{sub}} |}{| S_{\mathrm{full}} \cup S_{\mathrm{sub}} |} \tag{9} \]

where $S_{\mathrm{full}}$ represents cell $i$’s k-NN set when using all genes, and $S_{\mathrm{sub}}$ represents cell $i$’s k-NN set when using a subset of top-ranked genes. If the subset of top-ranked genes does a good job of explaining a cluster of cells, then $S_{\mathrm{full}} \cap S_{\mathrm{sub}}$ and $S_{\mathrm{full}} \cup S_{\mathrm{sub}}$ should be nearly equal, and the Jaccard distance should approach 0. We select the gene ranking used to derive a subset of top-ranked genes based on the cell cluster assignment. For example, if the cell belongs in cluster 2, we use the cluster 2 vs. rest gene ranking. Thus, to obtain a global measure of how well a clustering structure is captured on a subset of top-ranked genes, we report the mean Jaccard distance across all cells.

Figure 2. Comparing ACE to baseline methods on simulated scRNA-seq datasets. Each dataset consists of 20 causal genes, 100 dependent genes, and 100 noise genes. (A) UMAP embeddings of cells composing the clean dataset. Panels correspond to embeddings using the three subsets of genes (causal, dependent, and noise), as well as all of the genes together. (B) Same as panel A, but for the complex dataset. (C) Comparison of methods via Jaccard distance as a function of the number of genes in the ranking. ACE performs substantially better than each of the baseline methods on the clean dataset. The gray dashed line indicates the mean Jaccard distance achieved by the 20 causal genes alone. (D) Same as panel C but for the complex dataset.

Our analysis shows that ACE considerably outperforms each of the baseline methods on the clean dataset, indicating that it is superior at identifying the minimal set of informative genes (Fig. 2C).
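The mean Jaccard distance of Equation 9, on which this comparison rests, can be sketched as follows. The sketch makes several simplifying assumptions that are not stated in the text: a single gene subset is used for all cells (the paper selects the ranking according to each cell's cluster), a cell is excluded from its own neighbor set, and the value k = 10 and the function names are hypothetical.

```python
import numpy as np

def knn_sets(X, k):
    """Return, for every row of X, the index set of its k nearest rows (Euclidean)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def mean_jaccard_distance(X, gene_subset, k=10):
    """Mean over cells of JD(i) = 1 - |S_full & S_sub| / |S_full | S_sub| (Equation 9)."""
    full = knn_sets(X, k)
    sub = knn_sets(X[:, gene_subset], k)
    jd = [1.0 - len(a & b) / len(a | b) for a, b in zip(full, sub)]
    return float(np.mean(jd))

# Synthetic example: 50 cells by 8 genes of random values.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
jd_all = mean_jaccard_distance(X, np.arange(8), k=5)      # all genes: neighbour sets coincide
jd_two = mean_jaccard_distance(X, np.array([0, 1]), k=5)  # two genes only
```

Using every gene reproduces the full-data neighbor sets exactly, so the distance is 0; an uninformative subset scrambles the neighborhoods and pushes the distance toward 1.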
Notably, ACE achieves a lower mean Jaccard distance than the 20 causal genes alone while using fewer than 20 genes, suggesting that the method successfully identifies dependent genes that are more informative than individual causal genes. ACE also performs strongly on the complex dataset, though it appears to perform on par with SmoothGrad and SHAP (Fig. 2D). Notably, these three methods (ACE, SHAP, and SmoothGrad) share a common feature: each employs the SAUCIE framework, which facilitates automatic batch effect correction. This highlights the utility of DNN-based dimensionality reduction and interpretation methods for single-cell RNA-seq applications.

5.2. Real data analysis

We next applied ACE to a real dataset of peripheral blood mononuclear cells (PBMCs) (Zheng et al., 2017), represented as a cell-by-gene log-normalized expression matrix containing 2638 cells and 1838 highly variable genes. The cells in the dataset were previously categorized into eight cell types, obtained by performing Louvain clustering (Blondel et al., 2008) and annotating each cluster on the basis of differentially expressed marker genes. As shown in Figure 3A and Appendix Figure A.2, ACE’s k-means clustering successfully recovers the reported cell types based upon the 10-dimensional embedding learned by SAUCIE.

We first aimed to quantify the discriminative power of the top-ranked genes identified by ACE in comparison to the six baseline methods. To do this, we applied all six baseline methods to the PBMC dataset using the groups identified by the k-means clustering based on the SAUCIE embedding. For each group of cells, we extracted the top-k group-specific genes reported by each method, where k ranges over 1%, 2%, ..., 100% of all genes. Given the selected gene subset, we then trained a support vector machine (SVM) classifier with a radial basis function kernel to separate the target group from the remaining groups. The SVM training involves two hyperparameters, the regularization coefficient C and the bandwidth parameter σ. The σ parameter is adaptively chosen so that the training data is Z-score normalized, using the default settings in Scikit-learn (Pedregosa et al., 2011). The C parameter is selected by grid search from $\{5^{-5}, 5^{-4}, \ldots, 5^{0}, \ldots, 5^{4}, 5^{5}\}$. The classification performance, in terms of area under the receiver operating characteristic curve (AUROC), is evaluated by 3-fold stratified cross-validation, and an additional 3-fold cross-validation is applied within each training split to determine the optimal C hyperparameter. Finally, AUROC scores from each test split across different target groups are aggregated and reported, in terms of the mean and the standard error of the mean. Two cell types (megakaryocytes and dendritic cells) are excluded due to insufficient sample size (< 50).

[Figure 3 panels: (A) UMAP1 vs. UMAP2, colored by cell type (B cells, CD14+ Monocytes, CD4 T cells, CD8 T cells, Dendritic Cells, FCGR3A+ Monocytes, Megakaryocytes, NK cells); (B) AUROC vs. number of genes included, from most to least important; (C) Pearson correlation vs. number of genes included, from most to least important; (D) top 1% genes, the intersection among different methods, with important genes specific to CD4 T cells.]

Figure 3. Comparing ACE to baseline methods on PBMC dataset. (A) UMAP embedding of PBMC cells labelled by ACE’s k-means clustering assignment. (B) Classification performance of each method, as measured by AUROC, as a function of the number of genes in the set. Error bars correspond to the standard error of the mean of AUROC scores from each test split across different target groups. (C) Redundancy among the top k genes, as measured by Pearson correlation, as a function of k. Error bars correspond to the standard error of the mean calculated from the group-specific correlations. (D) Overlaps among the top 18 genes (corresponding to 1% of 1838 genes) identified by all seven methods with respect to the CD4 T cell cluster.

As shown in Figure 3B, the top-ranked genes reported by ACE are among the most discriminative across all methods, particularly when the inclusion size is small (≤ 3%). The only method that yields superior performance is DESeq2.

We next tested the redundancy of top-ranked genes, as it is desirable to identify diverse explanatory gene sets with minimum redundancy. Specifically, for each target group of cells, we calculate the Pearson correlations between all gene pairs within the top k genes, for varying values of k. The mean and standard error of the mean of these correlations are computed within each group and then averaged across different target groups. The results of this analysis (Figure 3C) suggest that the top-ranked genes reported by ACE are among the least redundant across all methods. Other methods that exhibit low redundancy include GRS and the two methods that use the same SAUCIE model (i.e., SmoothGrad and SHAP). In conjunction with the discriminative power analysis in Figure 3B, we conclude that ACE achieves a powerful combination of high discriminative power and low redundancy.

Finally, to better understand how these methods differ from one another, we investigated the consistency among the top-ranked genes reported by each method. For this analysis, we focused on one particular group, CD4 T cells. We discover strong disagreement among the methods (Figure 3D).
Surprisingly, no single gene is selected among the top 1% by all methods. Among all methods, ACE covers the most genes that are reported by at least one other method (14 out of 18 genes). The four genes that ACE uniquely identifies (red bar in Figure 3D), namely CCL5, GZMK, SPOCD1, and SNRNP27, are depleted rather than enriched relative to other cell types. It is worth mentioning that both CCL5 and GZMK are enriched in CD8 T cells (Thul et al., 2017), the closest cell type to CD4 T cells (Figure 3A). This observation suggests that ACE identifies genes that exhibit highly discriminative changes in expression between two closely related cell types. Indeed, among ACE’s 18-gene panel, 15 genes are depleted rather than enriched, suggesting that much of the CD4 T cell identity may be due to inhibition rather than activation of specific genes. In summary, ACE is able to move away from the notion of a “marker gene” to instead identify a highly discriminative, nonredundant gene panel.

5.3. Image analysis

Although we developed ACE for application to scRNA-seq data, we hypothesized that the method would be useful in domains beyond biology. Explanation methods are potentially useful, for example, in the analysis of biomedical images, where the explanations can identify regions of the image responsible for assignment of the image to a particular phenotypic category. As a proof of principle for this general domain, we applied ACE to the MNIST handwritten digits dataset (LeCun, 1998), with the aim of studying whether ACE can identify which pixels in a given image explain why the image was assigned to one digit versus another. Specifically, we solve the optimization problem in Equation 7 for each input image, seeking an image-specific set of pixel modifications, subject to the constraint that the perturbed image pixel values are restricted to lie in the range [0, 1].
Note that this task is somewhat different from the scRNA-seq case: in the MNIST case, ACE finds a different set of explanatory pixels for each image, whereas in the scRNA-seq case, ACE seeks a single set of genes that explains label differences across all cells in the dataset.

[Figure 4 panels: digit transitions shown are 0→8, 1→4, 1→7, 2→3, 3→8, 4→5, 4→9, 5→0, 5→6, 5→8, 6→0, 7→0, 7→1, 7→2, 7→9, 8→0, 8→3, 8→5, 9→0, 9→8; the color scale gives the perturbation range from -1 to +1, overlaid on the pixels in the initial digit.]

Figure 4. Applying ACE to the MNIST dataset. ACE is able to explain 20 types of digit transitions in a pixel-wise manner. These digit transitions are chosen such that each digit category is covered at least once in both directions.

ACE was applied to this dataset as follows. We used a simple convolutional neural network architecture containing two convolution layers, each with a modest filter size (5, 5), a modest number of filters (32) and ReLU activation, followed by a max pooling layer with a pool size (2, 2), a fully connected layer, and a softmax layer. The model was trained on the MNIST training set (60,000 examples) for 10 epochs, using Adam (Kingma & Ba, 2015) with an initial learning rate of 0.001. The network achieves 98.7% classification accuracy on the test set of 10,000 images. We observe that the embedding representation in the last pooling layer exhibits well-separated cluster patterns (Appendix Figure A.3). Since our goal is not to learn the cluster structure per se, for simplicity, we fixed the number of groups to be the number of digit categories (i.e., 10) and calculated the group centroid used in Equation 3 by averaging the data points of the corresponding category.
The embedding layer together with the group centroids are then used to build the neuralized clustering model (Equation 6).

The results of this analysis show that ACE does a good job of identifying sets of pixels that accurately explain differences between pairs of digits. We examined the pixel-wise explanations of 20 pairs of digits, randomly selected to cover each digit category at least once in both directions (Fig. 4). For example, to convert “8” to “5,” ACE disconnects the top right and bottom left of “8,” as expected. Similarly, to convert “8” to “3,” ACE disconnects the top left and bottom left of “8.” It is worth noting that the modifications introduced by ACE are inherently symmetric. For example, to convert “1” to “7” and back again, ACE suggests adding and removing the same part of “7.”

6. Discussion and conclusion

In this work, we have proposed a deep learning-based scRNA-seq analysis pipeline, ACE, that projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters. Compared to existing state-of-the-art methods, ACE jointly takes into consideration both the nonlinear embedding of cells to a low-dimensional representation and the intrinsic dependencies among genes. As such, the method moves away from the notion of a “marker gene” to instead identify a panel of genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit important differences between closely related cell types. Our experiments demonstrate that ACE identifies gene panels that are highly discriminative and exhibit low redundancy. We also provide results suggesting that ACE’s approach may be useful in domains beyond biology, such as image recognition.

This work points to several promising directions for future research.
In principle, ACE can be used in conjunction with any off-the-shelf scRNA-seq embedding method. Thus, empirical investigation of the utility of generalizing ACE to use embedders other than SAUCIE would be interesting. Another possible extension is to apply neuralization to alternative clustering algorithms. For example, in the context of scRNA-seq analysis the Louvain algorithm (Blondel et al., 2008) is commonly used and may be a good candidate for neuralization. A promising direction for future work is to provide confidence estimation for the top-ranked group-specific genes, in terms of q-values (Storey, 2003), with the help of the recently proposed knockoffs framework (Barber & Candès, 2015; Lu et al., 2018).

References

Abid, A., Balin, M. F., and Zou, J. Concrete autoencoders for differentiable feature selection and reconstruction. International Conference on Machine Learning, 2019.
Amodio, M., Dijk, D. V., Srinivasan, K., Chen, W. S., Mohsen, H., Moon, K. R., Campbell, A., Zhao, Y., Wang, X., Venkataswamy, M., and Krishnaswamy, S. Exploring single-cell data with deep multitasking neural networks. Nature Methods, pp. 1–7, 2019.
Angerer, P., Fischer, D. S., Theis, F. J., Scialdone, A., and Marr, C. Automatic identification of relevant genes from low-dimensional embeddings of single cell RNA-seq data. Bioinformatics, 2020.
Barber, R. F. and Candès, E. J. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
Becht, E., McInnes, L., Healy, J., Dutertre, C., Kwok, I. W. H., Ng, L. G., Ginhoux, F., and Newell, E. W. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, 2019.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
Cabili, M. N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., and Rinn, J. L. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev, 25(18):1915–1927, 2011.
Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
Chang, C., Creager, E., Goldenberg, A., and Duvenaud, D. Explaining image classifiers by counterfactual generation. arXiv preprint arXiv:1807.08024, 2018.
Fong, R. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437, 2017.
Hu, J., Li, X., Hu, G., Lyu, Y., Susztak, K., and Li, M. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nature Machine Intelligence, 2(10):607–618, 2020.
Kauffmann, J., Esders, M., Montavon, G., Samek, W., and Müller, K. From clustering to cluster explanations via neural networks. arXiv preprint arXiv:1906.07633, 2019.
Kauffmann, J., Müller, K., and Montavon, G. Towards explaining anomalies: a deep Taylor decomposition of one-class models. Pattern Recognition, 101:107198, 2020.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Li, X., Wang, K., Lyu, Y., Pan, H., Zhang, J., Stambolian, D., Susztak, K., Reilly, M. P., Hu, G., and Li, M. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nature Communications, 11(1):1–14, 2020.
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053–1058, 2018.
Love, M., Huber, W., and Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(550), 2014.
Lu, Y. Y., Fan, Y., Lv, J., and Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, 2018.
Lundberg, S. and Lee, S. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 2017.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
McInnes, L. and Healy, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv, 2018.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Pliner, H. A., Shendure, J., and Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nature Methods, 16(10):983–986, 2019.
Plumb, G., Terhorst, J., Sankararaman, S., and Talwalkar, A. Explaining groups of points in low-dimensional representations. ICML, 2020.
Ribeiro, M., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144, New York, NY, USA, 2016. ACM.
Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., and Müller, K. R. Toward interpretable machine learning: Transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631, 2020.
Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Storey, J. D. The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013–2035, 2003.
Stuart, T. and Satija, R. Integrative single-cell analysis. Nature Reviews Genetics, 20:252–272, 2019.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
Thul, P., Åkesson, L., Wiking, M., Mahdessian, D., Geladaki, A., Blal, H., Alm, T., Asplund, A., Björk, L., Breckels, L., et al. A subcellular map of the human proteome. Science, 356(6340), 2017.
Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
Way, G. and Greene, C. Bayesian deep learning for single-cell analysis. Nature Methods, 15(12):1009–1010, 2018.
Welch, J., Hartemink, A., and Prins, J. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biology, 17(1):1–15, 2016.
Welch, J. D., Hartemink, A. J., and Prins, J. F. MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biology, 18(1):138, 2017.
Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M., and Yosef, N. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Molecular Systems Biology, 17(1):e9620, 2021.
Xu, H., Ma, Y., Liu, D., Liu, H., Tang, J., and Jain, A. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020.
Zhang, X., Xu, C., and Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nature Communications, 10(1):1–16, 2019.
Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W., Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson, N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 2017.
Figure A.3. The embedding representation in the last pooling layer of the convolutional neural network exhibits well-separated cluster patterns among 10 digits on the MNIST dataset. (Axes: UMAP1 and UMAP2; points are colored by digit label, 0 to 9.)

Figure A.1. The embedding representation learned by SAUCIE exhibits well-separated cluster patterns on both (A) clean and (B) complex simulated scRNA-seq datasets. (Axes: UMAP1 and UMAP2; points are colored by group, 1 to 5.)

Figure A.2. The embedding representation learned by SAUCIE exhibits similar cluster patterns by using either (A) the Louvain algorithm or (B) k-means clustering on the PBMC dataset.

10_1101-2021_02_08_430070 ---- On the application of BERT models for nanopore methylation detection

Genome Analysis

Yao-zhong Zhang 1,∗, Sera Hatakeyama 1, Kiyoshi Yamaguchi 1, Yoichi Furukawa 1, Satoru Miyano 2, Rui Yamaguchi 3, and Seiya Imoto 1,∗
1 Institute of Medical Science, the University of Tokyo, Tokyo, 108-0071, Japan
2 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, 101-0062, Japan
3 Aichi Cancer Center Research Institute, Nagoya, 464-8681, Japan
∗ To whom correspondence should be addressed.

Abstract

Motivation: DNA methylation is a common epigenetic modification, which is widely associated with various biological processes, such as gene expression, aging, and disease. Nanopore sequencing provides a promising methylation detection approach by monitoring abnormal signal shifts to detect modified bases in target motif regions. Recently, model-based approaches, especially those with deep learning models, have achieved significant performance improvements on nanopore methylation detection. In this work, we explore using bidirectional encoder representations from transformers (BERT) for this task, which provides non-recurrent neural structures for fast parallel computation.

Results: We find that the original BERT architecture does not work as well as the bidirectional recurrent neural network (biRNN) on the nanopore methylation prediction task. Through further analysis, we observe recurrent patterns of positional signal shift in the context window surrounding target 5-methylcytosine (5mC) and N6-methyladenine (6mA) motifs. We propose a refined BERT with relative position representation and center hidden units concatenation, which incorporates these task-specific characteristics into the model.
We perform systematic in-sample and cross-sample evaluations. The experimental results show that the refined BERT model can achieve competitive or even better results than the state-of-the-art biRNN model, while the model inference speed is about 6x faster. In addition, in the cross-sample evaluation on datasets from different research groups, BERT models demonstrate good generalization performance.

Availability: The source code and data are available at https://github.com/yaozhong/methBERT

Contact: yaozhong@ims.u-tokyo.ac.jp

1 Introduction

Methylation of DNA/RNA/histone is commonly observed in developmental disorders, aging, and genomic disease, such as cancer. Fast and accurate detection of methylation status is a fundamental requirement for finding distinctive biomarkers for aging/disease profiling. For a virome/metagenome study, quick and accurate epi-transcriptome detection also plays an important role in understanding unseen strains (Kim et al., 2020).

One commonly used DNA methylation detection approach is Whole-Genome Bisulfite Sequencing (WGBS). To detect modified bases, WGBS first applies sodium bisulfite conversion before sequencing. As the bisulfite conversion is a relatively harsh chemical process, it fragments the DNA, and a large amount of DNA is usually required. Also, because read length is limited, it is difficult to align short reads in low-complexity regions and to analyze methylation patterns over long ranges. The data processing of WGBS is sophisticated and time-consuming. Various biases (e.g. GC and fragment length), including those introduced by bisulfite treatment, have to be dealt with in the data analysis. WGBS can only be used for DNA samples, so it cannot be applied to detect RNA methylation.

Single-molecule sequencing (e.g., PacBio and Nanopore) provides a promising approach through detecting abnormal signals in target motif regions, as modified bases usually have different current signals.
Compared with the sodium bisulfite approach, no extra chemical treatment is required, which helps to reduce potential biases. Existing nanopore methylation detection methods can be categorized into two types. One is testing-based (e.g., Tombo (Stoiber et al., 2016)), the other is model-based (e.g., nanopolish (Simpson et al., 2017), DeepMod (Liu et al., 2019) and DeepSignal (Ni et al., 2019)). A testing-based approach performs a statistical test on paired signals (candidate and reference) and does not require any training process. Also, it can be applied to any chemical modification.

This preprint (bioRxiv, https://doi.org/10.1101/2021.02.08.430070; this version posted February 10, 2021) was not certified by peer review. The copyright holder is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1: Basic BERT’s and refined BERT’s model structure used for methylation detection. (a) Basic BERT for methylation detection: the inputs x_1, ..., x_n from the DNA sequence (5’ to 3’) pass through an embedding layer, stacked attention and feed-forward layers, and a linear layer that outputs the methylation status. (b) Refined BERT with relative position representation: attention is restricted by a relative position constraint window from x_{i-k} to x_{i+k}, with edge parameters W^V_{-k}, W^K_{-k}, ..., W^V_k, W^K_k, and the hidden units around the center are concatenated before a linear (tanh) layer that outputs the methylation status. Compared with the basic BERT, enhanced constraints and additional edges are highlighted in red color.

A model-based approach trains a model
on known chemical modifications and predicts whether a signal sequence contains methylation signals or not. Sequential models, such as the hidden Markov model (HMM) and the bidirectional recurrent neural network (biRNN), are commonly used in the model-based approach. Although model-based approaches have already achieved competitive results, their sequential computational order makes them difficult to parallelize for fast inference. Meanwhile, finding discriminative signal patterns for identifying methylated signals is also important for developing novel detection algorithms.

In this work, based on the bidirectional encoder representations from transformers (BERT), we explore a non-recurrent modeling approach for nanopore methylation detection. Through analyzing nucleotide sequences with both methylated and unmethylated signals, we profile the positional signal-shift for different motifs and methyltransferases. We find that the ±3bp region surrounding the center methylation candidate shows significant signal-shifts. Different methylation types, such as 5-methylcytosine (5mC) and N6-methyladenine (6mA), also demonstrate different signal-shift patterns. We hence propose a refined BERT model that takes these signal-shift patterns into account. We evaluate the proposed methods on publicly available benchmark datasets. In both in-sample and cross-sample evaluation, the proposed refined BERT model achieves competitive or even better results than the state-of-the-art biRNN model, while its model inference speed is about 6x faster. In the cross-sample evaluation, BERT models also demonstrate their transfer learning ability across different datasets.

2 Methods

In this section, we introduce BERT (Devlin et al., 2018) and the refined BERT applied to nanopore methylation detection. BERT is built on the Transformer (Vaswani et al., 2017), which employs self-attention as the core module in its stacked network structure.
The Transformer was proposed to replace recurrent and convolution operations with purely attention-based mechanisms. A typical transformer network consists of an encoding and a decoding module. BERT uses only the encoding module of a typical transformer, which is pre-trained on unlabeled data. BERT has achieved breakthrough results on many natural language understanding tasks. In this work, we explore applying the BERT model to the nanopore methylation detection task to leverage the power of advanced deep learning models.

2.1 BERT and refined BERT model

Figure 1 shows the model structures of the BERT models used for nanopore methylation detection. We explore two types of BERT models. One is the most commonly used BERT (Figure 1(a)), the other is the refined BERT (Figure 1(b)), which is optimized for nanopore methylation detection.

2.1.1 Embedding module

Given extracted features for each position in a sequence, the embedding layer maps input vectors into hidden spaces. In the embedding layer, besides event embedding, positional embedding (PE) is also included. As BERT is used to learn bidirectional contextual information, positional information is important in the modeling. The original PE (Vaswani et al., 2017) uses a sinusoid embedding, which is fixed and not learnable:

\[
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\]

where pos is the position and i indexes the embedding dimension. For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}. According to recent progress (Huang et al., 2020), learnable PE and relative position embedding can help to further improve BERT’s performance. Therefore, in the refined BERT model, we use learnable PE and relative position representation. The learnable PE takes positional embedding vectors as parameters, which are updated during the learning process.
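As an illustration of the fixed sinusoid embedding and of the statement that PE at position pos+k is a linear function of PE at position pos, the following is a minimal sketch in plain Python. The function names `sinusoidal_pe` and `shift_pe` are hypothetical, an even embedding dimension is assumed, and this is not the code used in methBERT.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Fixed sinusoid positional embedding for one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for even in range(0, d_model, 2):
        # `even` plays the role of 2i in the formula, so the exponent is 2i / d_model.
        angle = pos / (10000 ** (even / d_model))
        pe[even] = math.sin(angle)
        pe[even + 1] = math.cos(angle)
    return pe


def shift_pe(pe, k):
    """Obtain PE at position pos + k from PE at position pos by a per-pair rotation.

    Each (sin, cos) pair is rotated by the angle k * w, where w is the frequency of
    that pair. The rotation depends only on k, which is why the map is linear in PE.
    """
    d_model = len(pe)
    out = list(pe)
    for even in range(0, d_model, 2):
        w = 1.0 / (10000 ** (even / d_model))
        s, c = pe[even], pe[even + 1]
        out[even] = s * math.cos(w * k) + c * math.sin(w * k)
        out[even + 1] = c * math.cos(w * k) - s * math.sin(w * k)
    return out


pe_5 = sinusoidal_pe(5, 8)
pe_8_direct = sinusoidal_pe(8, 8)
pe_8_shifted = shift_pe(pe_5, 3)
```

A learnable PE, as used in the refined BERT, would instead store one trainable vector per position and update it by gradient descent.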
2.1.2 Self-attention module

Following the embedding layer, there are three stacked transformer blocks. Each transformer block consists of a multi-head self-attention layer and a position-wise fully connected feed-forward network. The self-attention mechanism is a way of modeling context information for different positions of the input within a deep learning framework. It imitates human visual attention and gives a model the ability to zoom in or out on a particular position of an input sequence. It has proven effective in many different tasks, including natural language understanding, image recognition, and several bioinformatics applications.

An attention function is described as mapping a query Q and a set of key-value (K, V) pairs to an output. Formally, for an input x = (x_1, ..., x_n) of n elements where x_i ∈ R^{d_x}, we calculate query Q, key K and value V vectors of dimension d_z based on the embedding embed(x). The attention module generates a new sequence z = (z_1, ..., z_n) of the same length as x. Each z_i is calculated as a weighted sum of linearly transformed input elements as follows:

\[
z_i = \sum_{j=1}^{n} a_{ij}\,(x_j W^V), \qquad
a_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}, \qquad
e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}},
\]

where W^Q, W^K, W^V ∈ R^{d_x × d_z} are parameter matrices. The self-attention computes a pairwise correlation of embed(x_i) and embed(x_j), which can be calculated in parallel. In a biRNN, by contrast, recurrent hidden units have to be calculated successively. This architectural difference allows BERT to be optimized for fast inference.
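The attention computation described above can be sketched for a single head in plain Python. This is a minimal illustration under stated assumptions: the helper names (`matmul`, `softmax`, `self_attention`) and the toy matrices are hypothetical, no multi-head splitting or relative position terms are included, and a real implementation would use a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is an n x d_x matrix; w_q, w_k and w_v are d_x x d_z parameter matrices.
    Returns the outputs z (n x d_z) and the attention weights a (n x n).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_z = len(w_q[0])
    # e_ij = (x_i W^Q)(x_j W^K)^T / sqrt(d_z)
    scores = [
        [sum(qi * kj for qi, kj in zip(q_row, k_row)) / math.sqrt(d_z) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]  # a_ij
    return matmul(weights, v), weights          # z_i = sum_j a_ij (x_j W^V)


# Toy input: two positions with two features each, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
z, a = self_attention(x, identity, identity, identity)
```

Every row of the score matrix is computed independently of the others, which is the property that lets the attention layers run in parallel, unlike the step-by-step recurrence of a biRNN.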
2.1.3 Relative position representation in self-attention heads

For nanopore sequencing, signals are supposed to be affected mostly by the nucleotide passing through the pore. Its surrounding nucleotides may also have effects on the current signals. For nucleotides that are far away in the context window, it is intuitive to assume they have less effect on the detected current signals. In the refined BERT model, we add relative position representation in the attention module following the method proposed by Shaw et al. (2018). For any two input elements x_i and x_j, the relative position information is modeled with two distinct edge representations a^V_{ij} and a^K_{ij}. For linear sequences, those edges are used to capture the relative position differences between input elements. As the precise relative position is not useful beyond a certain distance, we clip the maximum distance (e.g. ±3bp) when calculating the attention a_{ij} ∈ A:

\[
a^K_{ij} = W^K_{\mathrm{clip}(j-i,\,k)}, \qquad
a^V_{ij} = W^V_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max(-k, \min(k, x)).
\]

2.1.4 Final fully connected layer

After the stacked transformer blocks, the hidden units of the center position are fed to a fully connected linear layer that makes the final prediction of whether a given input contains a methylated motif or not. In the refined BERT, besides the hidden units of the center position, the hidden units in its surrounding window (e.g., ±3bp) are concatenated as the input of the final fully connected layer.

2.2 Applying BERT models for nanopore methylation detection

The BERT models are then applied to replace the classification model (e.g. biRNN) in a typical model-based methylation detection framework. In this framework, raw signals of each read are first translated into nucleotide sequences (basecalling). Signals are then aligned to the corresponding reference nucleotides through the re-squiggle process. After that, the target motif (e.g.
CpG) and its context regions are localized through nucleotide matching, and signals in a context window of a fixed length (e.g. 21bp) are transformed into event-based features as the input of methylation callers. Typical event-based features include signal mean, signal standard deviation, event length, and nucleotide information (Liu et al., 2019). Here, we utilize the framework of DeepMod and perform the same pre-processing of the data. We use Tombo (Ver 1.5.1) to perform re-squiggling and Minimap2 (Ver 2.17-r941) to align events to the reference genome. We use E.coli K-12 MG1655 and H.sapiens GRCh38 as the reference genomes.

3 Experiments

We compare BERT models with the state-of-the-art biRNN model, which is used as the basic network structure in DeepMod (Liu et al., 2019) and DeepSignal (Ni et al., 2019). To compare with other non-deep-learning-based methods, we utilize the CpG benchmark pipeline (Yuen et al., 2020) as a pivot.

3.1 Data and model parameters

We train and test the models on the publicly accessible 5mC (Stoiber et al., 2016; Simpson et al., 2017) and 6mA (Stoiber et al., 2016) datasets. The datasets include samples of E.coli K-12 MG1655, K-12 ER2925, and H.sapiens NA12878. Negative control samples are amplified with PCR and contain no modified bases. Positive control samples are methylated synthetically by specific enzymes after PCR amplification: the SssI, HhaI and MpeI methylases for 5mC, and TaqI, EcoRI and Dam for 6mA. We use the samples that were sequenced with Oxford Nanopore R9 flow cells. For each dataset, we randomly shuffle reads in the positive and negative controls and construct the training, validation and test sets according to a split proportion of 80/10/10 for in-sample evaluation. For the cross-sample evaluation, we train models on one dataset and test on the other dataset.
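The shuffle-and-split step for the in-sample evaluation can be sketched as follows. This is a minimal illustration only: the function name `split_reads`, the fixed seed, and the read identifiers are assumptions for the example and are not taken from the methBERT code.

```python
import random


def split_reads(reads, seed=0, percentages=(80, 10, 10)):
    """Shuffle reads reproducibly and split them into train/validation/test sets."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * percentages[0] // 100
    n_valid = len(shuffled) * percentages[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Hypothetical read identifiers standing in for positive and negative control reads.
reads = [f"read_{i}" for i in range(100)]
train, valid, test = split_reads(reads)
```

Splitting by read, as the text describes, means that the same genomic position may appear in more than one subset through different reads.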
BiRNN uses the default model architecture and parameter setting of DeepMod, which consists of three stacked bidirectional recurrent layers (hidden_size=100) and one fully connected layer for the center position. The total number of biRNN parameters is 570,802 for an input length of 21bp. The BERT models use three attention layers (hidden_size=100, attention_head=4) and one fully connected layer. For the refined BERT, learnable positional encoding, attention with relative position representation and center-hidden-concatenation are used. The basic and refined BERT models have 364,902 and 368,202 parameters in total, respectively, which is around 35% fewer than the biRNN. More detailed information on the model structures is given in the supplementary material. We implement the three models using PyTorch. All models are optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 and a maximum of 50 epochs. Model parameters are selected based on the minimum validation loss.

3.2 Exploring differentiated signal positions in the context window surrounding target motifs

Ideally, we assume a modified nucleotide (e.g., the center position of XXXXXXXXXXC5mCGXXXXXXXXX) has different current signals when compared with the unmodified one. As the boundaries of nucleotide/k-mer signals are not sharp and surrounding nucleotides may also be affected, it is worthwhile investigating signal-shift patterns related to methylation in a larger context.

Fig. 2: Boxplot of positional signal-shift for 5mC and 6mA datasets of the specific motif and methyltransferase. (a1), (a2) and (a3) are on Stoiber’s E.coli 5mC dataset: (a1) Stoiber-E.coli_Cg_SssI, (a2) Stoiber-E.coli_Cg_MpeI, (a3) Stoiber-E.coli_gCgc_HhaI. (b1) and (b2) are on Simpson’s 5mC dataset: (b1) Simpson-E.coli_Cg_SssI, (b2) Simpson-H.Sapiens_Cg_SssI. (c1), (c2) and (c3) are on Stoiber’s E.coli 6mA dataset: (c1) Stoiber-E.coli_gaAttc_EcoRI, (c2) Stoiber-E.coli_tcgA_TaqI, (c3) Stoiber-E.coli_gAtc_Dam. Each dataset is represented in the format dataSource_motif_methyltransferase.

To identify the signal-shift caused by methylation for a specific dataset, we use a simple quantification approach to calculate signal changes at each position in the context window. Given a dataset of a specific motif and methyltransferase, we first cluster instances with the same nucleotide sequence to remove the effect of the nucleotide sequence itself. We retain sequence clusters that contain both methylated and unmethylated instances (≥ 1 of each). For each sequence cluster, we normalize the event signal values of methylated samples by the averaged event signal values of the corresponding unmodified samples at each position. The i-th positional signal-shift is then calculated as s_i^{meth} − avg(s_i^{unmeth}). For those normalized methylated samples, we calculate basic statistics of the signal-shift for each position and draw boxplots for the 5mC and 6mA training sets.

As shown in Figure 2, for all datasets, we observe that positions with significant signal-shift are located within ±3bp of the center position (the 11th), where the target nucleotide is located. For the remaining off-center positions, the averaged signal-shift values are close to 0. This indicates that a modified nucleotide affects not only its own current signals but also the signals of its surrounding nucleotides. Besides, 5mC and 6mA datasets show different positional-signal-shift patterns.
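The quantification procedure described above (cluster by sequence, keep clusters with both classes, subtract the unmethylated average per position) can be sketched as follows. The input layout, a tuple of sequence, methylation flag and per-position signal means, and the function name `positional_signal_shift` are assumptions made for illustration, not the authors' data format.

```python
from collections import defaultdict


def positional_signal_shift(instances):
    """Return per-position signal shifts for methylated instances.

    instances: iterable of (sequence, is_methylated, signal_means), where
    signal_means holds one mean signal value per position of the context window.
    """
    # Cluster instances that share the same nucleotide sequence.
    clusters = defaultdict(lambda: {True: [], False: []})
    for sequence, is_methylated, signal_means in instances:
        clusters[sequence][is_methylated].append(signal_means)

    shifts = []
    for groups in clusters.values():
        meth, unmeth = groups[True], groups[False]
        if not meth or not unmeth:
            continue  # keep only clusters with at least one instance of each class
        n = len(unmeth[0])
        unmeth_avg = [sum(s[i] for s in unmeth) / len(unmeth) for i in range(n)]
        for s in meth:
            # i-th shift: s_i^meth - avg(s_i^unmeth)
            shifts.append([s[i] - unmeth_avg[i] for i in range(n)])
    return shifts


# Toy example with a 3-position window; the "TTT" cluster lacks unmethylated reads.
toy = [
    ("ACG", True, [1.0, 2.0, 3.0]),
    ("ACG", False, [1.0, 1.0, 3.0]),
    ("ACG", False, [1.0, 2.0, 3.0]),
    ("TTT", True, [0.0, 0.0, 0.0]),
]
toy_shifts = positional_signal_shift(toy)
```

Collecting the i-th entry of every returned vector gives the per-position distribution that a boxplot such as Figure 2 would summarize.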
Specific positions, such as the -2bp position (9th) in the 5mC datasets and the +1bp position (12th) in the 6mA datasets, have larger averaged signal-shift values. Such patterns generalize across different datasets with the same motif and methyltransferase. For example, Figure 2 (a1), (b1) and (b2) show a similar positional signal-shift pattern. Among the other methyltransferases, HhaI (Figure 2(a3)) shows a pattern similar to that of SssI, whereas MpeI (Figure 2(a2)) does not show an obviously similar pattern.

Those positional signal patterns can be directly modeled by a biRNN, while for the basic BERT, they are not specifically considered in its model structure. In a biRNN, such as the implementation in DeepMod, the last fully connected layer uses the hidden units of the center time step as the input. Meanwhile, the bidirectional structure and the information decay from both ends toward the center position make the model focus more on center positions. For the basic BERT, as any arbitrary time-step pair is processed with the same attention module, the importance of center positions is not specifically considered in the model. Therefore, we propose a refined BERT model to solve this problem. We incorporate relative-position attention and center-hidden-units concatenation to enable a BERT model to pay more attention to center positions.

3.3 In-sample evaluation

To evaluate model performance, we first perform the in-sample evaluation on the 5mC and 6mA datasets. The predictions of the different models are evaluated on the read level and the genomic level. For the genomic-level evaluation, we group all reads aligned to the same genomic coordinate and use a threshold of predicted methylation percentage ≥ 0.1 (same as DeepMod) to make the prediction for a genomic position. In general, on the five 5mC datasets, the AUC performance of the three models is relatively close on both the read level and the genomic level.
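The genomic-level grouping described above can be sketched as a simple aggregation of read-level calls. The input layout, pairs of a genomic position and a boolean read-level call, and the function name `call_genomic_positions` are hypothetical; only the 0.1 threshold on the predicted methylation percentage is taken from the text.

```python
from collections import defaultdict


def call_genomic_positions(read_predictions, threshold=0.1):
    """Aggregate read-level calls into one call per genomic position.

    read_predictions: iterable of (position, predicted_methylated) pairs, where
    position is any hashable genomic coordinate such as (chromosome, offset).
    A position is called methylated when the fraction of reads predicted as
    methylated is at least `threshold`.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated reads, all reads]
    for position, predicted_methylated in read_predictions:
        counts[position][1] += 1
        if predicted_methylated:
            counts[position][0] += 1
    return {
        position: (methylated / total) >= threshold
        for position, (methylated, total) in counts.items()
    }


# Toy example: one position with half of its reads called methylated, one with none.
toy_calls = [
    (("chr", 100), True),
    (("chr", 100), False),
    (("chr", 200), False),
    (("chr", 200), False),
]
genomic_calls = call_genomic_positions(toy_calls)
```

Because the threshold is low, a handful of accurate read-level calls can decide a position with low coverage, which is consistent with read-level and genomic-level scores diverging in such regions.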
The basic BERT model does not work as well as the biRNN model: its AUC scores are lower. The refined BERT model achieves equivalent or better AUC scores at the genomic level. Note that on the datasets Stoiber_E.coli_CG_MpeI and Simpson_E.coli_CG_SssI, although the read-level AUC of the refined BERT is 0.0014 and 0.005 lower than that of biRNN, respectively, its genomic-level AUC is equal to (0.7482) or better than (0.9456 vs. 0.9284) that of biRNN. This can be explained by more accurate predictions in several regions of low read coverage.

On the 6mA datasets, the refined BERT model achieves the best AUC at both the read level and the genomic level. The performance of the basic BERT model varies across datasets. On Stoiber_E.coli_gaAttc_EcoRI and Stoiber_E.coli_gAtc_Dam, the basic BERT performs slightly better than biRNN in read-level AUC, but it falls clearly behind on Stoiber_E.coli_tcgA_TaqI (0.7573 vs. 0.7722). In summary, in the in-sample evaluation, the refined BERT model achieves competitive or better results than the biRNN model on the benchmark 5mC and 6mA datasets.

Dataset  Species    Motif_Methyltransferase  Model         Single (read-level)        Group (>=1, genomic-level)
                                                           AUC     Precision  Recall  AUC     Precision  Recall
Stoiber  E.coli     GCGC_HhaI                biRNN         0.9205  0.9545     0.8593  0.9322  0.9320     0.9134
Stoiber  E.coli     GCGC_HhaI                BERT_basic    0.9183  0.9528     0.8556  0.9305  0.9299     0.9113
Stoiber  E.coli     GCGC_HhaI                BERT_refined  0.9239  0.9563     0.8655  0.9351  0.9341     0.9177
Stoiber  E.coli     CG_MpeI                  biRNN         0.7184  0.8943     0.4555  0.7482  0.8764     0.5452
Stoiber  E.coli     CG_MpeI                  BERT_basic    0.7045  0.8682     0.4316  0.7312  0.8494     0.5211
Stoiber  E.coli     CG_MpeI                  BERT_refined  0.717   0.9017     0.4511  0.7482  0.8848     0.5412
Stoiber  E.coli     CG_SssI                  biRNN         0.9017  0.9576     0.8097  0.9127  0.9508     0.8420
Stoiber  E.coli     CG_SssI                  BERT_basic    0.9001  0.9534     0.8071  0.9107  0.9463     0.8395
Stoiber  E.coli     CG_SssI                  BERT_refined  0.9068  0.9509     0.821   0.9162  0.9433     0.852
Simpson  E.coli     CG_SssI                  biRNN         0.9514  0.9512     0.9316  0.9284  0.8805     0.9854
Simpson  E.coli     CG_SssI                  BERT_basic    0.9477  0.9469     0.9268  0.9227  0.8718     0.9845
Simpson  E.coli     CG_SssI                  BERT_refined  0.9464  0.9656     0.9124  0.9456  0.9135     0.9803
Simpson  H.Sapiens  CG_SssI                  biRNN         0.9004  0.8891     0.9230  0.9010  0.8900     0.9240
Simpson  H.Sapiens  CG_SssI                  BERT_basic    0.8962  0.8813     0.9248  0.8969  0.8823     0.9256
Simpson  H.Sapiens  CG_SssI                  BERT_refined  0.9045  0.9143     0.8984  0.9053  0.9147     0.9003

Table 1. In-sample evaluation of different deep learning models on 5mC datasets. The best score of each dataset is highlighted in bold.

Dataset  Species    Motif_Methyltransferase  Model         Single (read-level)        Group (>=1, genomic-level)
                                                           AUC     Precision  Recall  AUC     Precision  Recall
Stoiber  E.coli     gaAttc_EcoRI             biRNN         0.8524  0.8088     0.7497  0.8429  0.7797     0.8035
Stoiber  E.coli     gaAttc_EcoRI             BERT_basic    0.8607  0.8151     0.7653  0.8591  0.7969     0.8277
Stoiber  E.coli     gaAttc_EcoRI             BERT_refined  0.8611  0.8826     0.7473  0.8655  0.8596     0.7987
Stoiber  E.coli     tcgA_TaqI                biRNN         0.7722  0.7922     0.5750  0.7750  0.7789     0.6290
Stoiber  E.coli     tcgA_TaqI                BERT_basic    0.7573  0.8168     0.5392  0.7653  0.8063     0.5937
Stoiber  E.coli     tcgA_TaqI                BERT_refined  0.7857  0.7788     0.6064  0.7843  0.7643     0.6586
Stoiber  E.coli     gAtc_Dam                 biRNN         0.6123  0.7656     0.247   0.6337  0.7631     0.3241
Stoiber  E.coli     gAtc_Dam                 BERT_basic    0.6128  0.7329     0.2529  0.631   0.7311     0.3305
Stoiber  E.coli     gAtc_Dam                 BERT_refined  0.6188  0.7513     0.2634  0.6385  0.7471     0.3421

Table 2. In-sample evaluation of different deep learning models on 6mA datasets. The best score of each dataset is highlighted in bold.

3.4 Cross-sample evaluation

We then conduct the cross-sample evaluation. To compare with other, non-deep-learning-based methods, we use the benchmark pipeline (Yuen et al., 2020) as a pivot. We test the models on the same benchmark dataset1, which is generated from Simpson's E. coli dataset with different methylation levels.
In this dataset, 100 arbitrary sites that contain a singleton CpG within a 10 nt window are selected from both the methylated and unmethylated instances of Simpson's E. coli dataset. Yuen et al. created 11 specific mixtures of methylated and unmethylated reads, containing 0%, 10%, ..., 100% methylated reads. Each mixture contains approximately 2400 reads. More detailed information can be found in (Yuen et al., 2020). The deepMOD model used in the original benchmark pipeline is pre-trained on a mixture of all 5mC positive controls (Cg_SssI, Cg_MpeI, and gCgc_HhaI) and negative controls (UMR, con1, and con2). In contrast, we test two versions of each model, each trained on a single dataset with the same methyltransferase, to reduce potential overlap between the training and test sets. All three models are trained on Stoiber_Ecoli_CG_SssI and on Simpson_Hsapiens_CG_SssI, separately. Simpson_Hsapiens_CG_SssI is sequenced by the same group on a different species, while Stoiber_Ecoli_CG_SssI is sequenced by a different group on the same species. We use the METEORE pipeline (Yuen et al., 2020) to generate violin plots of the model predictions on each mixture. Pearson's correlation r, the coefficient of determination r2 and the root mean square error (RMSE) are used as the evaluation metrics for each model.

With Simpson_Hsapiens_CG_SssI as training data, all three models rank just behind the best reported result on this dataset, that of Megalodon (r=0.9860, r2=0.9723, RMSE=0.0758) (Yuen et al., 2020). BiRNN achieves the best Pearson correlation (r=0.9828, r2=0.9658), while the refined BERT achieves the lowest RMSE (0.0732) of the three evaluated models. When Stoiber_Ecoli_CG_SssI is used for training, the performance of all three models decreases. This indicates the challenge of using datasets sequenced by different research groups. Here, both BERT models perform better than biRNN, as shown in Figure 3b.
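The three evaluation metrics named above can be computed as in the following Python sketch. It assumes that the comparison is between the expected methylation fraction of each mixture and the predicted fraction, and that r2 is the square of Pearson's r (which is consistent with the reported pairs of values); the function name `mixture_metrics` is hypothetical, and the METEORE pipeline may compute these quantities differently.

```python
import numpy as np


def mixture_metrics(expected, predicted):
    """Pearson r, r squared and RMSE between expected and predicted fractions."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    # Pearson correlation between expected and predicted methylation fractions.
    r = np.corrcoef(expected, predicted)[0, 1]
    # Root mean square error of the predictions.
    rmse = np.sqrt(np.mean((predicted - expected) ** 2))
    return r, r ** 2, rmse
```

Note that a constant offset in the predictions leaves r at 1 while increasing the RMSE, so the two metrics can rank models differently, as they do for biRNN and the refined BERT above.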
The refined BERT achieves the best r=0.9446, r2=0.8924 and RMSE of 0.1449 among the three models, which demonstrates its ability to generalize to datasets sequenced by different research groups. Based on the reported benchmark results, its Pearson correlation ranks between those reported for deepMOD and deepSignal (Megalodon > DeepMOD_mixModel (0.9467) > refined BERT > DeepSignal_human_hx1 (0.9420) > Guppy > Nanopolish > Tombo).

Fig. 3: Violin plots of prediction results of models trained on different datasets. (a) Models trained with the Simpson_Hsapiens_CG_SssI dataset. (b) Models trained with the Stoiber_Ecoli_CG_SssI dataset.

3.5 Model inference speed

The main motivation for applying BERT models is to use a non-recurrent modeling approach for the nanopore methylation detection task in order to improve the model inference speed. We performed a speed test on a server with 24 CPU cores (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz) and one NVIDIA V100 GPU card. During a run, the CPUs are responsible for data loading and feature extraction, while the GPU performs model inference. We measured the model inference time and total running time of the three models on the benchmark dataset1. For each mixture split, we repeated the run 5 times and took the average. As shown in Table 3, the model inference of the BERT models is around 6x to 7x faster than that of the biRNN model (BERT_refined: 5.96x, BERT_basic: 7.16x). The inference time of the refined BERT is only slightly longer than that of the basic BERT. The gap in total running time is much smaller (BERT_refined: 1.14x, BERT_basic: 1.16x), as data I/O and feature extraction take most of the time. In the current implementation of BERT, we use reads as the basic data unit and integrate the data pre-processing into the read-batch loading process. The data I/O and feature extraction can be accelerated further.

Model         Model inference time  Total running time
biRNN         162.91 s              711.56 s
BERT_basic    22.71 s               615.36 s
BERT_refined  27.29 s               622.73 s

Table 3. Model inference and total running time on the benchmark dataset1 for all 26402 reads.

4 Discussion

A BERT commonly works in a pre-training and fine-tuning approach. In the pre-training phase, a BERT learns bi-directional representations from unlabeled data. After that, the learned feature representations are fine-tuned on task-specific data. This approach has led to state-of-the-art results on many downstream tasks in language understanding. Given the scale of the data, the number of BERT parameters is usually large, and training such a model requires a huge amount of computational resources.
For example, the BERT used for natural language modeling has between 110M and 340M parameters (Devlin et al., 2018). In this work, we did not follow this scheme. Instead, we used the model architecture of BERT to provide a lightweight, non-recurrent replacement for the recurrent biRNN model. In our experiments, the BERT uses three attention layers with 4 attention heads and 100 hidden units for each layer. The total number of model parameters is around 0.37M, which is even less than that of biRNN (0.57M). In the future, when more nanopore methylation data become available, a larger BERT model and a pre-training and fine-tuning scheme can be explored further.

5 Conclusion

In this work, we explored applying BERT models to nanopore methylation detection, with the aim of using a non-recurrent modeling approach for fast inference. We quantified the positional signal-shift related to methylation for different datasets of specific motifs and methyltransferases and found patterns shared across datasets. In the evaluation, we found that the original BERT architecture does not work as well as biRNN. We proposed a refined BERT that incorporates task-specific characteristics into the model. Compared with the original BERT, the refined BERT uses learnable positional encoding and self-attention with relative position representations, and focuses more on the center positions within a ±3 bp range. The experimental results show that the refined BERT achieves results competitive with, and in some cases better than, the state-of-the-art biRNN model on a set of 5mC and 6mA benchmark datasets, while its model inference is about 6x faster. In the cross-sample evaluation, when the training and test data come from different research groups, the BERT models (including the original BERT) perform better than biRNN.
Acknowledgements

We would like to thank Marcus Stoiber and Jared Simpson for making nanopore methylation data publicly available, Zaka Wing-Sze Yuen for providing the benchmark dataset and pipeline, and the authors of deepMOD and deepSignal for providing their source code.

References

Devlin, J. et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Huang, Z. et al. (2020). Improve transformer models with better relative position embeddings. arXiv preprint arXiv:2009.13658.
Kim, D. et al. (2020). The architecture of SARS-CoV-2 transcriptome. Cell, 181(4), 914–921.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Liu, Q. et al. (2019). Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nature Communications, 10(1), 1–11.
Ni, P. et al. (2019). DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics, 35(22), 4586–4595.
Shaw, P. et al. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
Simpson, J. T. et al. (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, 14(4), 407.
Stoiber, M. H. et al. (2016). De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv, page 094672.
Vaswani, A. et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Yuen, Z. W.-S. et al. (2020). Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. bioRxiv.
10_1101-2021_02_08_430270 ---- Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty

David Gerard
Department of Mathematics and Statistics, American University, Washington, DC, 20016, USA

Abstract

Linkage disequilibrium (LD) estimates are often calculated genome-wide for use in many tasks, such as SNP pruning and LD decay estimation. However, in the presence of genotype uncertainty, naive approaches to calculating LD have extreme attenuation biases, incorrectly suggesting that SNPs are less dependent than in reality. These biases are particularly strong in polyploid organisms, which often exhibit greater levels of genotype uncertainty than diploids. A principled approach using maximum likelihood estimation with genotype likelihoods can reduce this bias, but is prohibitively slow for genome-wide applications. Here, we present scalable moment-based adjustments to LD estimates based on the marginal posterior distributions of the genotypes. We demonstrate, on both simulated and real data, that these moment-based estimators are as accurate as maximum likelihood estimators, and are almost as fast as naive approaches based only on posterior mean genotypes. This opens up bias-corrected LD estimation to genome-wide applications. Additionally, we provide standard errors for these moment-based estimators. All methods are implemented in the ldsep R package on GitHub (https://github.com/dcgerard/ldsep).
1 Introduction

Pairwise linkage disequilibrium (LD), the statistical association between alleles at two different loci, has applications in genotype imputation [Wen and Stephens, 2010], genome-wide association studies [Zhu and Stephens, 2018], genomic prediction [Wientjes et al., 2013], population genetics [Slatkin, 2008], and many other tasks [Sved and Hill, 2018]. LD is often estimated from next-generation sequencing data, where the genotypes and haplotypes are not known with certainty [Gerard et al., 2018]. Thus, researchers typically use estimated genotypes, such as posterior mean genotypes [Fox et al., 2019], to estimate LD. However, this can cause biased LD estimates, attenuated toward zero, implying that loci are less dependent than in reality.

This bias is particularly strong in polyploids, and so in Gerard [2020] we derived maximum likelihood estimates (MLEs) that have lower bias and are consistent estimates of LD. Unfortunately, the MLE approach is prohibitively slow. Researchers typically calculate pairwise LD at genome-wide scales, and the MLE approach takes on the order of a tenth of a second for each pair of loci. Thus, for many genome-wide applications, which contain millions of SNPs, LD estimation using the MLE approach would take years of computation time. This is not conducive to large-scale applications.

Keywords and phrases: attenuation bias, genotype likelihood, linkage disequilibrium, polyploidy, reliability ratio.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.08.430270; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Here, we derive scalable approaches to estimate LD that account for genotype uncertainty (Section 2). Our methods use only the first two moments of the marginal posterior genotype distribution for each individual at each locus, which are often provided by, or easily obtainable from, many genotyping programs. We calculate sample moments from these posterior moments, and use these to multiplicatively inflate naive LD estimates. We show, through simulations (Section 3.1) and real data (Section 3.2), that our estimates can reduce attenuation bias and improve LD estimates when genotypes are uncertain. All calculations have computational complexities that are linear in the sample size, and so these estimates are scalable to genome-wide applications.

2 Methods

In this section, we define moment-based estimators of the LD coefficient ∆ [Lewontin and Kojima, 1960], the standardized LD coefficient ∆′ [Lewontin, 1964], and the Pearson correlation ρ [Hill and Robertson, 1968]. We only consider estimating the "composite" versions of these LD measures which, advantageously, are appropriate LD measures for generic autopolyploid, allopolyploid, and segmental allopolyploid populations, even in the absence of Hardy-Weinberg equilibrium [Gerard, 2020]. We also only consider biallelic loci, where the genotype for each individual is the dosage (from 0 to the ploidy) of one of the two alleles.

To define our estimators of LD, we assume the user provides the posterior means and variances of the genotypes for each individual at two loci. The full posterior genotype distribution for each individual is often provided by genotyping software [e.g., Gerard et al., 2018, Gerard and Ferrão, 2019], from which these posterior moments can be obtained.
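As an illustration of how the posterior moments can be obtained from a full posterior genotype distribution, the following Python sketch computes the posterior mean and variance of the dosage for each individual at one locus. The function name `posterior_moments` and the array layout (rows are individuals, columns are dosages 0 to K) are assumptions made for this example; they do not describe the output format of any particular genotyping program or of ldsep.

```python
import numpy as np


def posterior_moments(post):
    """Posterior mean and variance of the dosage for each individual.

    post : array-like, shape (n, K + 1); post[i, k] is the posterior
           probability that individual i has dosage k (rows sum to 1).
    """
    post = np.asarray(post, dtype=float)
    dosages = np.arange(post.shape[1])

    mean = post @ dosages                    # E[G | data]
    var = post @ dosages ** 2 - mean ** 2    # E[G^2 | data] - E[G | data]^2
    return mean, var
```

For a tetraploid individual whose posterior puts all mass on dosage 1, this gives a mean of 1 and a variance of 0; an individual with half its mass on dosage 0 and half on dosage 4 has a mean of 2 and a variance of 4.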
If genotype posteriors are not provided, genotype likelihoods may be normalized to posterior probabilities (assuming a uniform prior) and used in what follows.

Let $X_{iA}$ and $X_{iB}$ be the posterior means at loci A and B for individual $i \in \{1, \ldots, n\}$. Let $Y_{iA}$ and $Y_{iB}$ be the posterior variances at loci A and B for individual $i$. Our estimators are based entirely on the following sample moments of these posterior moments, which may be calculated in time linear in the sample size, $n$.

\[ u_{xA} := \frac{1}{n}\sum_{i=1}^{n} X_{iA}, \qquad u_{xB} := \frac{1}{n}\sum_{i=1}^{n} X_{iB}, \tag{1} \]
\[ v_{xA} := \frac{1}{n-1}\sum_{i=1}^{n} (X_{iA} - u_{xA})^2, \qquad v_{xB} := \frac{1}{n-1}\sum_{i=1}^{n} (X_{iB} - u_{xB})^2, \tag{2} \]
\[ c_x := \frac{1}{n-1}\sum_{i=1}^{n} (X_{iA} - u_{xA})(X_{iB} - u_{xB}), \tag{3} \]
\[ u_{yA} := \frac{1}{n}\sum_{i=1}^{n} Y_{iA}, \quad \text{and} \quad u_{yB} := \frac{1}{n}\sum_{i=1}^{n} Y_{iB}. \tag{4} \]

For a K-ploid species, our LD estimators, which we derive in Section S1, are as follows. The estimated LD coefficient is

\[ \hat{\Delta} := \left(\frac{u_{yA} + v_{xA}}{v_{xA}}\right)\left(\frac{u_{yB} + v_{xB}}{v_{xB}}\right)\left(\frac{c_x}{K}\right). \tag{5} \]

The estimated Pearson correlation is

\[ \hat{\rho} := \sqrt{\frac{u_{yA} + v_{xA}}{v_{xA}}}\sqrt{\frac{u_{yB} + v_{xB}}{v_{xB}}}\,\frac{c_x}{\sqrt{v_{xA} v_{xB}}}. \tag{6} \]

Note that $c_x / \sqrt{v_{xA} v_{xB}}$ is the sample Pearson correlation between posterior mean genotypes. The estimated standardized LD coefficient is

\[ \hat{\Delta}' := \hat{\Delta} / \hat{\Delta}_m, \text{ where} \tag{7} \]
\[ \hat{\Delta}_m := \begin{cases} \min\{u_{xA} u_{xB},\ (K - u_{xA})(K - u_{xB})\}/K^2 & \text{if } c_x < 0, \text{ and} \\ \min\{u_{xA}(K - u_{xB}),\ (K - u_{xA}) u_{xB}\}/K^2 & \text{if } c_x > 0. \end{cases} \tag{8} \]

Equations (5)–(7) take the naive estimators most researchers use in practice (the sample covariance/correlation of posterior means) and inflate these by a multiplicative effect. Such multiplicative effects are sometimes called "reliability ratios" in the measurement error models literature [Fuller, 2009].
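Estimators (5)–(8) can be written compactly in code. The sketch below is a plain NumPy transcription of the formulas for a single pair of loci, offered as an illustration only: the function name `ld_moment_estimates` and its return format are hypothetical, it is not the ldsep implementation, and it applies neither bound truncation nor standard errors. Since (8) leaves the case $c_x = 0$ undefined, the sketch assumes the $c_x > 0$ branch there (the estimate of ∆ is 0 in that case anyway).

```python
import numpy as np


def ld_moment_estimates(xa, xb, ya, yb, ploidy):
    """Moment-based LD estimates from posterior means (x) and variances (y)."""
    xa, xb, ya, yb = (np.asarray(v, dtype=float) for v in (xa, xb, ya, yb))
    K = ploidy

    # Sample moments (1)-(4); each is a single pass over the n individuals.
    uxa, uxb = xa.mean(), xb.mean()
    vxa, vxb = xa.var(ddof=1), xb.var(ddof=1)
    cx = np.cov(xa, xb, ddof=1)[0, 1]
    uya, uyb = ya.mean(), yb.mean()

    # Reliability ratios that inflate the naive estimates.
    ra = (uya + vxa) / vxa
    rb = (uyb + vxb) / vxb

    delta = ra * rb * cx / K                               # equation (5)
    rho = np.sqrt(ra * rb) * cx / np.sqrt(vxa * vxb)       # equation (6)

    # Equation (8); the cx == 0 case is assigned to the second branch (assumption).
    if cx < 0:
        delta_m = min(uxa * uxb, (K - uxa) * (K - uxb)) / K ** 2
    else:
        delta_m = min(uxa * (K - uxb), (K - uxa) * uxb) / K ** 2

    return {"delta": delta, "rho": rho, "delta_prime": delta / delta_m}
```

When all posterior variances are zero, both reliability ratios equal 1 and the function returns the naive estimates, which makes the multiplicative nature of the correction easy to see.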
Due to sampling variability, this inflation could result in estimates that lie beyond the theoretical bounds of the parameters being estimated. In such cases, we apply the following truncations.

\[ \tilde{\rho} := \begin{cases} \max\{\hat{\rho}, -1\} & \text{if } \hat{\rho} < 0 \\ \min\{\hat{\rho}, 1\} & \text{if } \hat{\rho} > 0 \end{cases} \tag{9} \]
\[ \tilde{\Delta} := \begin{cases} \max\{\hat{\Delta}, -\sqrt{(v_{xA} + u_{yA})(v_{xB} + u_{yB})}/K\} & \text{if } \hat{\Delta} < 0 \\ \min\{\hat{\Delta}, \sqrt{(v_{xA} + u_{yA})(v_{xB} + u_{yB})}/K\} & \text{if } \hat{\Delta} > 0 \end{cases} \tag{10} \]
\[ \tilde{\Delta}' := \begin{cases} \max\{\hat{\Delta}', -K\} & \text{if } \hat{\Delta}' < 0 \\ \min\{\hat{\Delta}', K\} & \text{if } \hat{\Delta}' > 0 \end{cases} \tag{11} \]

Standard errors are important for hypothesis testing [Brown, 1975], read-depth suggestions [Maruki and Lynch, 2014], and shrinkage [Dey and Stephens, 2018]. Because estimators (5)–(7) are functions of sample moments, their standard errors can be derived by appealing to the central limit theorem, followed by an application of the delta method (Section S2).

Additional considerations for improving our estimates of the reliability ratios, such as using hierarchical shrinkage [Stephens, 2016], are discussed in Section S3.

All methods are implemented in the ldsep R package on GitHub (https://github.com/dcgerard/ldsep).

3 Results

3.1 Simulations

We compared our moment-based estimators (5)–(7) to the MLE of Gerard [2020] as well as the naive estimator that calculates the sample covariance and sample correlation between posterior mean genotypes at two loci. In each replication, we generated genotypes for n ∈ {10, 100, 1000} individuals with ploidy K ∈ {2, 4, 6, 8} under Hardy-Weinberg equilibrium at two loci with major allele frequencies (pA, pB) ∈ {(0.5, 0.5), (0.5, 0.75), (0.9, 0.9)} and Pearson correlation ρ ∈ {0, 0.5, 0.9}. We then used updog's rflexdog() function [Gerard et al., 2018, Gerard and Ferrão, 2019] to generate read-counts at read-depths of either 10 or 100, a sequencing error rate of 0.01, an overdispersion
value of 0.01, and no allele bias. Updog was then used to generate genotype likelihoods and genotype posterior distributions for each individual at each SNP. These were then fed into ldsep to obtain the MLE, our new moment-based estimator, and the naive estimator. Simulations were replicated 200 times for each unique combination of simulation parameters.

The accuracy of estimating ρ² when pA = pB = 0.5 at a read-depth of 10 is presented in Figure 1. The results for other scenarios are similar and may be found on GitHub (https://github.com/dcgerard/ldfast_sims). We see that the moment-based estimator and the MLE perform comparably, even for small read-depth and sample size. The naive estimator has a strong attenuation bias toward zero. This bias is particularly prominent for higher ploidy levels. For example, for an octoploid species where the true ρ² is 0.81, the naive estimator appears to converge to a ρ² estimate of around 0.25. This bias does not disappear with increasing sample size. Estimated standard errors are reasonably well-behaved, except for ρ̂ and ρ̂² when the sample size is small and the LD is large (Figure 2).

3.2 LD estimates for Solanum tuberosum

We evaluated our methods on the autotetraploid potato (Solanum tuberosum, 2n = 4x = 48) genotyping-by-sequencing data from Uitdewilligen et al. [2013]. We used updog [Gerard et al., 2018, Gerard and Ferrão, 2019] to obtain the posterior moments for each individual's genotype at each SNP on a single super scaffold (PGSC0003DMB000000192).
To remove monoallelic SNPs, we filtered out SNPs with allele frequencies either greater than 0.95 or less than 0.05, and filtered out SNPs with a variance of posterior means less than 0.05. This resulted in 2108 SNPs. We then estimated the squared correlation between each pair of SNPs using either the naive approach of calculating the sample Pearson correlation between posterior means, or our new moment-based approach (6).

Our estimators are scalable. On a 1.9 GHz quad-core PC running Linux with 32 GB of memory, it took a total of 1.9 seconds to estimate all pairwise correlations using our new moment-based approach, which is a small increase over the 0.7 seconds it took to estimate all pairwise correlations using the naive approach. In Gerard [2020], we found that the MLE approach took about 0.1 seconds for each pair of SNPs for tetraploid individuals. Extrapolating this to 2108 SNPs indicates that the MLE approach would take about 2.5 days of computation time to calculate all pairwise LD estimates on this dataset ($\binom{2108}{2} \times 0.1\,\text{sec} \times 1\,\text{min}/60\,\text{sec} \times 1\,\text{hr}/60\,\text{min} \times 1\,\text{d}/24\,\text{hr} = 2.57\,\text{d}$).

The histogram of estimated reliability ratios is presented in Figure 3. We see there that the reliability ratios of most SNPs increase their correlation estimates by less than 10%, but a non-negligible portion have reliability ratios that increase the correlation estimates by more than 10%. To evaluate the LD estimates of SNPs with high reliability ratios, we calculated the MLEs for ρ² between the twenty SNPs with the largest reliability ratios. A pairs plot of the ρ² estimates from the three approaches is presented in Figure 4. We see there that the MLE and the new moment-based approach result in very similar ρ² estimates, while the naive approach using posterior means results in much smaller ρ² estimates.
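The extrapolated MLE running time quoted in this section can be reproduced with a few lines of Python. The variable names are illustrative; the only inputs are the 2108 SNPs and the approximate 0.1 seconds per SNP pair reported in the text.

```python
from math import comb

n_snps = 2108            # SNPs remaining after filtering
seconds_per_pair = 0.1   # approximate MLE time per SNP pair (Gerard [2020])

n_pairs = comb(n_snps, 2)                                # number of SNP pairs
mle_days = n_pairs * seconds_per_pair / (60 * 60 * 24)   # seconds -> days

print(n_pairs)             # 2220778
print(round(mle_days, 2))  # 2.57
```

Because the number of pairs grows quadratically in the number of SNPs, the same per-pair cost applied to millions of SNPs is what leads to the "years of computation time" mentioned in the Introduction.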
4 Discussion

It has been known since at least the time of Spearman that the sample correlation coefficient (or, similarly, the ordinary least squares estimator in simple linear regression) is attenuated in the presence of uncertain variables [Spearman, 1904]. Methods to adjust for this bias include assuming prior knowledge on the measurement variances or the ratio of measurement variances (resulting from, for example, repeated measurements on the same individuals) [Koopmans, 1937, Degracie and Fuller, 1972], using instrumental variables [Carter and Fuller, 1980], and using distributional assumptions [Pal, 1980]. See Fuller [2009] for a detailed introduction to this vast field. Our solution was to use sample moments of marginal posterior moments which, to our knowledge, has never been proposed before.

It is natural to ask if our methods could be used to account for uncertain genotypes in genome-wide association studies. However, the moment-based techniques we used in this manuscript, when applied to simple linear regression with an additive effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the posterior mean as a covariate (Section S4). This supports using the posterior mean as a covariate in simple linear regression with an additive effects model.
This is not to say, however, that using the posterior mean is also appropriate for more complicated models of gene action [Rosyara et al., 2016], or for non-linear models.

Acknowledgments

Most analyses were performed using the R statistical language [R Core Team, 2020].

Data availability

All methods discussed in this manuscript are implemented in the ldsep R package, available on GitHub (https://github.com/dcgerard/ldsep) under a GPL-3 license. Scripts to reproduce the results of this research are available on GitHub (https://github.com/dcgerard/ldfast_sims). All datasets used in this manuscript are publicly available [Uitdewilligen et al., 2013] and may be downloaded from:
• https://doi.org/10.1371/journal.pone.0062355.s004
• https://doi.org/10.1371/journal.pone.0062355.s007
• https://doi.org/10.1371/journal.pone.0062355.s009
• https://doi.org/10.1371/journal.pone.0062355.s010
5 Figures

[Figure 1 image. x-axis: Sample Size (10, 100, 1000); y-axis: ρ̂²; row facets: ploidy (2, 4, 6, 8); column facets: true ρ² (0, 0.25, 0.81); legend (method): MLE, MoM, Naive.]

Figure 1: Estimate of ρ² (y-axis) for the maximum likelihood estimator [Gerard, 2020] (MLE), our new moment-based estimator (6) (MoM), and the naive squared sample correlation coefficient between posterior mean genotypes (Naive). The x-axis indexes the sample size, the row-facets index the ploidy, and the column-facets index the true ρ², which is also indicated by the horizontal dashed red line. These simulations were performed using a read-depth of 10, and major allele frequencies of 0.5 at each locus. The naive estimator presents a strong attenuation bias toward 0, particularly for higher ploidy regimes.
[Figure 2 image. Facets: ẑ, ∆̂′, ∆̂, ρ̂, ρ̂²; x-axis: MAD of Estimates; y-axis: Median of Standard Errors; legend (n and ρ): other; n = 10, ρ = 0.9.]

Figure 2: Median of estimated standard errors (y-axis) versus median absolute deviations (x-axis) of each of the moment-based LD estimators (facets). The line is the y = x line, and points above this line indicate that the estimated standard errors are typically larger than the true standard errors. Estimated standard errors are reasonably unbiased except for ρ̂ and ρ̂² in scenarios with small sample sizes (n = 10) and large levels of LD (ρ = 0.9) (color and shape).

[Figure 3 image. x-axis: Reliability Ratio Estimate; y-axis: count.]

Figure 3: Histogram of estimated reliability ratios (S69) using the data from Uitdewilligen et al. [2013].

[Figure 4 image. Pairs plot with panels MLE, MoM, Naive.]

Figure 4: Pairs plot for ρ² estimates between the twenty SNPs from Uitdewilligen et al. [2013] with the largest estimated reliability ratios when using either maximum likelihood estimation (MLE) [Gerard, 2020], our new moment-based approach (6) (MoM), or the naive approach using just posterior means (Naive).
The dashed line is the y = x line. The MLE and the moment-based approach result in much more similar LD estimates. 8 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Supplementary Material S1 Derivation of LD estimators In this section, we derive estimators (5)–(7). We do this by assuming a normal model on the data and the genotypes. This is obviously not appropriate when using genotypes and sequencing data, but our simulations in Section 3.1 were also accomplished using sequencing data and resulted in very good performance. Let Gi = (GiA,GiB) ᵀ be the genotype for individual i at loci A and B. Let Zi = (ZiA,ZiB) ᵀ be the data for individual i at loci A and B. Then we let Gi ∼ N2(µ, Σ), and (S1) Zi|Gi ∼ N2(Gi,S), where (S2) µ = (µ1,µ2) ᵀ, (S3) Σ = ( σ11 σ12 σ12 σ22 ) , and (S4) S = ( s11 0 0 s22 ) . (S5) To interpret these terms, µ1/K and µ2/K are the allele frequencies at each locus, σ11 and σ22 are the variances of the genotypes at each locus, s11 and s22 are the variances of the genotyping errors at each locus, and σ12 is covariance between genotypes. By elementary methods, we have the well-known result that, marginally, Zi ∼ N2(µ, Σ + S). (S6) We assume the user has provided posterior moments on the genotypes XiA = E[GiA|ZiA],XiB = E[GiB|ZiB],YiA = var(GiA|ZiA), and YiB = var(GiB|ZiB). (S7) These posterior moments are marginal in that they only condition on either ZiA or ZiB, but not both. Thus, we assume they are well-approximated by the model GiA ∼ N(µ1,σ11) (S8) ZiA|GiA ∼ N(GiA,s11) (S9) GiB ∼ N(µ2,σ22) (S10) ZiB|GiB ∼ N(GiB,s22). 
(S11) By standard methods, this results in GiA|ZiA ∼ N [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA ) , ( 1 σ11 + 1 s11 )−1] , and (S12) GiB|ZiB ∼ N [( 1 σ22 + 1 s22 )−1 ( 1 σ22 µ2 + 1 s22 ZiB ) , ( 1 σ22 + 1 s22 )−1] . (S13) 9 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Treating only Zi as random from distribution (S6), we have uxA ≈ E [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA )] (S14) = ( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 E[ZiA] ) (S15) = ( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 µ1 ) (S16) = µ1. (S17) Similarly, uxB ≈ µ2. (S18) Furthermore, vxA ≈ var [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA )] (S19) = ( 1 σ11 + 1 s11 )−2 1 s211 var(ZiA) (S20) = ( 1 σ11 + 1 s11 )−2 σ11 + s11 s211 (S21) = ( 1 σ11 + 1 s11 )−1 σ11 s11 . (S22) Similarly, vxB ≈ ( 1 σ22 + 1 s22 )−1 σ22 s22 . (S23) Now, using the posterior variances, we have uyA ≈ ( 1 σ11 + 1 s11 )−1 , and (S24) uyB ≈ ( 1 σ22 + 1 s22 )−1 . (S25) 10 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. 
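Finally, the expectation of the sample covariance of posterior means is

\begin{align}
c_x &\approx \mathrm{cov}\left[\left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}\left(\frac{1}{\sigma_{11}}\mu_1 + \frac{1}{s_{11}}Z_{iA}\right),\ \left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}\left(\frac{1}{\sigma_{22}}\mu_2 + \frac{1}{s_{22}}Z_{iB}\right)\right] \tag{S26}\\
&= \left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}\left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}\frac{1}{s_{11}}\frac{1}{s_{22}}\mathrm{cov}(Z_{iA}, Z_{iB}) \tag{S27}\\
&= \left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}\left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}\frac{1}{s_{11}}\frac{1}{s_{22}}\sigma_{12}. \tag{S28}
\end{align}

Using a method-of-moments approach, we now have a system of five equations and five unknowns:

\begin{align}
v_{xA} &= \left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}\frac{\sigma_{11}}{s_{11}}, \tag{S29}\\
v_{xB} &= \left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}\frac{\sigma_{22}}{s_{22}}, \tag{S30}\\
u_{yA} &= \left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}, \tag{S31}\\
u_{yB} &= \left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}, \text{ and} \tag{S32}\\
c_x &= \left(\frac{1}{\sigma_{11}} + \frac{1}{s_{11}}\right)^{-1}\left(\frac{1}{\sigma_{22}} + \frac{1}{s_{22}}\right)^{-1}\frac{1}{s_{11}}\frac{1}{s_{22}}\sigma_{12}. \tag{S33}
\end{align}

Solving for $s_{11}$, $s_{22}$, $\sigma_{11}$, $\sigma_{22}$, and $\sigma_{12}$, we obtain:

\begin{align}
\hat{s}_{11} &= \frac{u_{yA}(u_{yA} + v_{xA})}{v_{xA}} \tag{S34}\\
\hat{s}_{22} &= \frac{u_{yB}(u_{yB} + v_{xB})}{v_{xB}} \tag{S35}\\
\hat{\sigma}_{11} &= u_{yA} + v_{xA} \tag{S36}\\
\hat{\sigma}_{22} &= u_{yB} + v_{xB} \tag{S37}\\
\hat{\sigma}_{12} &= \frac{u_{yA} + v_{xA}}{v_{xA}}\,\frac{u_{yB} + v_{xB}}{v_{xB}}\,c_x. \tag{S38}
\end{align}

Using (S14)–(S18), we also have

\begin{align}
\hat{\mu}_1 &= u_{xA}, \text{ and} \tag{S39}\\
\hat{\mu}_2 &= u_{xB}. \tag{S40}
\end{align}

The LD coefficient estimates (5)–(7) can be obtained by substituting these parameter estimates into the following equations [Gerard, 2020]:

\begin{align}
\Delta &= \sigma_{12}/K, \tag{S41}\\
\rho &= \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}, \text{ and} \tag{S42}\\
\Delta' &= \Delta/\Delta_m, \text{ where} \tag{S43}\\
\Delta_m &= \begin{cases} \min\{\mu_1\mu_2,\ (K - \mu_1)(K - \mu_2)\}/K^2 & \text{if } \Delta < 0, \text{ and}\\ \min\{\mu_1(K - \mu_2),\ (K - \mu_1)\mu_2\}/K^2 & \text{if } \Delta > 0. \end{cases} \tag{S44}
\end{align}

S2 Derivation of standard errors

Let

$$ M_i := (X_{iA},\ X_{iA}^2,\ X_{iB},\ X_{iB}^2,\ X_{iA}X_{iB},\ Y_{iA},\ Y_{iB})^\intercal. \tag{S45} $$

Then, by the central limit theorem, we have for

$$ \bar{M} := \frac{1}{n}\sum_{i=1}^{n} M_i, \tag{S46} $$

that $\sqrt{n}(\bar{M} - E[M_i])$ is asymptotically multivariate normal with mean zero and some limiting covariance, say, $\Omega$. Finite variances are guaranteed by the finite support of the genotypes. We can estimate $\Omega$ with the sample covariance matrix

$$ \hat{\Omega} := \frac{1}{n-1}\sum_{i=1}^{n}(M_i - \bar{M})(M_i - \bar{M})^\intercal. \tag{S47} $$

Estimators (5)–(7) are approximately functions of $\bar{M}$. Namely

\begin{align}
\hat{\Delta} &\approx \left(\frac{\bar{M}_6 + \bar{M}_2 - \bar{M}_1^2}{\bar{M}_2 - \bar{M}_1^2}\right)\left(\frac{\bar{M}_7 + \bar{M}_4 - \bar{M}_3^2}{\bar{M}_4 - \bar{M}_3^2}\right)\left(\frac{\bar{M}_5 - \bar{M}_1\bar{M}_3}{K}\right) \tag{S48}\\
\hat{\rho} &\approx \left(\frac{\sqrt{\bar{M}_6 + \bar{M}_2 - \bar{M}_1^2}}{\bar{M}_2 - \bar{M}_1^2}\right)\left(\frac{\sqrt{\bar{M}_7 + \bar{M}_4 - \bar{M}_3^2}}{\bar{M}_4 - \bar{M}_3^2}\right)\left(\bar{M}_5 - \bar{M}_1\bar{M}_3\right) \tag{S49}\\
\hat{\Delta}' &\approx \left(\frac{\bar{M}_6 + \bar{M}_2 - \bar{M}_1^2}{\bar{M}_2 - \bar{M}_1^2}\right)\left(\frac{\bar{M}_7 + \bar{M}_4 - \bar{M}_3^2}{\bar{M}_4 - \bar{M}_3^2}\right)\left(\frac{\bar{M}_5 - \bar{M}_1\bar{M}_3}{K}\right)\Big/\hat{\Delta}_m, \text{ where} \tag{S50}\\
\hat{\Delta}_m &= \begin{cases} \min\{\bar{M}_1\bar{M}_3,\ (K - \bar{M}_1)(K - \bar{M}_3)\}/K^2 & \text{if } \bar{M}_5 - \bar{M}_1\bar{M}_3 < 0, \text{ and}\\ \min\{\bar{M}_1(K - \bar{M}_3),\ (K - \bar{M}_1)\bar{M}_3\}/K^2 & \text{if } \bar{M}_5 - \bar{M}_1\bar{M}_3 > 0. \end{cases} \tag{S51}
\end{align}

These are smooth functions of $\bar{M}$ (except on a space of Lebesgue measure zero), and so admit the following gradients, calculated in Mathematica [Wolfram Research, Inc., 2020]:

$$ g_\Delta := \frac{d\hat{\Delta}}{d\bar{M}} = \begin{pmatrix}
-\dfrac{\left(\bar{M}_1^4\bar{M}_3 - 2\bar{M}_1\bar{M}_5\bar{M}_6 + \bar{M}_1^2\bar{M}_3(-2\bar{M}_2 + \bar{M}_6) + \bar{M}_2\bar{M}_3(\bar{M}_2 + \bar{M}_6)\right)(\bar{M}_3^2 - \bar{M}_4 - \bar{M}_7)}{K(\bar{M}_1^2 - \bar{M}_2)^2(\bar{M}_3^2 - \bar{M}_4)}\\
\dfrac{(\bar{M}_1\bar{M}_3 - \bar{M}_5)\bar{M}_6(\bar{M}_3^2 - \bar{M}_4 - \bar{M}_7)}{K(\bar{M}_1^2 - \bar{M}_2)^2(\bar{M}_3^2 - \bar{M}_4)}\\
-\dfrac{(\bar{M}_1^2 - \bar{M}_2 - \bar{M}_6)\left(-2\bar{M}_3\bar{M}_5\bar{M}_7 + \bar{M}_1\left(\bar{M}_3^4 + \bar{M}_3^2(-2\bar{M}_4 + \bar{M}_7) + \bar{M}_4(\bar{M}_4 + \bar{M}_7)\right)\right)}{K(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)^2}\\
\dfrac{(\bar{M}_1\bar{M}_3 - \bar{M}_5)(\bar{M}_1^2 - \bar{M}_2 - \bar{M}_6)\bar{M}_7}{K(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)^2}\\
\dfrac{(-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6)(-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7)}{K(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)}\\
\dfrac{(-\bar{M}_1\bar{M}_3 + \bar{M}_5)(-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7)}{K(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)}\\
\dfrac{(-\bar{M}_1\bar{M}_3 + \bar{M}_5)(-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6)}{K(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)}
\end{pmatrix}, \tag{S52} $$

$$ g_\rho := \frac{d\hat{\rho}}{d\bar{M}} = \begin{pmatrix}
\dfrac{\left(\bar{M}_1^3\bar{M}_5 + \bar{M}_1^2\bar{M}_3(-\bar{M}_2 + \bar{M}_6) + \bar{M}_2\bar{M}_3(\bar{M}_2 + \bar{M}_6) - \bar{M}_1\bar{M}_5(\bar{M}_2 + 2\bar{M}_6)\right)\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}{(\bar{M}_1^2 - \bar{M}_2)^2(\bar{M}_3^2 - \bar{M}_4)\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}}\\
\dfrac{(\bar{M}_1\bar{M}_3 - \bar{M}_5)(\bar{M}_1^2 - \bar{M}_2 - 2\bar{M}_6)\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}{2(\bar{M}_1^2 - \bar{M}_2)^2(\bar{M}_3^2 - \bar{M}_4)\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}}\\
-\dfrac{\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}\left(\bar{M}_1\bar{M}_3^2(\bar{M}_4 - \bar{M}_7) - \bar{M}_1\bar{M}_4(\bar{M}_4 + \bar{M}_7) + \bar{M}_3\bar{M}_5(-\bar{M}_3^2 + \bar{M}_4 + 2\bar{M}_7)\right)}{(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)^2\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}\\
\dfrac{(\bar{M}_1\bar{M}_3 - \bar{M}_5)\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}\,(\bar{M}_3^2 - \bar{M}_4 - 2\bar{M}_7)}{2(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)^2\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}\\
\dfrac{\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}{(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)}\\
\dfrac{(-\bar{M}_1\bar{M}_3 + \bar{M}_5)\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}{2(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}}\\
\dfrac{(-\bar{M}_1\bar{M}_3 + \bar{M}_5)\sqrt{-\bar{M}_1^2 + \bar{M}_2 + \bar{M}_6}}{2(\bar{M}_1^2 - \bar{M}_2)(\bar{M}_3^2 - \bar{M}_4)\sqrt{-\bar{M}_3^2 + \bar{M}_4 + \bar{M}_7}}
\end{pmatrix}, \tag{S53} $$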
and

$$ g_{\Delta'} := \frac{d\hat{\Delta}'}{d\bar{M}} = g_\Delta/\hat{\Delta}_m - A, \text{ where} \tag{S54} $$

$$ A = \begin{pmatrix} \hat{\Delta}\,C_1(\bar{M}_1, \bar{M}_3, \bar{M}_5)/\hat{\Delta}_m^2 \\ 0 \\ \hat{\Delta}\,C_3(\bar{M}_1, \bar{M}_3, \bar{M}_5)/\hat{\Delta}_m^2 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \tag{S55} $$

$$ C_1(\bar{M}_1, \bar{M}_3, \bar{M}_5) = \begin{cases}
\bar{M}_3/K^2 & \text{if } \bar{M}_5 < \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1\bar{M}_3 < (K - \bar{M}_1)(K - \bar{M}_3)\\
-(K - \bar{M}_3)/K^2 & \text{if } \bar{M}_5 < \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1\bar{M}_3 > (K - \bar{M}_1)(K - \bar{M}_3)\\
-\bar{M}_3/K^2 & \text{if } \bar{M}_5 > \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1(K - \bar{M}_3) > (K - \bar{M}_1)\bar{M}_3\\
(K - \bar{M}_3)/K^2 & \text{if } \bar{M}_5 > \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1(K - \bar{M}_3) < (K - \bar{M}_1)\bar{M}_3
\end{cases} \tag{S56} $$

$$ C_3(\bar{M}_1, \bar{M}_3, \bar{M}_5) = \begin{cases}
\bar{M}_1/K^2 & \text{if } \bar{M}_5 < \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1\bar{M}_3 < (K - \bar{M}_1)(K - \bar{M}_3)\\
-(K - \bar{M}_1)/K^2 & \text{if } \bar{M}_5 < \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1\bar{M}_3 > (K - \bar{M}_1)(K - \bar{M}_3)\\
(K - \bar{M}_1)/K^2 & \text{if } \bar{M}_5 > \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1(K - \bar{M}_3) > (K - \bar{M}_1)\bar{M}_3\\
-\bar{M}_1/K^2 & \text{if } \bar{M}_5 > \bar{M}_1\bar{M}_3 \text{ and } \bar{M}_1(K - \bar{M}_3) < (K - \bar{M}_1)\bar{M}_3
\end{cases} \tag{S57} $$

Though these gradients are rather complicated, they are not computationally intensive: given $\bar{M}$, they can be evaluated in time that does not depend on the sample size. The asymptotic variances of $\hat{\Delta}$, $\hat{\rho}$, and $\hat{\Delta}'$ are

$$ \frac{1}{n}g_\Delta^\intercal\hat{\Omega}g_\Delta,\quad \frac{1}{n}g_\rho^\intercal\hat{\Omega}g_\rho,\quad\text{and}\quad \frac{1}{n}g_{\Delta'}^\intercal\hat{\Omega}g_{\Delta'}, \tag{S58} $$

respectively.

To accommodate missing data, we use only pairwise complete observations (individuals with non-missing values at both loci) for the sample covariance matrix (S47). This ensures that $\hat{\Omega}$ is positive semidefinite and, thus, that the resulting standard errors are non-negative. However, we use all non-missing observations for $\bar{M}$. That is, let $\Theta_A, \Theta_B \subseteq \{1, 2, \ldots, n\}$ be the index sets of non-missing values at loci A and B, respectively.
Then

\begin{align}
\bar{M}_1 &= \frac{1}{|\Theta_A|}\sum_{i \in \Theta_A} X_{iA} \tag{S59}\\
\bar{M}_2 &= \frac{1}{|\Theta_A|}\sum_{i \in \Theta_A} X_{iA}^2 \tag{S60}\\
\bar{M}_3 &= \frac{1}{|\Theta_B|}\sum_{i \in \Theta_B} X_{iB} \tag{S61}\\
\bar{M}_4 &= \frac{1}{|\Theta_B|}\sum_{i \in \Theta_B} X_{iB}^2 \tag{S62}\\
\bar{M}_5 &= \frac{1}{|\Theta_A \cap \Theta_B|}\sum_{i \in \Theta_A \cap \Theta_B} X_{iA}X_{iB} \tag{S63}\\
\bar{M}_6 &= \frac{1}{|\Theta_A|}\sum_{i \in \Theta_A} Y_{iA} \tag{S64}\\
\bar{M}_7 &= \frac{1}{|\Theta_B|}\sum_{i \in \Theta_B} Y_{iB} \tag{S65}\\
\bar{M}^* &= \frac{1}{|\Theta_A \cap \Theta_B|}\sum_{i \in \Theta_A \cap \Theta_B} M_i \tag{S66}\\
\hat{\Omega} &= \frac{1}{|\Theta_A \cap \Theta_B| - 1}\sum_{i \in \Theta_A \cap \Theta_B}(M_i - \bar{M}^*)(M_i - \bar{M}^*)^\intercal \tag{S67}
\end{align}

The asymptotic variances of $\hat{\Delta}$, $\hat{\rho}$, and $\hat{\Delta}'$ are then

$$ \frac{1}{|\Theta_A \cap \Theta_B|}g_\Delta^\intercal\hat{\Omega}g_\Delta,\quad \frac{1}{|\Theta_A \cap \Theta_B|}g_\rho^\intercal\hat{\Omega}g_\rho,\quad\text{and}\quad \frac{1}{|\Theta_A \cap \Theta_B|}g_{\Delta'}^\intercal\hat{\Omega}g_{\Delta'}, \tag{S68} $$

respectively.

S3 Adjusting the reliability ratios

S3.1 Adaptive shrinkage on the reliability ratios

Each SNP has an estimated reliability ratio,

$$ b_j := \frac{u_{yj} + v_{xj}}{v_{xj}}, \tag{S69} $$

which corresponds to the multiplicative adjustment to all LD estimates that include that SNP (see (5)). Some of these reliability ratios may be noisy, having high variance because the SNP (i) has a lower sequencing depth or (ii) has fewer individuals with non-missing data.

Hierarchical shrinkage is a statistical technique that allows high-variance observations to borrow strength from low-variance observations and thus improve estimation performance. Adaptive shrinkage (ash) [Stephens, 2016] is a recently proposed general-purpose hierarchical shrinkage technique that we can use to model the distribution of reliability ratios flexibly, only constraining them to be unimodal. In this section, we will use ash to improve our reliability ratio estimates. We will now describe the procedure for applying ash to shrink the reliability ratios.
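Before turning to that procedure, a small sketch may help make the role of the reliability ratio concrete. The Python function below is an illustration under stated assumptions, not the ldsep implementation: the name `mom_ld` and its argument names are hypothetical, complete data are assumed, sample variances use the $n - 1$ denominator, and no truncation of out-of-range estimates is attempted. It computes the reliability ratios (S69) and the moment-based estimates (S36)–(S38), (S41), and (S42) from posterior means and variances at two loci.

```python
import math


def mom_ld(xa, ya, xb, yb, ploidy):
    """Moment-based LD estimates from posterior means (xa, xb) and
    posterior variances (ya, yb) at loci A and B.

    Hypothetical helper for illustration only; assumes no missing data.
    """
    n = len(xa)
    uxa, uxb = sum(xa) / n, sum(xb) / n
    uya, uyb = sum(ya) / n, sum(yb) / n
    # Sample variances and covariance of the posterior means.
    vxa = sum((x - uxa) ** 2 for x in xa) / (n - 1)
    vxb = sum((x - uxb) ** 2 for x in xb) / (n - 1)
    cx = sum((a - uxa) * (b - uxb) for a, b in zip(xa, xb)) / (n - 1)

    b_a = (uya + vxa) / vxa          # reliability ratio at locus A, (S69)
    b_b = (uyb + vxb) / vxb          # reliability ratio at locus B, (S69)
    sigma11 = uya + vxa              # (S36)
    sigma22 = uyb + vxb              # (S37)
    sigma12 = b_a * b_b * cx         # (S38)

    return {
        "b_a": b_a,
        "b_b": b_b,
        "delta": sigma12 / ploidy,                       # (S41)
        "rho": sigma12 / math.sqrt(sigma11 * sigma22),   # (S42)
        "rho_naive": cx / math.sqrt(vxa * vxb),          # no error adjustment
    }
```

Under these assumptions the adjusted correlation equals the naive correlation multiplied by the square root of the product of the two reliability ratios, which is why a noisy ratio at one SNP affects every LD estimate involving that SNP.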
Our strategy will be to derive the standard errors for the log of the reliability ratios (S69) and apply ash on the log-scale using these standard errors. To begin, let $X_{ij}$ be the posterior mean for individual $i$ at SNP $j$. Let $Y_{ij}$ be the posterior variance for individual $i$ at SNP $j$. Finally, let

\begin{align}
M_{ij} &= (X_{ij},\ X_{ij}^2,\ Y_{ij})^\intercal, \tag{S70}\\
\bar{M}_j &= \frac{1}{n}\sum_{i=1}^{n} M_{ij}, \text{ so} \tag{S71}\\
\bar{M}_{j1} &= \frac{1}{n}\sum_{i=1}^{n} X_{ij}, \tag{S72}\\
\bar{M}_{j2} &= \frac{1}{n}\sum_{i=1}^{n} X_{ij}^2, \text{ and} \tag{S73}\\
\bar{M}_{j3} &= \frac{1}{n}\sum_{i=1}^{n} Y_{ij}. \tag{S74}
\end{align}

Then the log of the reliability ratio for SNP $j$ is

\begin{align}
L_j &:= \log\left(\frac{\bar{M}_{j3} + \bar{M}_{j2} - \bar{M}_{j1}^2}{\bar{M}_{j2} - \bar{M}_{j1}^2}\right) \tag{S75}\\
&= \log(\bar{M}_{j3} + \bar{M}_{j2} - \bar{M}_{j1}^2) - \log(\bar{M}_{j2} - \bar{M}_{j1}^2). \tag{S76}
\end{align}

Let the sample covariance be

$$ \hat{\Omega}_j := \frac{1}{n-1}\sum_{i=1}^{n}(M_{ij} - \bar{M}_j)(M_{ij} - \bar{M}_j)^\intercal. \tag{S77} $$

Then we have by the central limit theorem that $\sqrt{n}(\bar{M}_j - E[M_{ij}])$ is asymptotically multivariate normal with mean zero, and we can use $\hat{\Omega}_j$ as the estimate of the covariance matrix. The gradients for (S75) are

\begin{align}
g_{j1} &:= \frac{dL_j}{d\bar{M}_{j1}} = \frac{-2\bar{M}_{j1}}{\bar{M}_{j3} + \bar{M}_{j2} - \bar{M}_{j1}^2} + \frac{2\bar{M}_{j1}}{\bar{M}_{j2} - \bar{M}_{j1}^2} \tag{S78}\\
g_{j2} &:= \frac{dL_j}{d\bar{M}_{j2}} = \frac{1}{\bar{M}_{j3} + \bar{M}_{j2} - \bar{M}_{j1}^2} - \frac{1}{\bar{M}_{j2} - \bar{M}_{j1}^2} \tag{S79}\\
g_{j3} &:= \frac{dL_j}{d\bar{M}_{j3}} = \frac{1}{\bar{M}_{j3} + \bar{M}_{j2} - \bar{M}_{j1}^2} \tag{S80}
\end{align}

Then, with $g_j := (g_{j1}, g_{j2}, g_{j3})^\intercal$, the variance for $L_j$ is

$$ \hat{s}_j^2 := \frac{1}{n}g_j^\intercal\hat{\Omega}_jg_j. \tag{S81} $$

We apply ash to $(L_1, \hat{s}_1), \ldots, (L_m, \hat{s}_m)$ to obtain shrunken log reliability ratios $\hat{L}_1, \ldots, \hat{L}_m$. Because ash's grid-based scheme for estimating the mode is not the most computationally efficient, we first estimate the mode with the half-sample mode estimator of Robertson and Cryer [1974] before running ash. This procedure seems to result in improved performance for SNPs with unusually variable reliability ratios (Figure S1).
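The delta-method calculation in (S70)–(S81) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `log_reliability_ratio` is hypothetical, complete data for a single SNP are assumed, and the ash shrinkage step itself is not reproduced. The output pairs $(L_j, \hat{s}_j)$ are what would then be passed to ash.

```python
import math


def log_reliability_ratio(post_mean, post_var):
    """Return (L_j, s_j): the log reliability ratio (S75) and its
    delta-method standard error (S81) for one SNP.

    Hypothetical helper for illustration; assumes no missing values.
    """
    n = len(post_mean)
    # M_ij = (X_ij, X_ij^2, Y_ij), as in (S70).
    rows = [(x, x * x, y) for x, y in zip(post_mean, post_var)]
    m = [sum(r[k] for r in rows) / n for k in range(3)]  # (S72)-(S74)

    vx = m[1] - m[0] ** 2      # denominator inside the log in (S75)
    total = m[2] + vx          # numerator inside the log in (S75)
    if vx <= 0 or total <= 0:
        raise ValueError("posterior means have no variance (monoallelic SNP?)")
    log_ratio = math.log(total) - math.log(vx)  # (S76)

    # Gradient of L_j with respect to the three means, (S78)-(S80).
    grad = [
        -2 * m[0] / total + 2 * m[0] / vx,
        1 / total - 1 / vx,
        1 / total,
    ]
    # Sample covariance matrix of the M_ij, (S77).
    omega = [
        [sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
         for b in range(3)]
        for a in range(3)
    ]
    # Quadratic form g' Omega g / n, (S81); clipped at zero for rounding.
    var = sum(grad[a] * omega[a][b] * grad[b]
              for a in range(3) for b in range(3)) / n
    return log_ratio, math.sqrt(max(var, 0.0))
```

The explicit check on `vx` mirrors the concern of Section S3.2: when the posterior means barely vary, the ratio and its standard error become unstable.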
S3.2 Thresholding the reliability ratios

If a researcher accidentally provides a monoallelic SNP, its reliability ratio could explode because the denominator in (S69) is close to zero. For example, the right panel of Figure S2 contains a monoallelic SNP (PotVar0080327) whose reliability ratio estimate (S69) is 100.92. This can yield unstable estimates of LD because some SNPs will, due to sampling variability, have correlations with these monoallelic SNPs on the order of 0.01. For example, the sample correlation between the posterior means of PotVar0080327 and PotVar0078678 (left facet of Figure S2) is -0.0098. But due to the extreme reliability ratio of PotVar0080327, the genotype-error adjusted correlation estimate is -1. This is, of course, unsettling. So by default, our software replaces every reliability ratio estimate (S69) above a user-provided threshold (default of 10) with the median reliability ratio in the dataset.

S4 Genome-wide Association Studies

In this section, we demonstrate that the techniques used in Section S1, when applied to simple linear regression with an additive effects model [Rosyara et al., 2016], result in the standard ordinary least squares estimate when using the posterior mean as a covariate. This indicates that for genome-wide association studies, using the posterior mean is appropriate in a linear regression context when using an additive model for gene action.

Let $G_i$ be the genotype for individual $i$ at a locus. Let $Z_i$ be the data that lead to the genotyping for individual $i$ at the same locus. Let $W_i$ be some quantitative trait of interest for individual $i$. Then we let

\begin{align}
W_i \mid G_i &\sim N(\beta_0 + \beta_1G_i,\ \sigma^2) \tag{S82}\\
Z_i \mid G_i &\sim N(G_i,\ s^2) \tag{S83}\\
G_i &\sim N(\mu,\ \tau^2). \tag{S84}
\end{align}

We suppose the user is only provided the posterior means and variances of each $G_i \mid Z_i$. Let $X_i = E[G_i \mid Z_i]$ and $Y_i = \mathrm{var}(G_i \mid Z_i)$. From elementary methods, we have

\begin{align}
Z_i &\sim N(\mu,\ s^2 + \tau^2) \tag{S85}\\
G_i \mid Z_i &\sim N\left[\left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\left(\frac{1}{\tau^2}\mu + \frac{1}{s^2}Z_i\right),\ \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\right]. \tag{S86}
\end{align}

Let

\begin{align}
u_w &= \frac{1}{n}\sum_{i=1}^{n} W_i \tag{S87}\\
u_x &= \frac{1}{n}\sum_{i=1}^{n} X_i \tag{S88}\\
c_{wx} &= \frac{1}{n-1}\sum_{i=1}^{n}(W_i - u_w)(X_i - u_x) \tag{S89}\\
v_x &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i - u_x)^2 \tag{S90}\\
v_w &= \frac{1}{n-1}\sum_{i=1}^{n}(W_i - u_w)^2. \tag{S91}
\end{align}

We have that

\begin{align}
c_{wx} &\approx \mathrm{cov}(W_i, X_i) \tag{S92}\\
&\approx \mathrm{cov}\left(W_i,\ \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\left(\frac{1}{\tau^2}\mu + \frac{1}{s^2}Z_i\right)\right) \tag{S93}\\
&= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{1}{s^2}\mathrm{cov}(W_i, Z_i) \tag{S94}\\
&= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{1}{s^2}\beta_1\mathrm{var}(G_i) \tag{S95}\\
&= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{\tau^2}{s^2}\beta_1. \tag{S96}
\end{align}

We also have from (S19)–(S22) that

$$ v_x \approx \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{\tau^2}{s^2}. \tag{S97} $$

Using the method of moments with equations (S96) and (S97), we have the following estimator for $\beta_1$:

\begin{align}
\hat{\beta}_1 &= c_{wx}/v_x \tag{S98}\\
&= \frac{c_{wx}}{\sqrt{v_xv_w}}\,\frac{\sqrt{v_w}}{\sqrt{v_x}}. \tag{S99}
\end{align}

Equation (S99) is the sample correlation between the $W_i$'s and the $X_i$'s ($c_{wx}/\sqrt{v_xv_w}$) multiplied by the ratio of the sample standard deviations of the $W_i$'s and the $X_i$'s ($\sqrt{v_w}/\sqrt{v_x}$). This is the well-known formula for the ordinary least squares estimate of $\beta_1$ from a regression of $W_i$ on $X_i$.
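The identity in (S98)–(S99) is easy to check numerically. The sketch below is only an illustration with hypothetical function names and made-up inputs: it computes the moment-based slope $c_{wx}/v_x$, the correlation form (S99), and the least-squares slope from the normal equations for a regression of a trait `w` on posterior means `x`.

```python
import math


def sample_moments(w, x):
    """Sample covariance c_wx and variances v_x, v_w, as in (S89)-(S91)."""
    n = len(w)
    uw, ux = sum(w) / n, sum(x) / n
    cwx = sum((wi - uw) * (xi - ux) for wi, xi in zip(w, x)) / (n - 1)
    vx = sum((xi - ux) ** 2 for xi in x) / (n - 1)
    vw = sum((wi - uw) ** 2 for wi in w) / (n - 1)
    return cwx, vx, vw


def moment_slope(w, x):
    """Method-of-moments estimate of beta_1, equation (S98)."""
    cwx, vx, _ = sample_moments(w, x)
    return cwx / vx


def correlation_form_slope(w, x):
    """Correlation times ratio of standard deviations, equation (S99)."""
    cwx, vx, vw = sample_moments(w, x)
    return (cwx / math.sqrt(vx * vw)) * (math.sqrt(vw) / math.sqrt(vx))


def ols_slope(w, x):
    """Least-squares slope of w on x from the normal equations."""
    n = len(x)
    sx, sw = sum(x), sum(w)
    sxx = sum(xi * xi for xi in x)
    sxw = sum(xi * wi for xi, wi in zip(x, w))
    return (n * sxw - sx * sw) / (n * sxx - sx * sx)
```

All three functions agree up to floating-point error on any input with non-constant `x`, which is the content of the argument above: no reliability-ratio adjustment appears in the slope when the posterior mean is used as the covariate.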
S5 Supplementary figures

[Figure S1 (plots omitted): (A) estimated standard error versus log of reliability ratio; (B) and (C) reference counts versus alternative counts; (D) shrunken reliability ratio versus raw reliability ratio.]

Figure S1: (A) The log of the reliability ratios (x-axis) versus their estimated standard errors (y-axis). The two highlighted points do not seem to fit the trend. When we plot the read-counts for these highlighted points ((B) and (C)), we notice that these two SNPs are almost monoallelic, casting doubt on their unusually large reliability ratios. We plot the shrunken reliability ratios (y-axis) against their original values (x-axis) in (D), noting that the problem SNPs (color) have their reliability ratios strongly adjusted.

[Figure S2 (plots omitted): reference counts versus alternative counts for PotVar0078678 (left) and PotVar0080327 (right).]

Figure S2: Plots of read-counts of two SNPs (facets) from Uitdewilligen et al. [2013]. Alternative counts lie on the x-axis and reference counts lie on the y-axis. The right SNP is monoallelic and because of this the estimated correlation between the two SNPs using raw reliability ratios is -1, even though the sample correlation between posterior means is only -0.0098.

References

A. Brown. Sample sizes required to detect linkage disequilibrium between two or three loci. Theoretical Population Biology, 8(2):184–201, 1975. ISSN 0040-5809. doi: 10.1016/0040-5809(75)90031-3.

R. L. Carter and W. A. Fuller. Instrumental variable estimation of the simple errors-in-variables model. Journal of the American Statistical Association, 75(371):687–692, 1980. doi: 10.1080/01621459.1980.10477534.

J. S. Degracie and W. A. Fuller. Estimation of the slope and analysis of covariance when the concomitant variable is measured with error. Journal of the American Statistical Association, 67(340):930–937, 1972. doi: 10.1080/01621459.1972.10481321.

K. K. Dey and M. Stephens. CorShrink: Empirical Bayes shrinkage estimation of correlations, with applications. bioRxiv, 2018. doi: 10.1101/368316.

E. A. Fox, A. E. Wright, M. Fumagalli, and F. G. Vieira. ngsLD: evaluating linkage disequilibrium using genotype likelihoods. Bioinformatics, 35(19):3855–3856, 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz200.

W. A. Fuller. Measurement error models. John Wiley & Sons, 2009.

D. Gerard. Pairwise linkage disequilibrium estimation for polyploids. bioRxiv, 2020. doi: 10.1101/2020.08.03.234476.

D. Gerard and L. F. V. Ferrão. Priors for genotyping polyploids. Bioinformatics, 36(6):1795–1800, 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz852. bioRxiv: 751784.

D. Gerard, L. F. V. Ferrão, A. A. F. Garcia, and M. Stephens. Genotyping polyploids from messy sequencing data. Genetics, 210(3):789–807, 2018. ISSN 0016-6731. doi: 10.1534/genetics.118.301468.

W. Hill and A. Robertson. Linkage disequilibrium in finite populations. Theoretical and Applied Genetics, 38(6):226–231, 1968. doi: 10.1007/BF01245622.

T. C. Koopmans. Linear regression analysis of economic time series, volume 20. De erven F. Bohn nv, 1937.

R. Lewontin. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 49(1):49, 1964. URL https://www.genetics.org/content/49/1/49.

R. C. Lewontin and K.-i. Kojima. The evolutionary dynamics of complex polymorphisms. Evolution, 14(4):458–472, 1960. doi: 10.1111/j.1558-5646.1960.tb03113.x.

T. Maruki and M. Lynch. Genome-wide estimation of linkage disequilibrium from population-level high-throughput sequencing data. Genetics, 197(4):1303–1313, 2014. ISSN 0016-6731. doi: 10.1534/genetics.114.165514.

M. Pal. Consistent moment estimators of regression coefficients in the presence of errors in variables. Journal of Econometrics, 14(3):349–364, 1980. ISSN 0304-4076. doi: 10.1016/0304-4076(80)90032-9.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL https://www.R-project.org/.

T. Robertson and J. D. Cryer. An iterative procedure for estimating the mode. Journal of the American Statistical Association, 69(348):1012–1016, 1974. doi: 10.1080/01621459.1974.10480246.

U. R. Rosyara, W. S. De Jong, D. S. Douches, and J. B. Endelman. Software for genome-wide association studies in autopolyploids and its application to potato. The Plant Genome, 9(2), 2016. doi: 10.3835/plantgenome2015.08.0037.

M. Slatkin. Linkage disequilibrium-understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6):477, 2008. doi: 10.1038/nrg2361.

C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. doi: 10.2307/1422689.

M. Stephens. False discovery rates: a new deal. Biostatistics, 18(2):275–294, 2016. ISSN 1465-4644. doi: 10.1093/biostatistics/kxw041.

J. A. Sved and W. G. Hill. One hundred years of linkage disequilibrium. Genetics, 209(3):629–636, 2018. ISSN 0016-6731. doi: 10.1534/genetics.118.300642.

J. G. A. M. L. Uitdewilligen, A.-M. A. Wolters, B. B. D'hoop, T. J. A. Borm, R. G. F. Visser, and H. J. van Eck. A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLOS ONE, 8(5):1–14, 2013. doi: 10.1371/journal.pone.0062355.

X. Wen and M. Stephens. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The Annals of Applied Statistics, 4(3):1158–1182, 2010. ISSN 1932-6157. doi: 10.1214/10-aoas338.

Y. C. J. Wientjes, R. F. Veerkamp, and M. P. L. Calus. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics, 193(2):621–631, 2013. ISSN 0016-6731. doi: 10.1534/genetics.112.146290.

Wolfram Research, Inc. Mathematica, Version 12.2, 2020. URL https://www.wolfram.com/mathematica. Champaign, IL.

X. Zhu and M. Stephens. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nature Communications, 9(1):1–14, 2018. doi: 10.1038/s41467-018-06805-x.
Introduction
Methods
Results
  Simulations
  LD estimates for Solanum tuberosum
Discussion
Figures
Derivation of LD estimators
Derivation of standard errors
Adjusting the reliability ratios
  Adaptive shrinkage on the reliability ratios
  Thresholding the reliability ratios
Genome-wide Association Studies
Supplementary figures

10_1101-2021_02_08_430275 ---- Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes

Jianbo Zhang(a,*) and Dilip R. Panthee(a,*)

(a) Department of Horticultural Science, North Carolina State University, Mountain Horticultural Crops Research and Extension Center, 455 Research Drive, Mills River, NC 28759, USA

This manuscript was compiled on November 24, 2020

The genomic region(s) that control a trait of interest can be rapidly identified using BSA-Seq, a technology in which next-generation sequencing (NGS) is applied to bulked segregant analysis (BSA). We recently developed the significant structural variant method for BSA-Seq data analysis that exhibits higher detection power than standard BSA-Seq analysis methods.
Our original algorithm was developed to analyze BSA-Seq data in which genome sequences of one parent served as the reference sequences in genotype calling, and thus required the availability of high-quality assembled parental genome sequences. Here we modified the original script to allow for the effective detection of genomic region-trait associations using only bulk genome sequences. We analyzed a public BSA-Seq dataset using our modified method and the standard allele frequency and G-statistic methods with and without the aid of the parental genome sequences. Our results demonstrate that, without the parental genome sequences, the genomic region(s) associated with the trait of interest could be reliably identified only via the significant structural variant method.

BSA-Seq | PyBSASeq | QTL | genomic region-trait association

Bulked segregant analysis (BSA) was developed for the quick identification of genetic markers associated with a trait of interest (1, 2). For a particular trait, two groups of individuals with contrasting phenotypes are selected from a segregating population. Equal amounts of DNA are pooled from each individual within a group. The pooled DNA samples are then subjected to analysis, such as restriction fragment length polymorphism (RFLP) or random amplification of polymorphic DNA (RAPD). Fragments unique to either group are potential genetic markers that may be linked to the gene(s) that control phenotypic expression of the trait of interest. Candidate markers are further tested against the population to verify the marker-trait associations. With the recent dramatic reductions in cost, next-generation sequencing (NGS) has been applied to more and more BSA studies (3–7). This new technology is referred to as BSA-Seq. In BSA-Seq, pooled DNA samples are not subjected to RFLP/RAPD analysis, but are directly sequenced instead.
Genome-wide structural variants between bulks, such as single nucleotide polymorphisms (SNP) and small insertions/deletions (InDel), are identified based on the sequencing data. Genomic regions linked to the trait-controlling gene(s) are then identified based on the enrichment of the SNP/InDel alleles in those regions in each bulk. The time-consuming and labor-intensive marker development and genetic mapping steps are eliminated in the BSA-Seq method. Moreover, SNPs/InDels can be detected genome-wide via NGS, which allows for the reliable identification of trait-associated genomic regions across the entire genome.

For each SNP/InDel in a BSA-Seq dataset, the base (or oligo in the case of an InDel) that is the same as in the reference genome is termed the reference base (REF), and the other base is termed the alternative base (ALT). Because each bulk contains many individuals, the vast majority of SNP loci in the dataset have both REF and ALT bases. For each SNP, the number of reads of its REF/ALT alleles is termed allele depth (AD). Because of the phenotypic selection via bulking, for trait-associated SNPs, the ALT allele should be enriched in one bulk while the REF allele should be enriched in the other. However, for SNPs not associated with the trait, both ALT and REF alleles would be randomly segregated in both bulks, and neither would be enriched in either bulk. Hence the four AD values of a SNP/InDel (REF and ALT in each of the two bulks) can be used to assess how likely it is to be associated with the trait.

We have previously developed the significant structural variant method for BSA-Seq data analysis (8). In this method, a SNP/InDel is assessed with Fisher's exact test using the AD values of both bulks. A SNP/InDel is considered significant if the P-value of Fisher's exact test is lower than a specific cut-off value, e.g., 0.01. A genomic region normally contains many SNPs/InDels.
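The per-variant test described above can be sketched in a few lines of Python. This is a hedged illustration, not code from PyBSASeq: the function names are hypothetical, the allele depths in the usage note are made up, and the two-sided P-value is computed by summing the probabilities of all 2×2 tables that are no more probable than the observed one, which is one common convention and is assumed here. The 0.01 cut-off is the example value given in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Rows are the two bulks; columns are REF and ALT allele depths.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x REF reads in bulk 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as extreme (no more probable) as observed.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def is_significant(ad_bulk1, ad_bulk2, cutoff=0.01):
    """Flag a SNP/InDel from its (REF, ALT) allele depths in each bulk."""
    (ref1, alt1), (ref2, alt2) = ad_bulk1, ad_bulk2
    return fisher_exact_two_sided(ref1, alt1, ref2, alt2) < cutoff
```

For instance, `is_significant((30, 5), (6, 28))` flags a variant whose REF allele dominates one bulk and whose ALT allele dominates the other, whereas `is_significant((18, 17), (16, 19))` does not.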
The ratio of the significant structural variants to the total structural variants is used to judge if this genomic region is associated with the trait of interest. We tested this method using the BSA-Seq data of a rice cold-tolerance study (9). One of the parents in this study was rice cultivar Oryza sativa ssp. japonica cv. Nipponbare. Its high-quality assembled genome sequences were used as the reference sequences for SNP/InDel calling as well, which makes the genotype calling and SNP/InDel filtering very straightforward: any locus in any bulk that is different from the REF allele is a valid SNP/InDel (8).

Significance Statement
BSA-Seq can be utilized to rapidly identify structural variant-trait associations, and our modified significant structural variant method allows the detection of such associations without sequencing the parental genomes, further lowering the sequencing cost and making BSA-Seq more accessible to the research community and more applicable to species with a large genome.

Author contributions: JZ and DRP conceived the study. JZ developed the algorithm, wrote the Python code, analyzed the data, and wrote and edited the manuscript. DRP edited the manuscript and supervised the project.
The authors declare no conflict of interest.

Only high-quality assembled genome sequences can serve as the reference sequences in genotype calling, an essential step in BSA-Seq data analysis. For most species, however, such sequences are available for only a single or limited number of lines. If lines without high-quality assembled genome sequences are used as the parents in BSA-Seq studies, the parental genomes are often sequenced via NGS for the determination
of the parental origin of SNP alleles and the identification of parental heterozygous SNPs. Modification of our original method to allow the analysis of BSA-Seq data in the absence of assembled or NGS-generated parental genome sequences would provide greater flexibility and significantly reduce sequencing costs. Hence, we modified our original script to allow for the identification of the false-positive SNPs/InDels and part of the heterozygous loci in the parents without the aid of the parental genome sequences. Using the modified script, along with the scripts for the standard G-statistic and allele frequency methods (10, 11), we analyzed a public BSA-Seq dataset using either the genome sequences of both the parents and the bulks, or the bulk genome sequences alone. The results revealed that, when using only the bulk genome sequences, reliable detection of genomic region-trait associations can be achieved only via our modified script.

To whom correspondence should be addressed. E-mail: dilip_panthee@ncsu.edu or zhang.jianbo@gmail.com
bioRxiv | November 24, 2020 | https://doi.org/10.1101/654137
bioRxiv preprint, https://doi.org/10.1101/2021.02.08.430275; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Materials and Methods

The sequencing data used in this study were generated by Lahari et al. (12). Using the allele frequency method, the authors identified a single locus for root-knot nematode resistance in rice.
In that study, the parents of the F2 population were LD24 and VialoneNano, the F2 population comprised 178 plants, and both the resistant bulk and the susceptible bulk contained 23 plants each. The DNA samples of both the parents and the bulks were sequenced using the Illumina MiSeq Sequencing System and MiSeq v3 chemistry.

The BSA-Seq sequencing data (ERR2696318: parent LD24; ERR2696319: parent VialoneNano; ERR2696321: the resistant bulk from the F2 population; ERR2696322: the susceptible bulk from the F2 population) were downloaded from the European Nucleotide Archive (ENA) using the Linux program wget, and the rice reference sequence (Release 47) was downloaded from https://plants.ensembl.org/Oryza_sativa/Info/Index. Sequencing data preprocessing and SNP calling were performed as described previously (8). When analyzing the BSA-Seq data with the genome sequences of both the parents and the bulks, bulk/parent SNP calling was performed separately. The common SNPs of the two SNP datasets were used for the downstream analysis.

The SNP dataset generated via SNP calling was processed with our Python script to identify significant SNP-trait associations. A single script containing all three methods is available at https://github.com/dblhlx/PyBSASeq. The workflow of the scripts is as follows:
1. Read the .tsv input file generated via SNP calling into a Pandas DataFrame.
2. Perform SNP filtering on the Pandas DataFrame.
3. Identify the significant SNPs (sSNPs) via Fisher's exact test (the significant structural variant method), calculate the ΔAF (allele frequency difference between bulks) values (the allele frequency method), or calculate the G-statistic values (the G-statistic method) using the four AD values (ADref1 and ADalt1 of bulk 1 and ADref2 and ADalt2 of bulk 2) of each SNP in the filtered Pandas DataFrame.
4.
Use the sliding window algorithm to plot the sSNP/totalSNP ratios, the ΔAF values, or the G-statistic values against their genomic positions.
5. Estimate the threshold of the sSNP/totalSNP ratio, the ΔAF, or the G-statistic via simulation. The thresholds are used to identify the significant peaks/valleys in the plots generated in step 4.

Identification of the sSNPs, calculation of the sSNP/totalSNP ratios, the G-statistic values, or the ΔAF values, and estimation of their thresholds were carried out as described previously (8). The 99.5th percentile of 10 000 simulated sSNP/totalSNP ratios or G-statistic values was used as the threshold for the significant structural variant method or the G-statistic method, and the 99% confidence interval of 10 000 simulated ΔAF values was used as the threshold for the allele frequency method. For all methods, the size of the sliding windows is 2 Mb and the incremental step is 10 kb.

In our previous work, a parent was the japonica rice cultivar Nipponbare, and its genome sequences were used as the reference sequences for SNP/InDel calling. In the current dataset, the parents were LD24 and VialoneNano; many false-positive SNPs/InDels and heterozygous loci in the parents would be included in the dataset if analyzing the BSA-Seq data using the original script.
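Step 4 of the workflow above, with the 2 Mb window and 10 kb step stated in the text, can be sketched for the sSNP/totalSNP ratio as follows. This is a minimal illustration rather than the PyBSASeq code: the function name sliding_window_ratios, the half-open window convention, reporting a window by its start coordinate, and returning None for empty windows are all assumptions made here.

```python
from bisect import bisect_left


def sliding_window_ratios(positions, is_ssnp, chrom_len,
                          window=2_000_000, step=10_000):
    """Return (window_start, n_ssnp, n_total, ratio) for one chromosome.

    positions: SNP coordinates in bp, sorted in ascending order.
    is_ssnp:   parallel list of booleans, True where the SNP is significant.
    """
    # Prefix sums let each window's sSNP count be read off in O(1).
    prefix = [0]
    for flag in is_ssnp:
        prefix.append(prefix[-1] + int(flag))

    results = []
    start = 0
    while start < chrom_len:
        i = bisect_left(positions, start)            # first SNP >= start
        j = bisect_left(positions, start + window)   # first SNP >= window end
        total = j - i
        ssnp = prefix[j] - prefix[i]
        ratio = ssnp / total if total else None      # None for empty windows
        results.append((start, ssnp, total, ratio))
        start += step
    return results


# Toy chromosome of 15 bp with a 10 bp window and a 5 bp step.
toy = sliding_window_ratios([1, 3, 6, 12, 14],
                            [True, False, True, True, False],
                            chrom_len=15, window=10, step=5)
for row in toy:
    print(row)
```

The same arithmetic underlies the peak values reported later in the Results; for instance, a window holding 675 sSNPs among 1139 SNPs has a ratio of 675/1139, about 0.5926.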
Because neither parent in the current dataset is the reference line, SNP filtering is carried out a little differently from what was described previously (8). SNPs in the following categories are filtered out (see Table S1 for examples):
• Unmapped SNPs or SNPs mapped to the mitochondrial or chloroplast genome
• SNPs with an ‘NA’ value in any column of the DataFrame
• SNPs with zero REF read and a single ALT allele in both bulks/parents
• SNPs with three or more ALT alleles in any bulk/parent
• SNPs with two ALT alleles and a non-zero REF read in any bulk/parent
• SNPs in which the bulk/parent genotypes do not agree with the REF/ALT bases
• SNPs in which the bulk/parent genotypes are not consistent with the AD values
• SNPs with a genotype quality (GQ) score less than 20 in any bulk
• SNPs with very high reads
• SNPs heterozygous in any parent when parental genome sequences are available

Additionally, for SNPs with two ALT alleles and zero REF read in both bulks/parents, the REF allele is replaced with the first allele in the ‘ALT’ field, and the ALT allele is replaced with the second allele in the original ‘ALT’ field. The REF read, and the comma after it, are removed from both allele depth (AD) fields (one for each bulk/parent). This step is carried out before checking the genotype agreement between bulks and the REF/ALT fields. When parental genome sequences are involved, the common SNP set is identified before filtering out the SNPs with a low GQ score in the parental SNP dataset.

The tightly linked SNP alleles from the same parent tend to segregate together and should have a similar extent of allele enrichment, and thus similar AD values. In a SNP dataset, the genotypes of each bulk/parent are represented as ‘GTref/GTalt’ when a SNP contains both the REF base and the ALT base in the genotype (GT) field, and the AD values in each bulk/parent are represented as ‘ADref,ADalt’.
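The re-labelling rule for SNPs with two ALT alleles and zero REF reads, described above, can be sketched as a small string operation on the REF, ALT and AD fields. The function name normalize_two_alt, its argument layout and the example values are hypothetical and are not taken from the PyBSASeq script; returning None stands in for "the rule does not apply", which covers the filtered case of a non-zero REF read.

```python
def normalize_two_alt(ref, alt, ad_fields):
    """Re-label a two-ALT, zero-REF-read SNP as an ordinary REF/ALT SNP.

    ref:       REF base from the input table, e.g. "C"
    alt:       comma-separated ALT field, e.g. "A,G"
    ad_fields: one AD string per bulk/parent, e.g. ["0,12,7", "0,5,15"]
    Returns (new_ref, new_alt, new_ad_fields), or None if the rule does
    not apply to this SNP.
    """
    alts = alt.split(",")
    if len(alts) != 2:
        return None
    depths = [[int(x) for x in ad.split(",")] for ad in ad_fields]
    # Every sample must report three depths (REF, ALT1, ALT2) with REF == 0.
    if any(len(d) != 3 or d[0] != 0 for d in depths):
        return None
    # Drop the leading REF depth (and its comma) from each AD field.
    new_ads = [",".join(str(x) for x in d[1:]) for d in depths]
    # The first ALT allele becomes REF, the second becomes ALT.
    return alts[0], alts[1], new_ads


print(normalize_two_alt("C", "A,G", ["0,12,7", "0,5,15"]))
print(normalize_two_alt("C", "A,G", ["3,12,7", "0,5,15"]))
```

After this step the SNP has the same ‘ADref,ADalt’ layout as every other SNP, so the later genotype checks and the per-SNP statistics need no special case.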
The genotype and the AD value of the REF allele are always placed first in both fields. For a SNP locus in the .tsv input file, the allele having the same genotype as that in the reference genome is defined as the REF allele. However, it is highly unlikely that all of the SNP alleles in a parent are the same as those in the reference genome, except in instances where the reference genome sequences used in SNP calling are from one of the parents, as in the cold-tolerance study mentioned above (9). It is necessary to place the genotypes and AD values of all SNP alleles from one parent (e.g., LD24) in the REF position, and those from the other parent (e.g., VialoneNano) in the ALT position in the GT and AD fields to make the bulk dataset consistent. Thus, for a particular SNP, if the REF base in the .tsv file is different from the genotype of LD24 (either parent will work), its GT/AD values are swapped, e.g., ‘G/A’ to ‘A/G’ and ‘19,9’ to ‘9,19’. AD/GT swapping is performed following SNP filtering and is performed only when the parental genome sequences are used to aid BSA-Seq data analysis. Equation 1 is used for ΔAF calculation. AD swapping ensures that adjacent SNPs have similar ΔAF values.

\[ \Delta AF = \frac{AD_{alt2}}{AD_{ref2} + AD_{alt2}} - \frac{AD_{alt1}}{AD_{ref1} + AD_{alt1}} \tag{1} \]

Results

The original sequence reads were 3.9G, 3.8G, 3.4G, and 3.5G in ERR2696318 (parent LD24), ERR2696319 (parent VialoneNano), ERR2696321 (the resistant bulk), and ERR2696322 (the susceptible bulk), respectively; they became 3.8G, 3.6G, 3.3G, and 3.4G after quality control, which correspond to 8.8×, 8.5×, 7.6×, and 7.9× coverage, respectively (12). The preprocessed sequences were used for SNP calling to generate a SNP
dataset, which was analyzed using the modified significant structural variant method, the G-statistic method, and the allele frequency method with or without the aid of the parental genome sequences.

BSA-Seq data analysis using the genome sequences of both the parents and the bulks. The SNP calling-generated parent/bulk SNP dataset was processed with the Python script PyBSASeq_WP.py (https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq_WP.py). SNP filtering was performed as described in the Materials and Methods section. The parental SNP dataset was processed first, and the SNPs heterozygous in any parent were eliminated because all algorithms assume all SNP loci are homozygous in the parental lines. Threshold estimation is based on this assumption. Although most rice breeding lines should be homozygous in most loci, more than 7% heterozygous SNP loci (2 011 062 homozygous and 153 000 heterozygous) were identified in the parental SNP dataset. However, because GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-), we cannot rule out the possibility that some of the heterozygous loci were caused by sequencing artifacts.
The bulk SNP dataset was processed second. The SNPs with the same chromosome ID and the same genomic coordinate in both datasets were considered common SNPs. Common SNPs in the bulk dataset were used to detect SNP-trait associations for all three methods.

Table 1. Chromosomal distribution of SNPs - using the genome sequences of both the parents and the bulks

Chromosome     sSNPs    TotalSNPs    sSNP/totalSNP
1              1170     139 910      0.0084
2              310      125 129      0.0025
3              459      102 331      0.0045
4              330      89 577       0.0037
5              372      84 706       0.0044
6              1581     83 605       0.0189
7              378      94 371       0.0040
8              258      80 617       0.0032
9              1292     67 157       0.0192
10             363      56 681       0.0064
11             2765     88 287       0.0313
12             241      87 145       0.0028
Genome-wide    9519     1 099 516    0.0087

The significant structural variant method: Each SNP in the dataset was tested via Fisher's exact test using its four AD values, and SNPs with P-values less than 0.01 were defined as sSNPs. The chromosomal distributions of the sSNPs and the total SNPs are summarized in Table 1. Using the sliding window algorithm, the genomic distribution of the sSNPs, the total SNPs, and the sSNP/totalSNP ratios of sliding windows were plotted against their genomic position (Figure 1a and Figure 1b). A genome-wide threshold was estimated as 0.0538 via simulation as described previously (8). Two peaks above the threshold were identified: a minor one on chromosome 9 and a major one on chromosome 11. The peak on chromosome 9 was at 1.11 Mb; its sliding window contained 230 sSNPs and 3738 total SNPs, corresponding to an sSNP/totalSNP ratio of 0.0615. The peak on chromosome 11 was at 26.44 Mb; its sliding window contained 675 sSNPs and 1139 total SNPs, corresponding to an sSNP/totalSNP ratio of 0.5926. The sliding window-specific threshold was estimated for each peak via simulation, and the values were 0.0551 and 0.0623, respectively, indicating both peaks were significant.
Both values are higher than the genome-wide threshold, probably due to the lower numbers of total SNPs in these sliding windows. The average number of SNPs per sliding window was 5893.

The G-statistic method: The G-statistic value of each SNP in the dataset was calculated, and its threshold was estimated via simulation as described previously (8). Using the sliding window algorithm, the G-statistic value of each sliding window (the average G-statistic value of all SNPs in that sliding window) was plotted against its genomic position (Figure 1c), and the curve pattern was very similar to that in Figure 1b. A significant peak was identified on chromosome 11; its position was at 26.49 Mb and its G-statistic value was 12.8120, well above the threshold of 9.0224 (99.5th percentile).

The allele frequency method: The ΔAF value of each SNP in the dataset was calculated, and the ΔAF threshold of the SNP was estimated via simulation as described previously (8). Using the sliding window algorithm, the ΔAF value of each sliding window (the average ΔAF value of all SNPs in that sliding window) was plotted against its genomic position (Figure 1d). A significant peak was identified on chromosome 11; the peak was located at 26.45 Mb, its ΔAF value was 0.7173, and the 99% confidence interval was −0.6508 to 0.6497.

BSA-Seq data analysis using only the bulk genome sequences. The SNP calling-generated bulk SNP dataset was processed with the Python script PyBSASeq.py (https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq.py). All the methods and parameters were the same as above; the only difference was that the parental SNP dataset was not used.

The significant structural variant method: The chromosomal distributions of the sSNPs and total SNPs are summarized in Table 2. The total number of SNPs was 1 346 185 here, much higher than the 1 099 516 obtained above.
The genomic distribution of the sSNPs, the total SNPs, and the sSNP/totalSNP ratios of the sliding windows are presented in Figure 2a and Figure 2b. The patterns of the curves were very similar to those in Figure 1a and Figure 1b. One obvious difference was that the sSNP/totalSNP ratios of the sliding windows were much lower than those in Figure 1b, so the minor locus on chromosome 9 was missed. Only the peak on chromosome 11 was significant; it was located at 26.96 Mb, a 520 kb shift compared to Figure 1b. The sliding window contained 1122 sSNPs and 2945 total SNPs, corresponding to an sSNP/totalSNP ratio of 0.3810, well above the genome-wide threshold (0.0535) and the sliding window-specific threshold (0.0601). The average number of SNPs per sliding window was 7215.

The G-statistic method: The pattern of the G-statistic value plot (Figure 2c) was very similar to that in Figure 1c, but the G-statistic values were significantly lower than those in Figure 1c, and the threshold did not change much. Only a single sliding window was above the threshold (8.8953); its position was at 29.96 Mb, and its G-statistic value was 8.9060.

The allele frequency method: Without the aid of the parental genome sequences, the pattern of the ΔAF curve of chromosome 11 (Figure 2d), especially the genomic region associated with the trait, was drastically different from that in Figure 1d. Differences in the curve patterns were observed in other chromosomes as well, but they were relatively minor. All ΔAF values were within the 99% confidence interval, although
AD swapping was performed on only 67 396 SNPs, 6.1% of total SNPs.

Figure 1. BSA-Seq data analysis using the genome sequences of both the parents and the bulks. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs (blue) and totalSNPs (black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values. [Panels A to D are plotted for Chr1 to Chr12 against genomic position (×10 Mb); y-axes: number of SNPs, sSNP/totalSNP, G-statistic, ΔAF.]

Table 2. Chromosomal distribution of SNPs - using only the bulk genome sequences

Chromosome     sSNPs     TotalSNPs    sSNP/totalSNP
1              1335      163 260      0.0082
2              391       146 877      0.0027
3              578       120 319      0.0048
4              442       110 952      0.0040
5              481       103 362      0.0047
6              1724      103 416      0.0167
7              459       114 564      0.0040
8              373       103 385      0.0036
9              1410      82 744       0.0170
10             572       78 206       0.0073
11             3120      112 719      0.0277
12             281       106 381      0.0026
Genome-wide    11 166    1 346 185    0.0083

Discussion

We tested how parental genome sequences affected the detection of SNP-trait associations via BSA-Seq using a dataset of the rice root-knot nematode resistance. Using the genome
Using the genome 316 sequences of both the parents and bulks, a major locus on 317 chromosome 11 and a minor locus on chromosome 9 were de- 318 tected via the significant structural variant method. However, 319 only the major locus was detected via the G-statistic method 320 and the allele frequency method. The positions of the peaks 321 detected via different methods were not the same, but they 322 were very close to each other. Using only the bulk genome 323 sequences, the major locus can be detected via only the signif- 324 icant structural variant and G-statistic methods. The allele 325 frequency method uses the ΔAF value of a SNP to measure 326 allele (REF/ALT) enrichment in the SNP locus, and the G- 327 statistic method uses the G-statistic value of a SNP to measure 328 the allele enrichment; ΔAF and G-statistic are parameters 329 at the SNP level, therefore, both methods use a SNP level 330 parameter to identify significant sliding windows for the detec- 331 tion of the genomic region-trait associations. The significant 332 structural variant method, however, uses the sSNP/totalSNP 333 ratio, a parameter at the sliding window level, to measure the 334 sSNP enrichment in a sliding window for the identification of 335 the trait-associated genomic regions. A SNP normally has less 336 than 100 reads because of the cost concern, while a sliding 337 4 | https://doi.org/10.1101/654137 Zhang et al. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. 
window normally contains thousands of SNPs. Thus, the significant structural variant method has much higher statistical power, which is consistent with our observation. Our results revealed that the parental genome sequences did not much affect the plot patterns of the sSNP/totalSNP ratios and the G-statistic values. However, the plot patterns of the ΔAF value of chromosome 11 were altered dramatically when the parental genome sequences were not used.

Figure 2. BSA-Seq data analysis using only the bulk genome sequences. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs (blue) and totalSNPs (black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values. [Panels A to D are plotted for Chr1 to Chr12 against genomic position (×10 Mb); y-axes: number of SNPs, sSNP/totalSNP, G-statistic, ΔAF.]

The significant structural variant method assesses if a SNP is likely associated with the trait via Fisher's exact test. The greater the difference in ALT proportion between the bulks, the smaller the P-value of Fisher's exact test, and the more likely the SNP is associated with the trait. Fisher's exact test takes a numpy array or a Python list as its input; the same P-value will be obtained with either [[ADref1, ADalt1], [ADref2, ADalt2]] or [[ADalt1, ADref1], [ADalt2, ADref2]] as its input.
The G-statistic method assesses if a SNP is likely associated with the trait via the G-test; the greater the G-statistic value of a SNP, the more likely it contributes to the trait phenotype (11). The G-statistic values are the same with either input [[ADref1, ADalt1], [ADref2, ADalt2]] or [[ADalt1, ADref1], [ADalt2, ADref2]]. The order of the AD values (REF/ALT reads) in bulks does not affect the P-value of Fisher's exact test or the G-statistic value of the G-test, which is why the parental genome sequence-guided AD swapping does not alter the curve patterns of either method. Therefore, theoretically, parental genome sequences are not required to identify genomic region-trait associations in either the significant structural variant method or the G-statistic method.

When the parental genome sequences were used, AD value swapping was performed for the SNPs in which the genotype of LD24 was different from the REF base, and the ΔAF values of these SNPs were calculated based on the swapped AD values using equation 1. AD swapping makes the adjacent SNP alleles from the same parent have similar AD values and similar ΔAF values. If AD swapping is not performed, the ΔAF values of such SNPs are calculated using equation 2. Because ADref/(ADref + ADalt) = 1 − ADalt/(ADref + ADalt) in each bulk, equation 2 can be converted to equation 3, which produces a value opposite in sign to that produced by equation 1. For two adjacent SNPs in LD24, where one SNP has the same genotype as the REF base while the other has the same genotype as the ALT base, they would have opposite ΔAF values if AD swapping is not performed. For the SNPs that do not contribute to the trait phenotype and are not linked to any trait-associated genomic regions, their ΔAF values should fluctuate around zero. The parental genome sequences will have less effect on the ΔAF value of the sliding windows containing such SNPs.
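The contrast drawn above, that swapping REF and ALT leaves the G-statistic unchanged but flips the sign of ΔAF, can be checked numerically with a short standard-library sketch. The G-statistic is written here in its textbook form, G = 2 Σ O ln(O/E) over the 2×2 table, which is assumed rather than copied from the PyBSASeq script; the function names are hypothetical, and the read counts reuse the ‘19,9’ example from the Materials and Methods.

```python
from math import log


def g_statistic(ad_ref1, ad_alt1, ad_ref2, ad_alt2):
    """G-test statistic, G = 2 * sum(O * ln(O / E)), for a 2x2 AD table."""
    table = [[ad_ref1, ad_alt1], [ad_ref2, ad_alt2]]
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(rows)
    g = 0.0
    for i in range(2):
        for j in range(2):
            obs = table[i][j]
            if obs > 0:  # a zero cell contributes nothing to the sum
                g += obs * log(obs * n / (rows[i] * cols[j]))
    return 2.0 * g


def delta_af(ad_ref1, ad_alt1, ad_ref2, ad_alt2):
    """Equation 1: ALT allele frequency of bulk 2 minus that of bulk 1."""
    return ad_alt2 / (ad_ref2 + ad_alt2) - ad_alt1 / (ad_ref1 + ad_alt1)


original = (19, 9, 9, 19)   # ADref1, ADalt1, ADref2, ADalt2
swapped = (9, 19, 19, 9)    # REF and ALT exchanged within each bulk

print(g_statistic(*original), g_statistic(*swapped))   # identical values
print(delta_af(*original), delta_af(*swapped))         # opposite signs

# Two linked SNPs that are labelled inconsistently average out to zero.
print((delta_af(*original) + delta_af(*swapped)) / 2)
```

The last line shows why a window average of ΔAF is sensitive to inconsistent allele labels while the G-statistic average is not.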
However, for trait-associated SNPs, adjacent SNPs with opposite ΔAF values would cancel each other out and lower the ΔAF value of the sliding window significantly, which is the case observed on chromosome 11 in Figure 2d.

\[ \Delta AF = \frac{AD_{ref2}}{AD_{ref2} + AD_{alt2}} - \frac{AD_{ref1}}{AD_{ref1} + AD_{alt1}} \tag{2} \]

\[ \Delta AF = \frac{AD_{alt1}}{AD_{ref1} + AD_{alt1}} - \frac{AD_{alt2}}{AD_{ref2} + AD_{alt2}} \tag{3} \]

When the parental genome sequences were not used, the sSNP/totalSNP ratios and the G-statistic values were significantly lower. The peak sSNP/totalSNP ratio on chromosome 11 was 0.5926 in Figure 1b, while it was 0.3810 in Figure 2b; the peak G-statistic values showed a similar decrease. The decrease in the sSNP/totalSNP ratio and the G-statistic value is likely caused by sequencing artifacts and heterozygosity in the parental lines. There were 1 346 185 SNPs in the bulk dataset when not using the parental genome sequences, while there were 1 099 516 SNPs in the dataset with the aid of the parental genome sequences. Comparison of the two SNP datasets revealed that 109 445 SNPs were unique to the bulks. Because all the SNPs in the bulks are derived from the parental lines, crossing should not generate new SNPs; thus this category of SNPs was most likely caused by sequencing artifacts. The sequencing coverage in the bulks was less than eight, which is very low. Higher sequencing coverage would help decrease the number of SNPs derived from sequencing artifacts.
Additionally, 137 224 SNPs were heterozygous in the parental lines. Without the parental genome sequences, this category of SNPs could not be filtered out from the bulk SNP dataset. However, the number of these SNPs can be decreased by selfing the parental lines for more generations: five generations of selfing can decrease the heterozygosity of both parental lines to a maximum of 6.25%.

To determine how parental heterozygosity and sequencing artifacts affected the detection of genomic region-trait associations, we removed the heterozygous SNPs or the bulk-specific SNPs from the bulk SNP dataset, and analyzed the data separately. By removing the heterozygous SNPs, the peak on chromosome 11 was shifted to 26.28 Mb for both the sSNP/totalSNP ratio and the G-statistic value, and the sSNP/totalSNP ratio of the peak was increased to 0.4835, well above the sliding window-specific threshold of 0.0603. The G-statistic value of the peak was 10.8411, significantly higher than the threshold of 8.9532 as well. By removing bulk-specific SNPs, the peak on chromosome 11 shifted to 26.49 Mb for both the sSNP/totalSNP ratio and the G-statistic value. The sSNP/totalSNP ratio of the peak and the sliding window-specific threshold were 0.4302 and 0.0637, respectively, and the G-statistic value of the peak and the threshold were 9.7591 and 8.9092, respectively. Although both the sSNP/totalSNP ratio and the G-statistic value were lower than above, they were still higher than their corresponding thresholds. While it seemed that the heterozygous SNPs affected the sSNP/totalSNP ratio and the G-statistic value a little more than the bulk-specific SNPs, it is more likely that both produced similar levels of noise for the sSNP/totalSNP ratio and the G-statistic value, considering that the heterozygous SNPs outnumbered the bulk-specific SNPs by 27 779.
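The two SNP categories removed in the analysis above can be separated by keying each SNP on its chromosome ID and genomic coordinate, the same criterion the paper uses to define common SNPs. The sketch below is a hedged illustration, not the authors' code: the function name partition_bulk_snps, the dictionary layout and the toy coordinates are hypothetical. The final two assertions only check that the SNP counts reported in the text and in Tables 1 and 2 are mutually consistent.

```python
def partition_bulk_snps(bulk_keys, parent_is_het):
    """Split bulk SNPs into retained, parental-heterozygous and bulk-specific.

    bulk_keys:     iterable of (chromosome, position) keys from the bulk calls
    parent_is_het: dict mapping (chromosome, position) to True when the SNP
                   is heterozygous in either parent (hypothetical layout)
    """
    retained, parent_het, bulk_specific = [], [], []
    for key in bulk_keys:
        if key not in parent_is_het:
            bulk_specific.append(key)   # absent from the parental call set
        elif parent_is_het[key]:
            parent_het.append(key)      # heterozygous in a parent
        else:
            retained.append(key)        # common SNP, homozygous in parents
    return retained, parent_het, bulk_specific


# Toy data with made-up coordinates.
bulk = [("11", 100), ("11", 250), ("11", 400), ("9", 50)]
parents = {("11", 100): False, ("11", 250): True, ("9", 50): False}
print(partition_bulk_snps(bulk, parents))

# Counts from the text: common + heterozygous + bulk-specific = Table 2 total.
assert 1_099_516 + 137_224 + 109_445 == 1_346_185
assert 137_224 - 109_445 == 27_779
```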
When using only the bulk genome sequences, the sSNP/totalSNP peak position on chromosome 11 was shifted 0.52 Mb (26.44 Mb to 26.96 Mb) due to the presence of the bulk-specific SNPs and the heterozygous SNPs in the dataset, but this is a very short distance for genetic mapping. Although only a single dataset was examined here, the genome-wide similarity of the sSNP/totalSNP curve patterns in Figure 1b and Figure 2b suggests that the significant structural variant method is highly reproducible using only the bulk genome sequences.

Conclusions

The plotting pattern of the ΔAF values in the trait-associated genomic region was very different when using only the bulk genome sequences. Without the aid of the parental genome sequences, the ΔAF values of the sliding windows could not be correctly calculated; thus, the allele frequency method cannot be used to identify SNP-trait associations. In contrast, the parental genome sequences do not affect the plotting patterns of either the significant structural variant method or the G-statistic method, but the sSNP/totalSNP ratios and the G-statistic values decreased significantly due to sequencing artifacts and/or heterozygosity of the parental lines. Because of its high detection power, major SNP-trait associations can still be reliably detected via the significant structural variant method even when the sequence coverage is very low.

Acknowledgments. JZ was supported by the National Science Foundation grant [IOS-1546625 to DRP]. We are grateful to Lahari et al. for generating the sequencing data and making it available to the public. We thank Irene E. Palmer for critical review and thank Nathan Lynch for valuable comments. The manuscript was prepared using a modified version of the PNAS LaTeX template.

Bibliography
1.
RW Michelmore, I Paran, RV Kesseli, Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. U.S.A. 88, 9828–9832 (1991).
2. JJ Giovannoni, RA Wing, MW Ganal, SD Tanksley, Isolation of molecular markers from specific chromosomal intervals using DNA pools from existing mapping populations. Nucleic Acids Res 19, 6553–6568 (1991).
3. I Imerovski, et al., BSA-seq mapping reveals major QTL for broomrape resistance in four sunflower lines. Mol Breed. 39, 41 (2019).
4. S Arikit, et al., QTL-seq identifies cooked grain elongation QTLs near soluble starch synthase and starch branching enzymes in rice (Oryza sativa L.). Sci Rep 9, 1–10 (2019).
5. Q Chen, et al., Identification and genetic mapping for rht-DM, a dominant dwarfing gene in mutant semi-dwarf maize using QTL-seq approach. Genes Genomics 40, 1091–1099 (2018).
6. J Clevenger, et al., Mapping Late Leaf Spot Resistance in Peanut (Arachis hypogaea) Using QTL-seq Reveals Markers for Marker-Assisted Selection. Front Plant Sci 9, 83 (2018).
7. F Duveau, et al., Mapping small effect mutations in Saccharomyces cerevisiae: impacts of experimental design and mutational properties. G3 (Bethesda) 4, 1205–1216 (2014).
8. J Zhang, DR Panthee, PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinforma. 21, 99 (2020).
9. Z Yang, et al., Mapping of Quantitative Trait Loci Underlying Cold Tolerance in Rice Seedlings via High-Throughput Sequencing of Pooled Extremes. PLOS ONE 8, e68433 (2013).
10. H Takagi, et al., QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J. 74, 174–183 (2013).
11.
PM Magwene, JH Willis, JK Kelly, The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Comput. Biol. 7, e1002255 (2011).
12. Z Lahari, et al., QTL-seq reveals a major root-knot nematode resistance locus on chromosome 11 in rice (Oryza sativa L.). Euphytica 215, 117 (2019).

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.08.430275; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

10_1101-2021_02_08_430280 ----

SALTS – SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite

Mohan V Kasukurthi1,§, Dominika Houserova2,§, Yulong Huang2, Addison A. Barchie3, Justin T. Roberts4, Dongqi Li1, Bin Wu5,*, Jingshan Huang1,2,6,*, and Glen M Borchert2,3,*

1 School of Computing, University of South Alabama, Mobile, AL, 36688, USA
2 Department of Pharmacology, University of South Alabama, Mobile, AL, 36688, USA
3 Department of Biology, University of South Alabama, Mobile, AL, 36688, USA
4 Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
5 First Affiliated Hospital, Kunming Medical University, Kunming, Yunnan, China
6 Qilu University of Technology (Shandong Academy of Science), Jinan, Shandong, China

§ The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
* The authors wish it to be known that, in their opinion, the last three authors should be regarded as joint Corresponding Authors.
To whom correspondence should be addressed: Tel: +1 251 461 1367; Email: borchert@southalabama.edu, Tel: +1 251 460 7612; Email: huang@southalabama.edu, Tel: +86 871 65334106; Email: wu.bin.kmu@qq.com

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.08.430280; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

ABSTRACT

The widespread utilization of high-throughput sequencing technologies has unequivocally demonstrated that eukaryotic transcriptomes consist primarily (>98%) of non-coding RNA (ncRNA) transcripts significantly more diverse than their protein-coding counterparts. ncRNAs are typically divided into two categories based on their length. (1) ncRNAs less than 200 nucleotides (nt) long are referred to as small non-coding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), etc., and the majority of these are thought to function primarily in controlling gene expression. That said, the full repertoire of sncRNAs remains fairly poorly defined, as evidenced by two entirely new classes of sncRNAs only recently being reported, i.e., snoRNA-derived RNAs (sdRNAs) and tRNA-derived fragments (tRFs). (2) ncRNAs longer than 200 nt are known as long ncRNAs (lncRNAs). lncRNAs represent the second largest transcriptional output of the cell (behind only ribosomal RNAs), and although functional roles for several lncRNAs have been reported, most lncRNAs remain largely uncharacterized due to a lack of predictive tools aimed at guiding functional characterizations.
Importantly, whereas high-throughput transcriptome sequencing is now affordable for most active research programs, the tools necessary for interpreting the resulting data typically require significant computational expertise and resources, markedly hindering widespread utilization of these datasets. In light of this, we have developed a powerful new ncRNA transcriptomics suite, SALTS, which is highly accurate, markedly efficient, and extremely user-friendly. SALTS stands for SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite and offers platforms for comprehensive sncRNA and lncRNA profiling and discovery, ncRNA functional prediction, and the identification of significant differential expressions among datasets. Notably, SALTS is accessed through an intuitive Web-based interface, can be used to analyze either user-generated, standard next-generation sequencing (NGS) output file uploads (e.g., FASTQ) or existing NCBI Sequence Read Archive (SRA) data, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides. SALTS constitutes the first publicly available, Web-based, comprehensive ncRNA transcriptomic NGS analysis platform designed specifically for users with no computational background, providing a much needed, powerful new resource capable of enabling more widespread ncRNA transcriptomic analyses. The SALTS WebServer is freely available online at http://salts.soc.southalabama.edu.
GENERAL INTRODUCTION

Cellular metabolism and survival are greatly dependent on how quickly and efficiently the cell can respond to internal and external stimuli. This process often requires tightly orchestrated genome-wide changes in gene expression. With rapid technological advancements in both genomics and transcriptomics, particularly the development of robust deep sequencing, it is ever more apparent that many regulatory non-coding RNAs (ncRNAs) that help coordinate gene expression changes remain elusive and that the networks they form are far more complex than previously thought(1). As many of these are dynamic and their presence or absence is highly conditional (e.g., on environmental stress, disease, or tissue type), their identification poses a challenge and many remain undescribed(2). As such, we have developed a set of guidelines and parameters to help confidently identify and characterize these molecules. Importantly, by implementing alternative strategies for next-generation sequencing (NGS) analysis based on examining conditional changes in expression and/or fragmentation patterns from individual genomic loci rather than depending on pre-existing annotations, we find previously elusive ncRNAs can now be readily identified via our platform. In addition to this, we have also developed an array of downstream analyses to more fully characterize identified ncRNAs and predict their functional roles (e.g., molecular targets). To date, several platforms aimed at either small non-coding RNA (sncRNA) or long non-coding RNA (lncRNA) characterization have been developed(3). Although each of these existing platforms possesses some unique advantages, each also carries its own critical limitations (detailed herein).
That said, to our knowledge, SALTS is the first-ever resource designed to determine ncRNA expressions in both short ncRNA-Seq and standard RNA-Seq datasets and to provide functional predictions for ncRNAs identified in either. Perhaps most importantly, however, in addition to being highly accurate and efficient, SALTS has been developed to require absolutely no computational background in order to enable widespread ncRNA transcriptomic analysis by a much broader community of researchers. Of note, a clear, step-by-step user manual for the SALTS platform is provided in Supplemental Information File 1.

SECTION 1. SALTS Tool for Small non-coding RNA Analysis: SURFR

ncRNAs less than 200 nucleotides (nt) in length are referred to as small non-coding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), etc.(4). One striking example of the regulatory capabilities of sncRNAs comes from a group of small yet potent RNAs called microRNAs (miRNAs). MiRNAs are ~22 nt RNAs excised from longer pre-miRNA hairpins that function through associating with the RNA-induced silencing complex (RISC) in order to bind to the 3’ UTRs of their target mRNAs and repress their translational activities(5). In just the past two decades, thousands of miRNAs have been identified and implicated in regulating cell growth, differentiation, and apoptosis(6), as well as contributing to tumorigenesis(7) and chemoresistance(8). As this group has been thoroughly examined due to its relevance to various types of cancer(9), it is now widely accepted that a single miRNA is capable of altering the expression of whole cohorts of protein coding genes(4). Importantly, studies aimed at evaluating the transcriptomic changes of miRNAs have revealed the existence of miRNA-like fragments derived from other ncRNA biotypes and suggest similar regulatory capacities may be associated with these novel sncRNAs(10–13).
As such, we suggest that the SURFR resource described herein represents an intuitive, high-throughput platform capable of revisiting old NGS datasets and identifying novel, relevant miRNA-like fragments derived from other types of ncRNAs that were previously overlooked.

Comparably sized, miRNA-like fragments excised from many other types of ncRNAs have now been reported, and many of these have been shown to similarly regulate gene expressions and/or chromatin compaction (e.g., piRNAs, rasiRNAs, rRNAs, scRNAs, snoRNAs, snRNAs, RNase P, tRNAs, Y RNAs, and Vault RNAs)(10–13). That said, the expressions and functions of the vast majority of specific sncRNA fragments excised from anything other than annotated miRNAs remain largely undefined, although fragments from snoRNAs (sdRNAs) and tRNAs (tRFs) have recently begun to receive considerably more attention(12, 13). In 2008, Ender et al. were the first to report a small RNA fragment originating from a snoRNA, ACA45(14). Although the principal snoRNA function has long been characterized as guiding rRNA modifications, they showed that this snoRNA-derived RNA (sdRNA) was not only processed by Dicer, like regular miRNAs, but was also capable of silencing the CDC2L6 gene in a miRNA-like manner. Since then, various other studies have described similar fragments arising from other snoRNAs (reviewed in (15)) as well as from other types of ncRNAs.
Notably, tRNA-derived fragments (tRFs) have recently gained attention due to their differential abundance under highly specific conditions, such as developmental stage(16), stress(17), or viral infection(18). Moreover, a regulatory capacity has been observed for some tRFs; Zhou et al., for example, showed that a fragment excised from the 5’ end of tRNA-Glu regulates BCAR3 expression in ovarian cancer(19). It is now clear that ncRNA-derived miRNA-like fragments are precisely processed out of various types of ncRNA transcripts, and that this processing is evolutionarily conserved across species(10–13).

While an increasing body of evidence suggests specifically excised sncRNA fragments from an array of ncRNAs exist and are functionally relevant, there are currently no Web-based, user-friendly resources that offer comprehensive sncRNA fragment profiling and discovery, functional prediction, and the identification of significant differential expressions of fragments among datasets. To address this gap we present SURFR. SURFR refers to our Short Uncharacterized RNA Fragment Recognition tool that identifies all miRNA, snoRNA, and tRNA fragments (as well as fragments from all other ncRNAs annotated in Ensembl) specifically excised in a given transcriptome provided as either a raw user-generated RNA-Seq dataset or NCBI SRR file identifier. In addition, SURFR can also compare individual fragment expressions among as many as 30 distinct datasets (as well as compare the expressions of full length (non-fragmented) sncRNAs).

SURFR Features

• Identifies fragments specifically excised from all miRNAs, tRNAs, rRNAs, scaRNAs, scRNAs, snoRNAs, sRNAs, vault RNAs, and any other ncRNAs annotated in the current Ensembl assembly(20) in individual small RNA-Seq datasets.
• Ten files can be processed at once, and up to 30 individual files can then be compared after processing for ncRNA fragment differential expression analysis.
• SURFR can also determine and compare the expressions of all full length (non-fragmented) sncRNAs in a given transcriptome.
• SURFR results are stored on the server indefinitely, protected by powerful state-of-the-art cryptographic algorithms, and can be instantly recalled by the user via entering their session key in the “Get Results” tab on the SURFR home page.
• OmniSearch-based miRNA analysis of annotated miRNAs(21).
• Direct, intuitive ncRNA visualization of individual ncRNA fragmentation.
• Easily downloadable Excel files of results from a single RNA-Seq file and/or comparisons among files. These files can be filtered (if desired) and list clearly defined, readily understandable, pertinent data (e.g., fragment expression, host gene links, and the exact fragment sequence excised).
• Contains prepopulated ncRNA databases allowing the identification of ncRNA fragments and/or ncRNA expressions in 440 unique animal, plant, fungal, protist, and bacterial species.

In addition, SURFR RNA fragment calls require considerably less processing time than previous ncRNA fragment identification pipelines for two principal reasons. We have: (1) developed a novel alignment strategy significantly faster than traditional methods (e.g., BLAST(22)) and (2) designed a novel method to locate the start and end positions of an ncRNA fragment using wavelets. Full details of these novel computational methodologies are described at length in Supplemental Information File 2.

SURFR Workflow

Figure 1. SURFR workflow. Sequence Input (left).
The user provides up to ten unmodified small RNA-Seq datasets as input. These datasets can all be uploaded directly by the user or downloaded from the NCBI SRA database by entering SRA IDs. sncRNA Fragment Analysis (middle). SURFR identifies all ncRNA fragments (both annotated and novel) and their expressions in up to ten datasets per session. sncRNA Fragment Visualization (top right). Graphics of individual host ncRNAs and the fragments excised (along with the expressions at each nt position) are provided. In addition, tables comparing the expressions of all fragments within individual datasets and comparing fragment expressions across all datasets are generated. SURFR Cross Section Comparison (bottom right). The user can comprehensively compare all fragment expressions identified in up to 30 individual datasets by entering multiple SURFR session IDs from separate analyses. SURFR Input Under “Use SURFR”, the user first selects the organism corresponding to the sequences. SURFR small RNA databases have been prepopulated for 440 species including 286 metazoans, 62 plants, and 92 other fungi, protists, and bacteria. As indicated in Figure 1, the user then provides one to ten small RNA sequencing datasets as input. These datasets can be all uploaded directly by the user, or all downloaded from the NCBI SRA database(23) by entering SRA IDs (e.g., SRR6495855, SRR4217122), or any combination thereof (for example, three datasets uploaded by the user along with seven datasets downloaded from the NCBI SRA database). Importantly, a major strength of SURFR is that users can upload most raw small RNA-Seq files directly as original, unmodified, compressed FASTQ files (as provided by commercial sequencers) with absolutely no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. Allowable formats for uploading are uncompressed, standard FASTA or FASTQ files or any major compression of either. 
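The input rules described above (one to ten datasets per session, each either an uploaded FASTA/FASTQ file or an SRA ID) can be illustrated with a small pre-submission check. This is only a sketch under stated assumptions: the function names, the accepted extension list, and the run-accession pattern are hypothetical choices made for illustration and do not describe how SURFR itself validates input.

```python
import re

MAX_DATASETS = 10  # per-session limit stated in the text

# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
SRA_RUN = re.compile(r"^[SED]RR\d+$")

# Assumption: an illustrative subset of FASTA/FASTQ names and compressions.
BASE_EXT = (".fastq", ".fq", ".fasta", ".fa")
COMPRESSED_EXT = ("", ".gz", ".bz2", ".zip")


def classify_input(item: str) -> str:
    """Return 'sra' or 'file' for one dataset, or raise ValueError."""
    if SRA_RUN.match(item):
        return "sra"
    name = item.lower()
    for comp in COMPRESSED_EXT:
        for base in BASE_EXT:
            if name.endswith(base + comp):
                return "file"
    raise ValueError(f"unrecognized dataset: {item!r}")


def validate_session_inputs(items):
    """Check the 1-10 dataset limit and classify every entry."""
    if not 1 <= len(items) <= MAX_DATASETS:
        raise ValueError(
            f"expected 1 to {MAX_DATASETS} datasets, got {len(items)}"
        )
    return [classify_input(i) for i in items]
```

For instance, `validate_session_inputs(["SRR6495855", "sample.fastq.gz"])` mixes one SRA ID with one uploaded file, which mirrors the "any combination thereof" case in the text.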
SURFR Output

After the user uploads/specifies the small RNA-Seq datasets and clicks the “Let’s SURF” button, the browser is automatically redirected to a report page. Progress indicators for each uploaded dataset are provided under the “Click Here To Choose Your File” drop-down menu at the top of the page (Figure 2A), with datasets that have completed analysis indicated by a checkmark. Following completion of analysis, results for the individual file selected are displayed on the report page and organized into several sections (Figure 2).

Figure 2. SURFR report page. SURFR report example. (A) The “Click Here To Choose Your File” drop-down menu for selecting individual RNA-Seq files. (B) A summary of the overall composition of the selected small RNA-Seq dataset. (C) The “Create ncR Profile” button automatically populates the derived RNA Profile section at the bottom of the page. (D) The “Derived RNA Fragments” window detailing each fragment identified in the individual, selected small RNA-Seq dataset. (E) The user can download an Excel file detailing the full set of information presented in the “Derived RNA Fragments” window by pressing the “Download Results” button. (F) The “Differential Expression Vector (DEV)” window illustrates each nucleotide within a host gene and indicates the fragment called with a blue rectangle. The x-axis represents the position in the ncRNA selected (e.g., miR-29a), and the y-axis depicts the expression levels of the ncRNA at each position.
(G) The “Selected ncRNA & Called RNA Fragment Sequences” window illustrates the full length host ncRNA (miR-29a) highlighting the SURFR-called fragment in yellow. (H) The “Derived RNA Profile” window details each fragment identified in any of the analyzed small RNA-Seq datasets and compares fragment expressions across samples. (I) The “OmniSearch for miRNAs” window lists the top 50 OmniSearch entries (reported targets and PubMed publications) for an individual miRNA selected in the “Derived RNA Profile” window. (J) The “Full Length ncRNA Expression Analyses” button in the upper center of the results page redirects the user to a SURFR window detailing the expressions of all full length sncRNAs in the provided datasets.

A summary of the overall composition of the selected small RNA-Seq dataset, including the file size, total number of reads, number of mapped reads, and time taken for analysis, is included just below the file selection window at the top of the page (Figure 2B). The user can compare fragment expressions across all datasets by pressing the “Create ncR Profile” button that automatically populates the derived RNA Profile section at the bottom of the page (Figure 2C).
The “Derived RNA Fragments” window (Figure 2D) details the Ensembl Gene ID, Ensembl Transcript ID, gene annotation (name), the type of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (RPM), and the nucleotide sequence for each fragment identified in the individual, selected small RNA-Seq dataset. The “Derived RNA Fragments” window is an interactive table that allows users to view, sort, and filter small RNA fragments based on any column value. Users can also view host gene information available at the RNAcentral browser by selecting a fragment in the table and then clicking the “RNAcentral” button on the toolbar(24). The user can download an Excel file detailing the full set of information presented in the “Derived RNA Fragments” window (Figure 2D) for each fragment identified in the individual, selected small RNA-Seq dataset by pressing the “Download Results” button (Figure 2E). An Excel file containing the derived RNA fragment information in its entirety will be automatically downloaded to the user’s computer (Figure 3). Figure 3. Derived RNA Fragments “Download Results” File. The first few rows of an example “Download Results” Excel file detailing the full set of information presented in the “Derived RNA Fragments” window: Ensembl “Gene ID”, Ensembl “Transcript ID”, gene “Annotation” (name), the “Type” of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (RPM), and the nucleotide “Sequence” for each fragment identified in the selected small RNA-Seq dataset. The “Differential Expression Vector (DEV)” window (Figure 2F) details the expressions of each nucleotide within a host gene and indicates the fragment called with a blue rectangle. 
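To make the two quantities reported above more concrete, the following sketch computes a per-nucleotide expression vector for one host ncRNA from aligned read intervals and scales a read count to reads per million (RPM). The data layout (0-based, end-exclusive intervals), the function names, and the choice of denominator passed to the RPM helper are assumptions made for illustration; the procedure SURFR actually uses is described in its Supplemental Information.

```python
def expression_vector(host_length, read_intervals):
    """Count reads covering each nucleotide of a host ncRNA.

    read_intervals: iterable of (start, end) pairs, 0-based, end-exclusive
    (a hypothetical layout chosen for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (host_length + 1)
    for start, end in read_intervals:
        start = max(0, start)
        end = min(host_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:host_length]:
        running += delta
        depth.append(running)
    return depth


def reads_per_million(read_count, library_size):
    """Scale a raw read count to reads per million (RPM)."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return read_count * 1_000_000 / library_size
```

With three hypothetical reads on a 10-nt host, `expression_vector(10, [(2, 6), (3, 6), (3, 7)])` gives a plateau of depth 3 over positions 3 to 5, which is the kind of region a fragment call would enclose.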
The x-axis in the graph shown in Figure 2F represents the position in the ncRNA selected (miR-29a), and the y-axis represents the expression levels of the ncRNA at each position. The user can also interactively view the expression at each individual nucleotide by panning over the image, zoom in or out using the buttons on the top right, and/or download DEV image files and an Excel file detailing expression at each nucleotide by selecting the menu button on the top right of the window. The “Selected ncRNA & Called RNA Fragment Sequences” window (Figure 2G) illustrates the full length host ncRNA highlighting the SURFR-called fragment in yellow just as depicted in the preceding DEV window (Figure 2F). The “Derived RNA Profile” window (Figure 2H) details the Ensembl Gene ID, Ensembl Transcript ID, gene annotation (name), the type of gene a fragment was excised from, the average start and end positions of a fragment within its host gene (to be considered the same fragment, start and stop positions had to agree within 5 nt) with the corresponding nucleotide sequence for each “average” fragment listed, the start and end positions of a fragment within its host gene along with the fragment’s expression (RPM) in each individual small RNA-Seq dataset, and finally, the % standard deviation of the expression of individual fragments(20). Importantly, the full list of all fragments identified in any of the datasets is presented.
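The cross-dataset summary described above can be sketched as follows: calls from different datasets are treated as the same fragment when both ends agree within 5 nt, and the summary reports averaged positions plus the spread of expression. Several details here are assumptions rather than SURFR's documented behavior: the tuple layout, comparing every call against the first one, and reading "% standard deviation" as the sample standard deviation expressed as a percentage of the mean.

```python
from statistics import mean, stdev

TOLERANCE_NT = 5  # start/stop agreement stated in the text


def same_fragment(a, b, tol=TOLERANCE_NT):
    """True if two (start, end, ...) calls agree within `tol` nt at both ends."""
    return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol


def profile_fragment(calls):
    """Summarize one fragment called in several datasets.

    calls: list of (start, end, rpm) tuples, one per dataset, for the same
    host transcript (hypothetical layout). Every call is compared with the
    first one; a fuller implementation would need a clustering rule.
    """
    ref = calls[0]
    if not all(same_fragment(ref, c) for c in calls):
        raise ValueError("calls do not describe the same fragment")
    rpms = [c[2] for c in calls]
    avg = mean(rpms)
    # Assumption: "% standard deviation" = sample SD as a percent of the mean.
    pct_sd = 100 * stdev(rpms) / avg if len(rpms) > 1 and avg else 0.0
    return {
        "avg_start": round(mean([c[0] for c in calls])),
        "avg_end": round(mean([c[1] for c in calls])),
        "mean_rpm": avg,
        "pct_sd": pct_sd,
    }
```

A fragment called at 41-62 in one dataset and 43-64 in another would be merged under this rule, whereas calls at 41-62 and 50-62 would be kept apart because the start positions differ by more than 5 nt.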
Figure 3 (example rows):

Gene ID            Transcript ID      Annotation  Type    Fragment(start-end)  Expression(RPM)  Sequence
ENSG00000199135.1  ENST00000362265.1  MIR101-1    miRNA   46 - 66              68602            TACAGTACTGTGATAACTGA
ENSG00000284032.1  ENST00000362111.4  MIR29A      miRNA   41 - 62              64394            TAGCACCATCTGAAATCGGTT
ENSG00000207752.1  ENST00000385019.1  MIR199A1    miRNA   46 - 67              34071            ACAGTAGTCTGCACATTGGTT
ENSG00000207638.1  ENST00000384906.1  MIR99A      miRNA   12 - 33              33760            AACCCGTAGATCCGATCTTGT
ENSG00000288462.1  ENST00000673161.1  MIR23A      miRNA   44 - 63              13936            ATCACATTGCCAGGGATTT
ENSG00000198973.4  ENST00000362103.4  MIR375      miRNA   39 - 60              6214             TTTGTTCGTTCGGCTCGCGTG
ENSG00000199085.3  ENST00000362215.3  MIR148A     miRNA   43 - 64              3774             TCAGTGCACTACAGAACTTTG
ENSG00000199047.3  ENST00000362177.3  MIR378A     miRNA   42 - 63              2166             ACTGGACTTGGAGTCAGAAGG
ENSG00000207713.3  ENST00000384980.3  MIR200C     miRNA   43 - 65              1713             TAATACTGCCGGGTAATGATGG
ENSG00000277864.1  ENST00000516881.1  SCARNA15    scaRNA  65 - 86              1360             AGGTAGATAGAACAGGTCTTG
ENSG00000277947.1  ENST00000619178.1  SNORD3D     snoRNA  194 - 217            1304             GGAGAGAACGCGGTCTGAGTGGT

The “Derived RNA Profile” window is an interactive table that allows users to view, sort, and filter small RNA fragments based on any column value. Users can also view host gene information available at the RNAcentral browser(24) by selecting a fragment in the table and then clicking the “Search ncRNA In RNAcentral” button on the toolbar. The user can also download an Excel file detailing the full set of information presented in the “Derived RNA Profile” window by pressing the “Generate Report” button at the top right of the window.
An Excel file containing the derived RNA profile information in its entirety will be automatically downloaded to the user’s computer (Figure 4). In addition, Excel file reports can be downloaded following the application of specific filters in the “Derived RNA Profile” window (e.g., only snoRNA fragments can be included or excluded).

Figure 4. Derived RNA Profile “Generate Report” File. The first few rows of an example “Generate Report” Excel file detailing the full set of information presented in the “Derived RNA Profile” window.

The “OmniSearch for miRNAs” window (Figure 2I) returns the top 50 OmniSearch entries(21) (reported targets and PubMed entries) for an individual miRNA selected in the preceding “Derived RNA Profile” window. Finally, when desired, the “Full Length ncRNA Expression Analyses” button (Figure 2J) redirects the user to a SURFR window detailing the expressions of all full length sncRNAs in the provided datasets regardless of fragmentation. Importantly, all pertinent features (e.g., expression table downloads) described above are similarly available for full length sncRNA analyses via this resource.

SURFR Example Use/Case Study

SURFR allows users to profile and compare the expressions of sncRNA fragments (both annotated and novel) across multiple small RNA-Seq experiments in order to identify the top sncRNA fragments significantly differentially expressed in a particular disease, tissue, developmental stage, etc. Our group’s interest in fragments excised from ncRNAs other than miRNAs initially arose from an attempt to identify novel miRNA contributors to breast cancer(12). For this work, we performed small RNA sequencing on several breast cancer cell lines, and while we failed to identify any (traditional) miRNAs of interest, we did identify a snoRNA fragment (which we named sdRNA-93) that was specifically and significantly overexpressed in MDA-MB-231 cells, a widely studied model of a highly invasive and metastatic human cancer.
Next, as we found sdRNA-93 to be significantly overexpressed in these cells (≥75x compared to controls), we decided to determine if sdRNA-93 functionally contributed to the malignant phenotype. Stringently testing sdRNA-93 inhibitors and mimics in MDA-MB-231 cells across multiple time points revealed that sdRNA-93 gain- and loss-of-function had profound effects on invasion within standard matrix-based (matrigel) chemoattractant assays. Remarkably, sdRNA-93 loss-of-function reduced cell invasion by >90% at 48 hours compared to control cells, whereas sdRNA-93 gain-of-function enhanced cell invasion by >100%. Thus, we showed that a single sdRNA (sdRNA-93) strongly and selectively regulates invasion of MDA-MB-231 cells. These findings link a specific sdRNA (sdRNA-93) to an aggressive malignant phenotype (invasion) within an established cancer cell model that is widely used to study invasive behavior. We next employed a BLAST-based methodology to determine sdRNA-93 expressions across small RNA-Seq datasets corresponding to 115 unique breast cancer patients and detected strong overexpression of sdRNA-93 in 92.8% of tumors classified as Luminal B Her2+, compared to normal tissue controls (extremely low expression) and other breast cancer subtypes (modest expression levels of 30-40%). Thus, this work represented the first evidence demonstrating that sdRNAs that regulate specific malignant properties are differentially expressed within divergent molecular subtypes of human breast cancer(12).
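The kind of two-sample comparison behind a fold-change figure such as the one quoted above can be sketched as a ranking of fragments by log2 fold change between two datasets. This is a generic illustration, not the authors' BLAST-based procedure or SURFR's internal method: the function names, the pseudocount, and the input layout are assumptions.

```python
import math


def log2_fold_change(rpm_a, rpm_b, pseudocount=1.0):
    """log2 ratio of expression in dataset B over dataset A.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for fragments absent from one dataset.
    """
    return math.log2((rpm_b + pseudocount) / (rpm_a + pseudocount))


def rank_by_fold_change(profile):
    """Sort fragments by absolute log2 fold change, largest first.

    profile: dict mapping fragment name -> (rpm_a, rpm_b).
    """
    scored = {name: log2_fold_change(a, b) for name, (a, b) in profile.items()}
    return sorted(scored.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Ranking on the absolute value keeps strongly depleted fragments near the top alongside strongly enriched ones; a real analysis would also need replicates and a significance test, which this sketch omits.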
Importantly, our initial BLAST-based identification of sdRNA-93 as being significantly overexpressed in MDA-MB-231 cells was highly labor intensive, taking days to complete. In contrast, when we uploaded our original unmodified FASTQ sequencing files to SURFR, sdRNA-93 was readily identified as the most highly differentially expressed snoRNA fragment between our two cancer cell lines, taking just 7.9 minutes (Figure 5).

Figure 5. SURFR identification of sdRNA-93. (A) “Derived RNA Fragments” window showing SNORD93 derived sdRNA-93 was identified as the second most highly expressed sdRNA in the highly invasive breast cancer cell line MDA-MB-231. (B) Alignment among the human genome (GRCh38 Ch7:22856601:22856699:1) (top), snoRNA-93 (ENSG00000221740) (middle), and next generation small RNA sequence read (bottom) obtained by Illumina sequencing of MDA-MB-231 RNA as originally described in (12). All sequences are in the 5′ to 3′ direction. An asterisk indicates base identity between the snoRNA and genome. Vertical lines indicate identity across all three sequences. (C) “Derived RNA Profile” window comparing small RNA-Seq results for MCF-7 and MDA-MB-231 cells. Note SNORD93 derived sdRNA-93 was identified as the most significantly differentially expressed sdRNA between the weakly and highly invasive breast cancer cell lines.

SURFR Comparison to other Existing Tools

Numerous characterizations of significant regulatory roles for sncRNA fragments excised from various types of ncRNAs other than miRNAs have now been reported(10–13).
As new high-throughput small RNA sequencing strategies(25) continue to make small RNA-Seq faster and less expensive, there is a clear need for tools capable of digesting large amounts of small RNA-Seq data in order to detect and characterize all small RNA genes including specifically-excised small RNA fragments. Most existing tools (e.g., miRDeep(26), miRSpring(27), miRanalyzer(28), etc.) focus almost exclusively on miRNAs and/or only evaluate existing sncRNA annotations and are not capable of fully defining small RNA-Seq ncRNA fragment profiles and differences among these datasets (sRNAnalyzer(29), Oasis2.0(30), SPAR(31), etc.). That said, most existing tools capable of characterizing novel ncRNA fragments and their expressions, such as FlaiMapper(32), SPORTS(33), and DEUS(34), require fairly extensive computational expertise for utilization, support only pre-aligned file inputs (BAM), and/or require standalone installation (Table 1). As such, we have designed SURFR to address the need for a user-friendly, Web-based, comprehensive small RNA fragment tool requiring no computational expertise to utilize. In stark contrast to most existing platforms, SURFR identifies fragments excised from all types of ncRNAs annotated in Ensembl(20) in a given transcriptome provided as either a raw user-generated RNA-Seq dataset or NCBI SRA file. In addition, SURFR can compare individual fragment expressions among as many as 30 distinct datasets, and we have included ncRNA databases for 440 unique animal, plant, fungal, protist, and bacterial species. Importantly, there are currently no Web-based, user-friendly resources that offer comprehensive sncRNA fragment profiling and discovery, functional prediction, and the identification of significant differential expressions among datasets comparable to SURFR. 
Although two platforms, sRNA toolbox(35) and sRNAtools(36), do offer many of SURFR’s features, SURFR distinguishes itself by providing significantly more intuitive, versatile, and user-friendly results generated in less than 10% of the time required for data upload and processing by these tools. That said, because SURFR was developed specifically for ncRNA fragment identification, it does not provide expression analysis for full length ncRNAs.

Table 1. SncRNA analysis platform feature comparison. Various features offered by SURFR were compared to other existing tools including sRNA toolbox(35), Oasis2.0(30), sRNAtools(36), CPSS2.0(37), SPAR(31), sRNAnalyzer, SPORTS1.0(33), DEUS(34), FlaiMapper(32), and featureCounts(38). Features examined were: “Online,” if tool is available online; “Input,” form of input RNA-Seq dataset - either raw (direct NGS output) or pre-processed (e.g., requires BAM file); “Clear, User-friendly Results/Output,” if interactive and user-friendly results are generated directly; “Library Oligo Sequences Req,” if user knowledge of NGS oligo sequences is required; “TCGA, SRA, GEO, or Encode Input,” if publicly available RNA-Seq datasets can be specified for examination based on identifier alone; “Known Full Length sncRNA Expressions,” detection and quantification of known sncRNAs; “Novel Full Length sncRNA Expressions,” detection and quantification of novel sncRNAs; “Novel sncRNA Fragment Discovery,” detection and quantification of novel ncRNA fragments; “Differential Expression,” ability of the tool to integrate expression data from multiple files (“sRNAde” denotes that expression analyses can be performed in parallel); and “Species,” number of species available for analysis. “user” denotes that the tool has the capacity to perform a given task but requires additional user input or user-directed changes to the program’s code and/or advanced settings.

Notably, as a verification of SURFR’s accuracy, we recreated an analysis of ten prostate cancer small RNA-Seq files previously performed using FlaiMapper(39). FlaiMapper-based ncRNA fragment discovery of these ten files originally identified 147 snoRNA-derived fragments that were 18 to 35 nt in length and expressed at > 10 RPM. Similarly, SURFR analysis of the same files identified 110 snoRNA-derived fragments expressed at > 10 RPM, and strikingly, 104 of these fragments were nearly identically identified (+/- 2 nt) by both methods. Notably, we find that the majority of the FlaiMapper-identified sdRNA fragments not present in the SURFR calls were excluded based on SURFR’s 100% sequence identity requirement (in contrast to FlaiMapper’s 2 nt mismatch allowance).

SECTION 2. SALTS Tool for Long non-coding RNA Analysis: LAGOOn

ncRNAs longer than 200 nt in length are known as long ncRNAs (lncRNAs). This distinction, while somewhat arbitrary and based on technical aspects of RNA isolation methods, serves to distinguish lncRNAs from miRNAs and other sncRNAs. lncRNA loci are present in large numbers in eukaryotic genomes, typically comparable to or exceeding the number of protein-coding genes. Many lncRNAs possess features reminiscent of protein-coding genes, such as having a 5′ cap and undergoing alternative splicing(40). In fact, many lncRNA genes have two or more exons(40), and about 60% of lncRNAs have polyA+ tails.
In addition, although numerous long intergenic RNAs (lincRNAs)(41) including eRNAs from gene-distal enhancers have recently been reported(42), the majority of lncRNA genes identified to date are located within 10 kb of protein-coding genes and typically found to be antisense to coding genes or intronic(43). That said, many lncRNAs are expressed at relatively low levels in highly specific cell types(40) both explaining why the majority of lncRNAs were thought to be “transcriptional noise” until quite recently and also representing perhaps the single largest challenge in terms of lncRNA discovery and characterization. NGS has now identified tens of thousands of lncRNA loci in humans alone with the number of lncRNAs linked to human diseases quickly increasing. That said, lncRNA functionality is highly contentious, and the number of experimentally characterized and / or disease-associated lncRNAs remain in the low hundreds, or ≤1% of identified loci(44). This has led to a burgeoning focus on elucidating the molecular mechanisms that underlie lncRNA functions(45). Although only a minority of identified lncRNAs have been functionally characterized, several distinct modes of action for lncRNAs have now been described, including functioning as signals, decoys, scaffolds, guides, enhancer RNAs, and short peptide messages(46)(47). Importantly, however, there are currently no Web-based, user-friendly resources that offer comprehensive lncRNA profiling, functional prediction, and the identification of significant differential expressions among datasets. To address this gap we present LAGOOn. LAGOOn refers to our Long-noncoding and Antisense Gene Occurrence and Ontology tool that identifies all lncRNAs expressed in a given human transcriptome from either a user-provided RNA-Seq dataset or publically available SRA file(23). In addition, LAGOOn can also compare lncRNA expressions among datasets and predict likely functional roles for individual lncRNAs. 
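LAGOOn's comparisons among datasets are reported later in this section as per-dataset expression in RPM together with a “% Standard Deviation” column. The manuscript defines neither quantity, so the following Python sketch rests on two assumptions: RPM is taken to mean reads per million total reads, and % standard deviation is taken to be the population standard deviation expressed as a percentage of the mean. The second assumption is at least consistent with the example report, in which exons at 1, 2322, and 3 RPM are listed with a value of 141.06. The function names are illustrative and are not part of LAGOOn.

```python
from statistics import mean, pstdev
from typing import Sequence


def rpm(read_count: int, total_reads: int) -> float:
    """Reads per million: scale a raw count by library size (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return read_count * 1_000_000 / total_reads


def percent_sd(values: Sequence[float]) -> float:
    """Population standard deviation as a percentage of the mean (assumed definition)."""
    avg = mean(values)
    if avg == 0:
        return 0.0
    return pstdev(values) / avg * 100


# Expression of one exon across three datasets, in RPM
print(round(percent_sd([1, 2322, 3]), 2))
```

A ratio of this kind is unitless, which makes exons with very different absolute expression levels comparable in a single sortable column.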
LAGOOn Features

• Direct, intuitive visualization of significant lncRNA expressions. Determines the expressions of all lncRNAs annotated in the current Ensembl assembly(20) in individual human RNA-Seq datasets.
• Identifies differentially utilized lncRNA exons.
• Up to three files can be processed at once, and up to 15 individual files can then be compared after processing for lncRNA differential expression analysis.
• LAGOOn results are stored on the server indefinitely, protected by powerful state-of-the-art cryptographic algorithms, and can be instantly recalled by entering a previous session key in “Access Your Results” on the LAGOOn home page.
• Easily downloadable Excel files of results profiling a single RNA-Seq file and/or comparisons among various files. These files can be filtered (if desired) and list clearly defined, readily understandable, and pertinent data (e.g., expression, lncRNA Ensembl ID, etc.).
• Detailed, comprehensive lncRNA functional prediction detailing:
  o If a lncRNA serves as a host for a sncRNA(45).
  o Significant potentials for a lncRNA to serve as a specific miRNA sponge(48).
  o All overlaps between a given lncRNA and annotated enhancers(49).
  o Significant potentials for lncRNAs to serve as naturally occurring antisense silencers for genes located on the strand opposite to themselves(50).
  o Associations between individual lncRNAs and ribosomes suggesting microprotein production(51).
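Several of the functional predictions listed above (enhancer overlaps, antisense genes on the opposite strand) reduce to asking whether two genomic intervals intersect. The sketch below shows one minimal way to express that test in Python. It is not LAGOOn's implementation; the `Feature` type, the coordinates, and the names in the usage example are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1; 0 for unstranded features such as enhancers


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlapping(query: Feature, features: List[Feature]) -> List[Feature]:
    """All features that even partially overlap the query, on either strand."""
    return [f for f in features if overlaps(query, f)]


def antisense_to(query: Feature, features: List[Feature]) -> List[Feature]:
    """Overlapping stranded features annotated on the strand opposite to the query."""
    return [
        f for f in overlapping(query, features)
        if f.strand != 0 and f.strand == -query.strand
    ]


# Hypothetical coordinates, for illustration only
lnc = Feature("lncRNA_X", "chr1", 1_000, 5_000, +1)
annotations = [
    Feature("enhancer_A", "chr1", 4_500, 6_000, 0),
    Feature("gene_B", "chr1", 2_000, 3_000, -1),
    Feature("gene_C", "chr2", 2_000, 3_000, -1),
]
print([f.name for f in overlapping(lnc, annotations)])
print([f.name for f in antisense_to(lnc, annotations)])
```

A linear scan is adequate for a single selected lncRNA; for genome-wide queries an interval tree or sorted sweep would be the usual substitute.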
Importantly, LAGOOn is the first Web-based, user-friendly resource that offers real-time lncRNA profiling, the identification of significant differential expressions among datasets, and an array of functional prediction assessments beyond standard mRNA interaction characterizations. Full details of these novel computational methodologies are described in length in Supplemental Information File 3. LAGOOn Workflow Figure 6. LAGOOn workflow. Sequence Input (left). The user provides up to two unmodified RNA-Seq files and one Ribo-Seq dataset (optional) as input. These datasets can all be uploaded directly by the user or downloaded from the NCBI SRA database by entering SRA IDs. lncRNA Exon Analysis (middle). LAGOOn enumerates all annotated lncRNA expressions in up to three datasets per session. lncRNA Expression and Functional Prediction Visualization (top right). An interactive table is generated comparing the expressions of all exons within individual datasets and comparing exon expressions across all datasets. Tables indicating putative lncRNA functions are also depicted. LAGOOn Cross Section Comparison (bottom right). The user can comprehensively compare all exon expressions identified in up to 15 individual datasets by entering multiple LAGOOn session IDs from separate analyses. LAGOOn Input As summarized in Figure 6, after selecting “Start New Analysis” on the LAGOOn homepage, the browser is redirected to the “Data Transfer Options” page where the user provides one or two RNA sequencing datasets as input and is given the chance to provide an optional, additional input, i.e., a Ribo-Seq dataset for determining microprotein coding potentials. These datasets can all be uploaded directly by the user, or all downloaded from the NCBI SRA database(23) by entering SRA IDs (e.g., SRR9729388, SRR6290085), or any combination thereof. 
Importantly, a major strength of LAGOOn is that users can upload most raw RNA-Seq files directly as original, unmodified, compressed FASTQ files (as provided by commercial sequencers) with absolutely no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. There is no limit on the size of SRA files, whereas individual user-uploaded files are limited to 18 GB regardless of format; extremely large sequencing files exceeding this size can be converted to FASTA format and then compressed prior to being uploaded if necessary. Allowable upload formats are uncompressed, standard FASTA or FASTQ files or any major compression of either. In addition, the LAGOOn homepage provides links to: (1) “Access Your Results,” where users can retrieve results from previous sessions by providing a session key and then compare results from up to five separate sessions; (2) “LAGOOn Search,” where users can obtain detailed, comprehensive functional predictions for individual lncRNAs; and (3) “Download Our Databases,” where users can download databases containing all the lncRNAs and/or lncRNA exons employed by LAGOOn.

LAGOOn Output

After the user uploads/specifies the RNA-Seq datasets, the browser is automatically redirected to the LAGOOn report page (Figure 7). Initially, a summary of the size and composition of individual RNA-Seq datasets, the number of lncRNAs expressed in a dataset, and the top ten most highly expressed lncRNAs in the specified dataset are shown.
Following selection of either one or all of the RNA-Seq files and the Ribo-Seq file (if included) analyzed from the file selection toolbar (Figure 7A), results for the file(s) selected are then displayed on the report page under the “Results” tab (Figure 7B) and organized into several distinct sections.

Figure 7. LAGOOn report page. LAGOOn report example. (A) The file selection toolbar contains drop-down menus for selecting individual RNA-Seq and Ribo-Seq files. (B) The toolbar allowing selection of either the “Summary” or “Results” tab. (C) The lncRNA expression window displays a filterable table of all lncRNA exons expressed in any of the user-provided files. Full length lncRNA sequence, individual exon sequence, or Ensembl lncRNA gene information is obtained by selecting an exon in the table and then clicking the “lncRNA Sequence,” “Exon Sequence,” or “Search lncRNA in Ensembl” button on the toolbar. (D) The “Generate Report” button creates and automatically downloads an Excel file detailing the full set of information presented in the expression table window. (E) The “Exon Sponge to (miRNA)” window lists all miRNA complementarities of ten base pairs or greater occurring within the selected lncRNA exon. (F) The “lncRNA host to” window lists all full length ncRNAs contained in any of the selected lncRNA’s exons. (G) The “Enhancer” window lists all overlaps between a selected lncRNA and GeneHancer-annotated enhancers (as well as genes with expression linked to individual enhancers). (H) The “lncRNA Overlapping Genes” window lists all genes even partially overlapping a lncRNA locus on either strand.

The table presented in Figure 7C details the Ensembl Gene ID, Ensembl Exon ID along with gene annotation (name), and expressions (RPM) of all lncRNA exons in each individual RNA-Seq dataset, and finally, the % standard deviation of the expression of individual exons(20). Importantly, the full list of all exons found to be expressed in any of the datasets is presented. In addition, the expression table is interactive and allows the user to view, sort, and filter based on any column value by clicking the “Filter Table” button on the toolbar. Users can also obtain a full length lncRNA sequence, a specific exon sequence, or view the lncRNA gene information available at Ensembl by selecting an exon in the table and then clicking the “lncRNA Sequence,” “Exon Sequence,” or “Search lncRNA in Ensembl” button on the toolbar. The user can also download an Excel file detailing the full set of information presented in the expression table window by pressing the “Generate Report” button at the top right of the window (Figure 7D). An Excel file containing the expression table window information in its entirety will be automatically downloaded to the user’s computer (Figure 8). In addition, refined Excel file reports can be downloaded following the application of specific filters (e.g., lncRNAs with RPM > 1 in the Ribo-Seq dataset).

Figure 8. lncRNA expression table “Generate Report” File. The first few rows of an example “Generate Report” Excel file detailing the full set of information presented in the lncRNA expression window.

Finally, putative functional roles for lncRNAs/lncRNA exons selected in the expression table are depicted in Figure 7E-H.
As lncRNAs frequently function as miRNA sponges that directly basepair with and effectively inactivate mature miRNAs(48), the “Exon Sponge to (miRNA)” window lists all miRNA complementarities of ten base pairs or greater occurring within the selected lncRNA exon (Figure 7E). Next, as numerous lncRNAs have been shown to encode sncRNAs (e.g., miRNAs and snoRNAs) in their exonic sequences, and sncRNA expression often relies on excision from the host lncRNA transcript(45), the “lncRNA host to” window lists all full length ncRNAs contained in any of the selected lncRNA’s exons (Figure 7F). In addition, as several lncRNAs have been reported to function through regulating the accessibility of transcriptional enhancers overlapping their genomic loci(49), all overlaps between a selected lncRNA and GeneHancer(52) annotated enhancer (and genes with expression linked to individual enhancers) are detailed in the “Enhancer” window (Figure 7G). And finally, in addition to lncRNA exonic sequences serving as sncRNA hosts, many sncRNAs are processed from lncRNA introns(45). Furthermore, many lncRNAs serve as naturally occurring antisense silencers of genes located on the strand opposite to themselves(50). For both of these reasons, as well as other potential regulatory relationships, all genes overlapping a lncRNA locus on either the positive or negative strand are detailed in the “lncRNA Overlapping Genes” window (Figure 7H). Importantly, a comprehensive report detailing each of the functional predictions is also available for individual lncRNAs by selecting the “LAGOOn Search” button on the homepage after entering a lncRNA Ensembl gene identifier. Notably, this search functionality does not require full LAGOOn analysis. 
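The “Exon Sponge to (miRNA)” criterion described above, complementarity of ten base pairs or greater, can be illustrated with a short search. The following Python sketch assumes perfect, contiguous Watson-Crick pairing (no G:U wobble, mismatches, or bulges), which the manuscript does not specify. The function names and the toy sequences are hypothetical, and this is not LAGOOn's scoring method.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement in the RNA alphabet (T is read as U)."""
    return seq.upper().replace("T", "U").translate(_COMPLEMENT)[::-1]


def complementary_sites(exon: str, mirna: str, min_len: int = 10) -> List[Tuple[int, int]]:
    """Return (0-based exon start, length) of every maximal exon stretch that is
    perfectly complementary to a contiguous part of the miRNA."""
    exon = exon.upper().replace("T", "U")
    # A site that pairs with the miRNA reads, 5' to 3', as its reverse complement.
    target = reverse_complement(mirna)
    hits = []
    # prev[j]: length of the longest common suffix of exon[:i-1] and target[:j]
    prev = [0] * (len(target) + 1)
    for i in range(1, len(exon) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if exon[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                # Report only when the match cannot be extended to the right
                ends_here = (
                    i == len(exon) or j == len(target) or exon[i] != target[j]
                )
                if ends_here and cur[j] >= min_len:
                    hits.append((i - cur[j], cur[j]))
        prev = cur
    return hits


# Toy example: embed a 12-nt site complementary to part of a made-up miRNA
toy_mirna = "AUGCCGUAAGCUUGACCGAUAC"
site = reverse_complement(toy_mirna)[4:16]
toy_exon = "AAAAA" + site + "AAAAA"
print(complementary_sites(toy_exon, toy_mirna))
```

The dynamic-programming scan is quadratic in sequence length, which is acceptable for one exon against one roughly 22-nt miRNA; screening every exon against every annotated miRNA would normally use a k-mer index instead.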
lncRNA | Exon | SRR8730291 (RPM) | SRR6290085 (RPM) | SRR9729388 (RPM) | % Standard Deviation
ENSG00000230590 | ENSE00003874886_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003858311_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003847528_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000225470 | ENSE00003808225_JPX_1_JPX transcript, XIST activator [HGNC:37191]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003241026_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003429313_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003861803_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000230590 | ENSE00003849720_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23
ENSG00000283117 | ENSE00003789008_AC004949.1_-1_novel transcript_lncRNA | 1 | 35 | 19 | 76.18
ENSG00000284722 | ENSE00003811861_AP003175.1_-1_novel transcript_lncRNA | 1 | 118 | 14 | 118.23
ENSG00000259234 | ENSE00002540221_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57
ENSG00000259234 | ENSE00002554868_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57
ENSG00000259234 | ENSE00002573893_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57
ENSG00000259234 | ENSE00002557951_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57
ENSG00000259234 | ENSE00002541714_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57
ENSG00000213904 | ENSE00003224994_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31
ENSG00000213904 | ENSE00003062809_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31
ENSG00000213904 | ENSE00001552276_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31
ENSG00000213904 | ENSE00002995358_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31
ENSG00000251259 | ENSE00002021304_AC004069.1_-1_novel transcript_lncRNA | 1 | 65 | 6 | 120.52
ENSG00000272430 | ENSE00003695861_LINC02637_1_long intergenic non-protein coding RNA 2637 [HGNC:54120]_lncRNA | 1 | 9 | 5 | 65.52
ENSG00000237491 | ENSE00002920037_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86
ENSG00000237491 | ENSE00001642276_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86
ENSG00000237491 | ENSE00001741526_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86
ENSG00000251562 | ENSE00003717116_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06
ENSG00000251562 | ENSE00002080048_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06
ENSG00000251562 | ENSE00003742980_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06
ENSG00000251562 | ENSE00003753954_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2146 | 3 | 141.03

LAGOOn Example Use/Case Study

ncRNAs are becoming major players in the pathogenesis of diseases such as cancer. Metastasis Associated Lung Adenocarcinoma Transcript 1 (MALAT1) is a nuclear enriched lncRNA that is generally overexpressed in patient tumors and metastases. Overexpression of MALAT1 has been shown to be positively correlated with tumor progression and metastasis in a large number of tumor types including breast tumors.
Furthermore, an earlier study evaluating breast cancer patient samples showed that MALAT1 expression is higher in breast tumors as compared to adjacent normal tissues (reviewed in (53)). As such, we elected to compare lncRNA expressions in a breast cancer cell line (MDA-MB-231) RNA-Seq dataset (SRR12101868) with those of a human bone tissue RNA-Seq dataset (SRR12101882) in order to identify significantly differentially expressed lncRNAs and their putative functions, including screening a Ribo-Seq of the BRX-142 cell line (SRR12101882) established from circulating tumor cells collected from a woman with advanced HER2-negative breast cancer(54) for potential MALAT1 microprotein production. Strikingly, the total time for download and analysis of these three NGS datasets by LAGOOn was only 3 min 52 sec. More importantly, however, LAGOOn identified MALAT1 as the most highly expressed lncRNA in MDA-MB-231 breast cancer cells (Figure 9). In agreement with previous demonstrations that MALAT1 functions (in part) as a miR-145-5p sponge in numerous malignancies including breast cancer(55), LAGOOn identified MALAT1 as a probable miR-145-5p sponge (Figure 9A, top right). In addition, LAGOOn also found that MALAT1 overlaps with, and may therefore potentially be involved in regulating, several distinct genomic enhancers and sncRNAs (Figure 9A, lower windows). Finally, similarly in agreement with previous analyses(56), LAGOOn also identified MALAT1 as one of three lncRNAs significantly represented in the BRX-142 cell Ribo-Seq dataset, strongly suggesting MALAT1 encodes at least one micropeptide.

Figure 9. LAGOOn identification of MALAT1 overexpression in breast cancer. (A) The “Results” window showing MALAT1 was identified as the most highly expressed lncRNA in the highly invasive breast cancer cell line MDA-MB-231 (SRR12101868). (B) The “Generate Report” Excel file showing MALAT1 (yellow) was identified as the most highly expressed lncRNA in MDA-MB-231 cells. Both windows indicate MALAT1 is present in the breast cancer Ribo-Seq dataset (SRR10883792).

LAGOOn Comparison to other Existing Tools

LncRNAs represent the largest single class of ncRNAs. However, unlike sncRNAs, which are thought to mostly function in gene regulation through complementary basepairing with other RNAs, the mechanisms through which lncRNAs function are highly diverse. The relatively low expression and tissue specificity of lncRNAs have significantly hindered lncRNA discovery, our understanding of lncRNA regulation, and characterizations of lncRNA functional mechanisms to date(44)(45)(46)(47). That said, initiatives such as ENCODE(57), FANTOM(58), and GENCODE(40) have now predicted over 60,000 human lncRNAs and identified associations between many of these and specific diseases. Thus far, however, only a handful of these lncRNAs have been examined in the literature, with even fewer being assigned any specific mechanistic function. Expression data often constitute the first level of information of use in studying lncRNAs, as differential expression analysis is clearly of value in prioritizing candidates for further examination. Differential expression, however, provides little in the way of functional insights. That said, the majority of computational platforms currently available are primarily aimed at either detecting and quantifying lncRNAs (e.g., lncRNA-screen(59), RNA-CODE(60), lncRScan(61), etc.)
or predicting lncRNA:mRNA and/or lncRNA:protein interactions (e.g., PLAIDOH(62), LncRNA2Function(63), circlncRNAnet(64), etc.) (Table 2). In contrast, LAGOOn was designed to comprehensively evaluate lncRNA expression as well as the potential for lncRNAs to function through other characterized mechanisms including serving as sncRNA hosts, miRNA sponges, antisense RNAs, microprotein transcripts, and/or regulators of genomic enhancers (as well as providing links to predicted lncRNA:mRNA and/or lncRNA:protein interactions). In short, LAGOOn wholly distinguishes itself from available tools by filling a major gap in available lncRNA functional prediction platforms and eliminating the need of the user to switch platforms during the analysis process. Table 2. lncRNA analysis platform feature comparison. Various features offered by LAGOOn were compared to other existing tools including lncRNA-screen(59), RNA-CODE(60), lncRScan(61), iSeeRNA(65), Annocript(66), UClncR(67), LncRNA2Function(63), and circlncRNAnet(64). Features examined were: “Online”, if tool is available online; “Input”, form of input RNA-Seq dataset - either raw (direct NGS output) or pre-processed (e.g., requires BAM file); “TCGA, SRA, or GEO”, if publically available RNA-Seq datasets can be specified for examination based on identifier alone; “Known lncRNA”, detection and quantification of known lncRNAs; “Novel lncRNA”, detection and quantification of novel lncRNAs; “Differential Expression”, ability of the tool to integrate expression data from multiple files; “ChIP-Seq / Ribo-Seq”, if identified lncRNA occurrences in ChIP-Seq and/or Ribo-Seq datasets can be determined; “Functional Prediction”, if potential functional roles of identified lncRNAs are assessed; and “Interactive Results”, if interactive and user-friendly results are generated directly. 
Tool | Online | Input | TCGA, SRA, or GEO Input | Known lncRNA | Novel lncRNA | Differential Expression | ChIP-seq / Ribo-seq | Functional Prediction | Interactive Results
LAGOOn | yes | raw | yes | yes | no | yes | yes | yes | yes
lncRNA-screen | no | raw | yes | yes | yes | yes | yes | no | yes
RNA-CODE | no | raw | no | yes | no | yes | no | no | no
lncRScan | no | raw | no | no | yes | yes | no | no | no
iSeeRNA | yes | raw | no | no | yes | yes | no | no | yes
Annocript | no | raw | no | no | yes | no | no | no | no
UClncR | no | pre-processed | no | no | no | no | no | no | no
LncRNA2Function | yes | pre-processed | no | yes | no | limited | no | yes | yes
circlncRNAnet | yes | pre-processed | no | yes | no | yes | no | yes | yes

DISCUSSION

Despite a mounting body of evidence supporting the physiological relevance of ncRNAs, most studies performed to date have focused primarily on proteins themselves or deciphering the pathways associated with annotated ncRNAs. Moreover, due to the perceived insurmountability of the sheer amount of data generated by NGS/TGS analyses, the full extent of regulatory networks created by ncRNAs often gets overlooked(68). In addition, whereas the cost of RNA-seq is now reasonable for most active research programs, tools necessary for the interpretation of these sequencing datasets typically require significant computational expertise and resources, markedly hindering widespread utilization of these tools. As such, the necessity for development of real-time, user-friendly platforms capable of making the identification and characterization of the ncRNAome accessible to biologists lacking significant computational expertise becomes clear.
In light of this, we have developed SALTS a highly accurate, super efficient, and extremely user-friendly one-stop shop for ncRNA transcriptomics. Notably, SALTS is accessed through an intuitive Web-based interface, can analyze either user-generated, standard NGS file uploads (e.g., FASTQ) or existing NCBI SRA datasets, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides. In short, SALTS constitutes the first publically available, Web- based, comprehensive ncRNA transcriptomic NGS analysis platform designed specifically for users with no computational background, providing a much needed, powerful new resource enabling more widespread ncRNA transcriptomic analysis. That said, an array of platforms and pipelines, each geared towards a specific type of transcript/ncRNA class, have previously been developed. Regardless of the platform, the core of ncRNA transcriptome expression analysis consists of two main steps: transcript detection and expression quantification(1)(3). The first step in this process involves aligning, or mapping, the NGS reads to a reference sequence(s), which can be either ncRNA sequence library or an entire reference genome. Most standard pipelines use alignment programs such as Bowtie2(69), BWA(70), NCBI’s BLAST(22) or other implementations of existing alignment algorithms like Smith-Waterman (SW)(71), Needleman-Wunsch (NW)(72), and Burrows Wheeler Transform (BWT)(73). These aligners often differ in how alignment mis-matches and gaps are scored and as such need to be taken into account when dealing with data containing high sequence variability between the individual transcripts originating from the same genomic locus or between the reads and the reference. In the second step, aligned reads are further analyzed to determine the expression, or the number of reads assigned to individual loci or library entries. 
This quantification step often includes or is followed by various statistical analyses to determine differential expression and/or variance between replicates (e.g., baySeq(74) or DESeq2(75)). That said, the strikingly high accuracy and efficiency achieved by our tools as compared to existing platforms are primarily due to a novel computational approach to RNA-Seq alignment and an innovative analysis based on Hilbert and vector spaces developed in the course of this work. Brief overviews of the primary constructs critical to toolkit implementation are given below, with more in-depth descriptions detailed in Supplemental Information Files 2 and 3.
SALTS toolkit implementation. Of note, both SURFR and LAGOOn were developed into real-time processing systems using the following technology stack:
Programming languages used: Python 3.7, Visual C++ 2015, Erlang, JavaScript, PHP, and SPARQL.
Database engines: MongoDB 4.4.
Servers: Apache Web Server, 30+ background servers composed using a Master-Worker model to parallelize the workload, and Apache Jena Fuseki.
Other tools and supporting technologies: RabbitMQ, Flask, Redis, Vue.js, Dropzone.js, ApexCharts.js, Bootstrap 4, IBM Aspera, Axios, Moment.js, Tabulator, Matplotlib, NumPy, SciPy, and HTML5.
Architecture: Microservices.
Hardware specs: Intel® Xeon® CPU E5-2609 v4 @ 1.70GHz, 64GB RAM, 4TB hard disk, Windows Server 2016.
SURFR implementation.
With SURFR, users with no computational background can quickly and easily analyze, visualize, and compare small RNA-Seq datasets in order to generate clear, informative results. With an interactive, user-friendly interface, SURFR is the first Web-based resource that lets users upload unmodified NGS datasets and/or provide SRA identifiers to perform comprehensive novel ncRNA and ncRNA fragment identifications and expression analyses in real-time. This is achieved through the following three key components:
(1) Hilbert Space (HS). In mathematics, an HS is an abstract vector space (with up to infinite dimensions) representing the current physical state of a continuous system, routinely applied in quantum mechanics. HSs are highly useful in describing the relationship among vector spaces, wavelets, and wave functions(76)(77). For our analyses, the term "Gene Expression" is considered a higher dimensional function representing the activity of the RNA across its length where, within an RNA, expression is represented using four vectors (for A, C, T, and G) and understood using HSs.
(2) MoVaK alignment. Based on the aforementioned HSs, we introduced two new data structures, namely, Similarity Vectors (SVs) and Differential Expression Vectors (DEVs). MoVaK alignment combines SVs and DEVs to profile the exact transcriptomic activity of a given RNA-Seq dataset and then retrieves an HS for each RNA that is expressed in a sample.
(3) SURFR algorithm. By defining the changes in gene expression using the above HS interpretation, we assign a wavelet function with scales of 18 to 38 to each sncRNA with microRNA-like behavior, i.e., miRNA-like RNAs with lengths ranging from 18 to 38 nt.
Importantly, our novel methodology carries several advantages over existing computational methods:
1. Compared to current, purely string comparison methods, DEVs take significantly less time to obtain.
2. Better visualization of ncRNA processing.
3. SURFR data structures consume very little memory, thus allowing real-time calculations.
4. Calculus-based modeling can be directly applied to DEVs to understand ncRNA behavior, thus providing a mathematical means to study transcriptomic functionality.
5. Our methodology is highly effective and accurate. To be more specific, our wavelet-based analysis on HS typically identifies ncRNA-derived RNA start and end positions with ≥95% identity (within 2 nt) to experimentally validated databases like miRBase, as opposed to state-of-the-art methods based on BAM files such as FlaiMapper, which have been reported to correctly predict 89% of miRNA start positions and 54% of miRNA end positions(78).
6. We have extended our computational methodology to 400+ organisms and all of their sncRNAs without the necessity to change any algorithmic criteria.
7. Our method can address the dynamism associated with transcriptomic analysis using topological interpretation.
LAGOOn implementation. Similar to SURFR, with LAGOOn, users with no computational background can quickly and easily analyze and compare raw RNA-Seq datasets to comprehensively evaluate lncRNA expression as well as the potential for lncRNAs to function as sncRNA hosts, miRNA sponges, antisense RNAs, microprotein transcripts, and/or regulators of genomic enhancers. In short, LAGOOn distinguishes itself from existing platforms by offering parallel, real-time expression analysis and functional prediction. Of note, LAGOOn is essentially based on an extended version of MoVaK alignment that similarly employs SVs to perform sequence alignments. In LAGOOn, however, the algorithm was modified during extension in order to trade time and space complexities within the alignment. A detailed explanation of these modifications is provided in Supplemental Information File 3.
REFERENCES
1. Veneziano,D., Nigita,G. and Ferro,A. (2015) Computational approaches for the analysis of ncRNA through deep sequencing techniques. Front. Bioeng. Biotechnol., 3.
2. Uchida,S. and Bolli,R. (2018) Short and Long Noncoding RNAs Regulate the Epigenetic Status of Cells. Antioxidants Redox Signal., 29, 832–845.
3. Wolfien,M., Brauer,D.L., Bagnacani,A. and Wolkenhauer,O. (2019) Workflow development for the functional characterization of ncRNAs. In Methods in Molecular Biology. Humana Press Inc., Vol. 1912, pp. 111–132.
4. Ulitsky,I. (2018) Interactions between short and long noncoding RNAs. FEBS Lett., 592, 2874–2883.
5. Nakahara,K. and Carthew,R.W. (2004) Expanding roles for miRNAs and siRNAs in cell regulation. Curr. Opin. Cell Biol., 16, 127–133.
6. Cheng,A.M., Byrom,M.W., Shelton,J. and Ford,L.P. (2005) Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res., 33, 1290–1297.
7. Hwang,H.W. and Mendell,J.T. (2006) MicroRNAs in cell proliferation, cell death, and tumorigenesis. Br. J. Cancer, 94, 776–780.
8. Singh,S., Chitkara,D., Mehrazin,R., Behrman,S.W., Wake,R.W. and Mahato,R.I. (2012) Chemoresistance in prostate cancer cells is regulated by miRNAs and Hedgehog pathway. PLoS One, 7.
9. Visone,R. and Croce,C.M. (2009) MiRNAs and cancer. Am. J. Pathol., 174, 1131–1138.
10. Rother,S. and Meister,G. (2011) Small RNAs derived from longer non-coding RNAs. Biochimie, 93, 1905–1915.
11. Martens-Uzunova,E.S., Olvedy,M. and Jenster,G. (2013) Beyond microRNA--novel RNAs derived from small non-coding RNA and their implication in cancer. Cancer Lett., 340, 201–211.
12. Patterson,D.G., Roberts,J.T., King,V.M., Houserova,D., Barnhill,E.C., Crucello,A., Polska,C.J., Brantley,L.W., Kaufman,G.C., Nguyen,M., et al. (2017) Human snoRNA-93 is processed into a microRNA-like RNA that promotes breast cancer cell invasion. NPJ Breast Cancer, 3, 25.
13. Olvedy,M., Scaravilli,M., Hoogstrate,Y., Visakorpi,T., Jenster,G. and Martens-Uzunova,E.S. (2016) A comprehensive repertoire of tRNA-derived fragments in prostate cancer. Oncotarget, 7, 24766–24777.
14. Ender,C., Krek,A., Friedländer,M.R., Beitzinger,M., Weinmann,L., Chen,W., Pfeffer,S., Rajewsky,N. and Meister,G. (2008) A Human snoRNA with MicroRNA-Like Functions. Mol. Cell, 32, 519–528.
15. Martens-Uzunova,E.S., Olvedy,M. and Jenster,G. (2013) Beyond microRNA--novel RNAs derived from small non-coding RNA and their implication in cancer. Cancer Lett., 340, 201–211.
16. Hirose,Y., Ikeda,K.T., Noro,E., Hiraoka,K., Tomita,M. and Kanai,A. (2015) Precise mapping and dynamics of tRNA-derived fragments (tRFs) in the development of Triops cancriformis (tadpole shrimp). BMC Genet., 16.
17. Durdevic,Z. and Schaefer,M. (2013) TRNA modifications: Necessary for correct tRNA-derived fragments during the recovery from stress? BioEssays, 35, 323–327.
18. Wu,W., Choi,E.J., Lee,I., Lee,Y.S. and Bao,X. (2020) Non-coding RNAs and their role in respiratory syncytial virus (RSV) and human metapneumovirus (hMPV) infections. Viruses, 12.
19. Zhou,K., Diebel,K.W., Holy,J., Skildum,A., Odean,E., Hicks,D.A., Schotl,B., Abrahante,J.E., Spillman,M.A. and Bemis,L.T. (2017) A tRNA fragment, tRF5-Glu, regulates BCAR3 expression and proliferation in ovarian cancer cells. Oncotarget, 8, 95377–95391.
20. Yates,A., Akanni,W., Amode,M.R., Barrell,D., Billis,K., Carvalho-Silva,D., Cummins,C., Clapham,P., Fitzgerald,S., Gil,L., et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710-6.
21. Huang,J., Gutierrez,F., Strachan,H.J., Dou,D., Huang,W., Smith,B., Blake,J.A., Eilbeck,K., Natale,D.A., Lin,Y., et al. (2016) OmniSearch: a semantic search system based on the Ontology for MIcroRNA Target (OMIT) for microRNA-target gene interaction data. J. Biomed. Semantics, 7, 25.
22. Camacho,C., Coulouris,G., Avagyan,V., Ma,N., Papadopoulos,J., Bealer,K. and Madden,T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
23. Leinonen,R., Sugawara,H. and Shumway,M. (2011) The sequence read archive. Nucleic Acids Res., 39.
24. Kalvari,I., Argasinska,J., Quinones-Olvera,N., Nawrocki,E.P., Rivas,E., Eddy,S.R., Bateman,A., Finn,R.D. and Petrov,A.I. (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res., 46, D335–D342.
25. Desgranges,E., Caldelari,I., Marzi,S. and Lalaouna,D. (2020) Navigation through the twists and turns of RNA sequencing technologies: Application to bacterial regulatory RNAs. Biochim. Biophys. Acta - Gene Regul. Mech., 1863.
26. Friedländer,M.R., Chen,W., Adamidi,C., Maaskola,J., Einspanier,R., Knespel,S. and Rajewsky,N. (2008) Discovering microRNAs from deep sequencing data using miRDeep. Nat. Biotechnol., 26, 407–415.
27. Humphreys,D.T. and Suter,C.M. (2013) MiRspring: A compact standalone research tool for analyzing miRNA-seq data. Nucleic Acids Res., 41.
28. Hackenberg,M., Rodríguez-Ezpeleta,N. and Aransay,A.M. (2011) MiRanalyzer: An update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res., 39.
29. Wu,X., Kim,T.K., Baxter,D., Scherler,K., Gordon,A., Fong,O., Etheridge,A., Galas,D.J. and Wang,K. (2017) SRNAnalyzer-A flexible and customizable small RNA sequencing data analysis pipeline. Nucleic Acids Res., 45, 12140–12151.
30. Rahman,R.U., Gautam,A., Bethune,J., Sattar,A., Fiosins,M., Magruder,D.S., Capece,V., Shomroni,O. and Bonn,S. (2018) Oasis 2: Improved online analysis of small RNA-seq data. BMC Bioinformatics, 19.
31. Kuksa,P.P., Amlie-Wolf,A., Katanić,Ž., Valladares,O., Wang,L.S. and Leung,Y.Y. (2018) SPAR: Small RNA-seq portal for analysis of sequencing experiments. Nucleic Acids Res., 46, W36–W42.
32. Hoogstrate,Y., Jenster,G. and Martens-Uzunova,E.S. (2015) FlaiMapper: computational annotation of small ncRNA-derived fragments using RNA-seq high-throughput data. Bioinformatics, 31, 665–673.
33. Shi,J., Ko,E.A., Sanders,K.M., Chen,Q. and Zhou,T. (2018) SPORTS1.0: A Tool for Annotating and Profiling Non-coding RNAs Optimized for rRNA- and tRNA-derived Small RNAs. Genomics, Proteomics Bioinforma., 16, 144–151.
34. Jeske,T., Huypens,P., Stirm,L., Höckele,S., Wurmser,C.M., Böhm,A., Weigert,C., Staiger,H., Klein,C., Beckers,J., et al. (2019) DEUS: An R package for accurate small RNA profiling based on differential expression of unique sequences. Bioinformatics, 35, 4834–4836.
35. Aparicio-Puerta,E., Lebrón,R., Rueda,A., Gómez-Martín,C., Giannoukakos,S., Jaspez,D., Medina,J.M., Zubkovic,A., Jurak,I., Fromm,B., et al. (2019) SRNAbench and sRNAtoolbox 2019: intuitive fast small RNA profiling and differential expression. Nucleic Acids Res., 47, W530–W535.
36. Liu,Q., Ding,C., Lang,X., Guo,G., Chen,J. and Su,X. (2019) Small noncoding RNA discovery and profiling with sRNAtools based on high-throughput sequencing. Brief. Bioinform., 10.1093/bib/bbz151.
37. Wan,C., Gao,J., Zhang,H., Jiang,X., Zang,Q., Ban,R., Zhang,Y. and Shi,Q. (2017) CPSS 2.0: A computational platform update for the analysis of small RNA sequencing data. Bioinformatics, 33, 3289–3291.
38. Liao,Y., Smyth,G.K. and Shi,W. (2014) FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30, 923–930.
39. Martens-Uzunova,E.S., Hoogstrate,Y., Kalsbeek,A., Pigmans,B., Vredenbregt-van den Berg,M., Dits,N., Nielsen,S.J., Baker,A., Visakorpi,T., Bangma,C., et al. (2015) C/D-box snoRNA-derived RNA production is associated with malignant transformation and metastatic progression in prostate cancer. Oncotarget, 6, 17430–44.
40. Derrien,T., Johnson,R., Bussotti,G., Tanzer,A., Djebali,S., Tilgner,H., Guernec,G., Martin,D., Merkel,A., Knowles,D.G., et al. (2012) The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res., 22, 1775–1789.
41. Ulitsky,I. and Bartel,D.P. (2013) lincRNAs: Genomics, evolution, and mechanisms. Cell, 154, 26.
42. Lam,M.T.Y., Li,W., Rosenfeld,M.G. and Glass,C.K. (2014) Enhancer RNAs and regulated transcriptional programs. Trends Biochem. Sci., 39, 170–182.
43. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.
44. Uszczynska-Ratajczak,B., Lagarde,J., Frankish,A., Guigó,R. and Johnson,R. (2018) Towards a complete map of the human long non-coding RNA transcriptome. Nat. Rev. Genet., 19, 535–548.
45. Mercer,T.R., Dinger,M.E. and Mattick,J.S. (2009) Long non-coding RNAs: Insights into functions. Nat. Rev. Genet., 10, 155–159.
46. Li,X., Wu,Z., Fu,X. and Han,W. (2014) LncRNAs: Insights into their function and mechanics in underlying disorders. Mutat. Res. - Rev. Mutat. Res., 762, 1–21.
47. Moran,V.A., Perera,R.J. and Khalil,A.M. (2012) Emerging functional and mechanistic paradigms of mammalian long non-coding RNAs. Nucleic Acids Res., 40, 6391–6400.
48. Wang,J., Liu,X., Wu,H., Ni,P., Gu,Z., Qiao,Y., Chen,N., Sun,F. and Fan,Q. (2010) CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Res., 38, 5366–5383.
49. Chen,H., Du,G., Song,X. and Li,L. (2017) Non-coding Transcripts from Enhancers: New Insights into Enhancer Activity and Gene Expression Regulation. Genomics, Proteomics Bioinforma., 15, 201–207.
50. Malecová,B. and Morris,K.V. (2010) Transcriptional gene silencing through epigenetic changes mediated by non-coding RNAs. Curr. Opin. Mol. Ther., 12, 214–222.
51. Stein,C.S., Jadiya,P., Zhang,X., McLendon,J.M., Abouassaly,G.M., Witmer,N.H., Anderson,E.J., Elrod,J.W. and Boudreau,R.L. (2018) Mitoregulin: A lncRNA-Encoded Microprotein that Supports Mitochondrial Supercomplexes and Respiratory Efficiency. Cell Rep., 23, 3710-3720.e8.
52. Fishilevich,S., Nudel,R., Rappaport,N., Hadar,R., Plaschkes,I., Iny Stein,T., Rosen,N., Kohn,A., Twik,M., Safran,M., et al. (2017) GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford)., 2017.
53. Arun,G. and Spector,D.L. (2019) MALAT1 long non-coding RNA and breast cancer. RNA Biol., 16, 860–863.
54. Jordan,N.V., Bardia,A., Wittner,B.S., Benes,C., Ligorio,M., Zheng,Y., Yu,M., Sundaresan,T.K., Licausi,J.A., Desai,R., et al. (2016) HER2 expression identifies dynamic functional states within circulating breast cancer cells. Nature, 537, 102–106.
55. Huang,X.J., Xia,Y., He,G.F., Zheng,L.L., Cai,Y.P., Yin,Y. and Wu,Q. (2018) MALAT1 promotes angiogenesis of breast cancer. Oncol. Rep., 40, 2683–2689.
56. Ruiz-Orera,J., Messeguer,X., Subirana,J.A. and Alba,M.M. (2014) Long non-coding RNAs as a source of new peptides. Elife, 3, 3523.
57. Davis,C.A., Hitz,B.C., Sloan,C.A., Chan,E.T., Davidson,J.M., Gabdank,I., Hilton,J.A., Jain,K., Baymuradov,U.K., Narayanan,A.K., et al. (2018) The Encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res., 46, D794–D801.
58. Lizio,M., Abugessaisa,I., Noguchi,S., Kondo,A., Hasegawa,A., Hon,C.C., De Hoon,M., Severin,J., Oki,S., Hayashizaki,Y., et al. (2019) Update of the FANTOM web resource: Expansion to provide additional transcriptome atlases. Nucleic Acids Res., 47, D752–D758.
59. Gong,Y., Huang,H.T., Liang,Y., Trimarchi,T., Aifantis,I. and Tsirigos,A. (2017) lncRNA-screen: An interactive platform for computationally screening long non-coding RNAs in large genomics datasets. BMC Genomics, 18.
60. Yuan,C. and Sun,Y. (2013) RNA-CODE: A Noncoding RNA Classification Tool for Short Reads in NGS Data Lacking Reference Genomes. PLoS One, 8.
61. Sun,L., Liu,H., Zhang,L. and Meng,J. (2015) lncRScan-SVM: A tool for predicting long non-coding RNAs using support vector machine. PLoS One, 10.
62. Pyfrom,S.C., Luo,H. and Payton,J.E. (2019) PLAIDOH: A novel method for functional prediction of long non-coding RNAs identifies cancer-specific LncRNA activities. BMC Genomics, 20.
63. Jiang,Q., Ma,R., Wang,J., Wu,X., Jin,S., Peng,J., Tan,R., Zhang,T., Li,Y. and Wang,Y. (2015) LncRNA2Function: A comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data. BMC Genomics, 16.
64. Wu,S.M., Liu,H., Huang,P.J., Chang,I.Y.F., Lee,C.C., Yang,C.Y., Tsai,W.S. and Tan,B.C.M. (2018) circlncRNAnet: An integrated web-based resource for mapping functional networks of long or circular forms of noncoding RNAs. Gigascience, 7, 1–10.
65. Sun,K., Chen,X., Jiang,P., Song,X., Wang,H. and Sun,H. (2013) iSeeRNA: Identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics, 14.
66. Musacchia,F., Basu,S., Petrosino,G., Salvemini,M. and Sanges,R. (2015) Annocript: A flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs. Bioinformatics, 31, 2199–2201.
67. Sun,Z., Nair,A., Chen,X., Prodduturi,N., Wang,J. and Kocher,J.P. (2017) UClncR: Ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci. Rep., 7.
68. Sun,Y.-M. and Chen,Y.-Q. (2020) Principles and innovative technologies for decrypting noncoding RNAs: from discovery and functional prediction to clinical application. J. Hematol. Oncol., 13, 109.
69. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
70. Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595.
71. Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, 44–51.
72. Phillips,A.J. (2006) Homology assessment and molecular sequence alignment. J. Biomed. Inform., 39, 18–33.
73. Lippert,R.A. (2005) Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J. Comput. Biol., 12, 407–415.
74. Kvam,V.M., Liu,P. and Yaqing,S. (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am. J. Bot., 99, 248–256.
75. Costa-Silva,J., Domingues,D. and Lopes,F.M. (2017) RNA-Seq differential expression analysis: An extended review and a software tool. PLoS One, 12.
76. Steeb,W.-H. (1998) Hilbert Spaces, Wavelets, Generalised Functions and Modern Quantum Mechanics. Springer Science & Business Media.
77. Debnath,L. and Mikusinski,P. (2005) Introduction to Hilbert Spaces with Applications. Academic Press.
78. Hoogstrate,Y., Jenster,G. and Martens-Uzunova,E.S. (2015) FlaiMapper: computational annotation of small ncRNA-derived fragments using RNA-Seq high-throughput data. Bioinformatics, 31, 665–673.
10_1101-2021_02_08_430343 ---- 96291204 Patient-specific cell communication networks associate with disease progression in cancer

David L Gibbs1, Boris Aguilar1, Vésteinn Thorsson1, Alexander V Ratushny2, Ilya Shmulevich1

1Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109, USA; 2Bristol-Myers Squibb, 400 Dexter Avenue North, Suite 1200, Seattle, WA 98109, USA

Correspondence: David L Gibbs, david.gibbs@isbscience.org

Abstract
The maintenance and function of tissues in health and disease depend on cell-cell communication. This work shows how high-level features representing cell-cell communication can be defined and used to associate certain signaling 'axes' with clinical outcomes. Using cell-sorted gene expression data, we generated a scaffold of cell-cell interactions and defined a probabilistic method for creating per-patient weighted graphs based on gene expression and cell deconvolution results. With this method, we generated over 9,000 graphs for TCGA patient samples, each representing likely channels of intercellular communication in the tumor microenvironment. Particular edges were strongly associated with disease severity and progression, in terms of survival time and tumor stage. Within individual tumor types, there are predominant cell types, and the collection of associated edges was found to be predictive of clinical phenotypes. Additionally, genes associated with differentially weighted edges were enriched in Gene Ontology terms associated with tissue structure and immune response. Code, data, and notebooks are provided to enable the application of this method to any expression dataset (https://github.com/IlyaLab/Pan-Cancer-Cell-Cell-Comm-Net).
Keywords
Networks, cell communication, immuno-oncology, computational oncology, bioinformatics, systems biology

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.08.430343; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Introduction
The maintenance and function of tissues depend on cell-cell communication (Wilson et al., 2000; Haass and Herlyn, 2005). While cell communication can take place through physically binding cell membrane surface proteins, cells also release ligand molecules that diffuse and bind to receptors on other cells (paracrine or endocrine), or even the same cell (autocrine), triggering a signaling cascade that can potentially activate a gene regulatory program (Cameron and Kelvin, 2013; Heldin et al., 2016; Cohen and Nelson, 2018). More generally, a message is sent and received, transferring some information as part of a large network (Frankenstein et al., 2006). Cells communicate in order to coordinate activity, for example to respond correctly (and jointly) to environmental changes (Song et al., 2019).
Altered cellular communication can cause disease, and conversely diseases can alter communication (Wei et al., 2004). Cancer, once thought of as purely a disease of genetics, is now recognized as being enmeshed in complex cellular interactions within the tumor microenvironment (TME) (Trosko and Ruch, 1998).
The cell-cell interactions are important for cell differentiation, tumor growth (West and Newton, 2019), and response to therapeutics (Kumar et al., 2018).
Between cells, information transfer is directional in nature: cells produce molecules that are received by the properly paired, and expressed, receptor. There is often a sender and a receiver, which makes the cell-cell networks directionally linked by molecules. The dynamics of the signal are highly important (Fridman et al., 2012; Behar et al., 2013) but are unfortunately difficult to detect in bulk sequencing experiments. One approach to studying cell interactions is through the use of graphical models of communication networks (Morel et al., 2017). By incorporating experimental data, the graphical models can become quantitative, providing predictions that can be tested and used in discovering novel drug targets and developing optimal intervention strategies.
In recent work (Thorsson et al., 2018), we developed a method to identify cellular communication networks at work in the tumor microenvironment. Given a set of samples with a similar tumor microenvironment, the method identified ligands, receptors, and cells meeting certain criteria of abundance and concordance within that set of samples. The method was applied to identify networks playing a role within specific tumor types and molecular subtypes and is available as a workflow and interactive module on the iAtlas portal for immuno-oncology (Eddy et al., 2020).
In this work, we have combined multiple sources of data with a new probabilistic method for constructing patient-specific cell-cell communication networks (Figure 1). In total, we built networks for 9,234 samples in The Cancer Genome Atlas (TCGA), starting from a network of 64 cell types and 1,894 ligand-receptor pairs.
This is a rich feature set from which to investigate biological alterations in cell communication within the tumor microenvironment. We identified informative network features that are associated with disease progression. The method can be applied to any cancer type, but in this manuscript we focus on a selection of cancer types with very high mortality rates, including pancreatic adenocarcinoma (PAAD), melanoma (SKCM), lung squamous cell carcinoma (LUSC), and cancers of the gastrointestinal tract (ESCA, STAD, COAD, READ) (Cancer Genome Atlas Network, 2015).
This represents a new method that provides information on possible modes of intercellular signaling in the TME, something that is currently lacking. While there are many methods for gene set scoring, cellular abundance estimation, and differential expression, there are still few ways to investigate cell-cell communication diversity in the TME with respect to patient outcomes. Fortunately, new databases of receptor-ligand pairs are becoming available, making work in this area possible (Efremova et al., 2019; Jin et al., 2020; Nath and Leier, 2020; Shao et al., 2020). The methods, code, data, and complete results are available and open to all researchers (https://github.com/IlyaLab/Pan-Cancer-Cell-Cell-Comm-Net).
Methods
Data aggregation and integration
Data sources including TCGA and cell-sorted gene expression, bulk tumor expression, cell type scores, and cell-ligand and cell-receptor presence estimations were used for network construction and probabilistic weighting on a per-sample basis.

Each tumor sample is composed of a mixture of cell types including tumor, immune, and stromal cells. Recently, methods have been developed to 'deconvolve' mixed samples into estimated fractions of cell type quantities. For example, xCell, which resembles gene set enrichment, has performed this estimation for 64 cell types across most TCGA samples (Aran et al., 2017). We use these xCell estimates of cellular fractions in this work.

Ramilowski et al. performed a comprehensive survey of cellular communication, generating a compendium that includes 1,894 ligand-receptor pairs and a mapping between 144 cell types and expression of ligand or receptor molecules (Ramilowski et al., 2015). The compendium was shared via the 5th edition of the FANTOM Project, FANTOM5. These ligand-receptor pairs were adopted for this study.
Unfortunately, the FANTOM5 collection of cell types does not overlap well with the cell types in xCell. In order to integrate the xCell and FANTOM5 data resources, it was necessary to determine the expressed ligands and receptors for each of the 64 cell types in xCell, using the source gene expression data.

The xCell project used six public cell-sorted bulk gene expression data sets in order to generate gene signatures and score each TCGA sample. Across the data sets, there is some discrepancy in cell type nomenclature, making it necessary to manually curate cell type names to improve alignment across experiments (Supplementary Table 1). Typically, for a given cell type, there are several replicate expression profiles, often across the data sets.

Building the cell-cell communication network scaffold
In the FANTOM5 'draft of cellular communication', an expression threshold of 10 TPMs was used to link a cell type to a ligand or receptor. When considering the distribution of expression in the FANTOM5 project, 10 TPMs is close to the median.

To construct our scaffold, we used a majority voting scheme based on comparing expression levels to median levels. For each cell type, paired with ligands and receptors, if the expression level was greater than the median, it was counted as a vote (i.e., ligand expressed in this cell type). If a ligand or receptor received a majority vote across all available data sources, it was accepted and entered into the cell-cell scaffold.

With this procedure, a network scaffold is induced, where cells produce ligands that bind to receptors on receiving cells. One edge in the network is composed of the components cell - ligand - receptor - cell. This produced a cell-cell communication network with over 1M edges.
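The voting and scaffold construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the expression values are invented, the helper names `is_expressed` and `build_scaffold` are hypothetical, and the median is assumed to be taken over all genes within each expression profile, which the text does not specify.

```python
from statistics import median


def is_expressed(gene, profiles):
    """Majority vote: a profile votes 'expressed' when the gene exceeds that
    profile's median expression (assumed here to be the median over all genes)."""
    votes = [profile[gene] > median(profile.values()) for profile in profiles]
    return sum(votes) > len(votes) / 2


def build_scaffold(cell_profiles, ligand_receptor_pairs):
    """Enumerate cell - ligand - receptor - cell edges supported by the votes."""
    edges = []
    for ligand, receptor in ligand_receptor_pairs:
        senders = [c for c, p in cell_profiles.items() if is_expressed(ligand, p)]
        receivers = [c for c, p in cell_profiles.items() if is_expressed(receptor, p)]
        edges.extend((s, ligand, receptor, r) for s in senders for r in receivers)
    return edges


# Invented expression profiles: three data sources per cell type.
cell_profiles = {
    "CD8 T cell": [
        {"IFNG": 50, "IFNGR1": 30, "CSF1": 1, "CSF1R": 2, "GAPDH": 20},
        {"IFNG": 40, "IFNGR1": 8, "CSF1": 30, "CSF1R": 1, "GAPDH": 25},
        {"IFNG": 60, "IFNGR1": 3, "CSF1": 4, "CSF1R": 35, "GAPDH": 30},
    ],
    "Macrophage": [
        {"IFNG": 1, "IFNGR1": 40, "CSF1": 10, "CSF1R": 60, "GAPDH": 20},
        {"IFNG": 2, "IFNGR1": 15, "CSF1": 45, "CSF1R": 50, "GAPDH": 22},
        {"IFNG": 1, "IFNGR1": 50, "CSF1": 12, "CSF1R": 70, "GAPDH": 25},
    ],
}
pairs = [("IFNG", "IFNGR1"), ("CSF1", "CSF1R")]

scaffold = build_scaffold(cell_profiles, pairs)
print(scaffold)
```

In this toy input only IFNG (in the T cell profiles) and IFNGR1 (in the macrophage profiles) win a majority, so a single directed edge results. CSF1 is above the median in only one of three profiles for either cell type, so the CSF1-CSF1R pair contributes no edge even though CSF1R is expressed.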
Each edge represents a possible interaction in the tumor microenvironment. We subsequently determine the probability that an edge is active in a particular patient sample using a probabilistic method described below.

Patient level cell-cell communication network weights

With a cell-cell scaffold, expression values, and cell type estimations per sample, we can produce a per-sample weighted cell-cell communication network (Figure 2). This is done probabilistically, using the following definition:

$P(e_i) = P(l_a, c_l) \cdot P(r_b, c_r)$, (Eq. 1)

where $e_i$ is edge i, $l_a$ is ligand a, $r_b$ is receptor b, and $c_l$ and $c_r$ are cells that can produce ligand a and receptor b respectively. $P(e_i)$ represents a probability that edge i is active and is based on the premise that the physical and biochemical link and activation is possible only if all the components are present, and that activity becomes increasingly possible with greater availability of those components. The joint probabilities can be decomposed to:

$P(l_a, c_l) = P(l_a \mid c_l)\, P(c_l)$ and
$P(r_b, c_r) = P(r_b \mid c_r)\, P(c_r)$. (Eq. 2)

$P(c_{lk})$ is short for the CDF $P(C_l < c_{lk})$, which indicates the probability that a randomly sampled value from the empirical $C_l$ distribution (over all 9K TCGA samples) would be less than the cell estimate for cell type l, in sample k.
To do this, for a given cell type, using all samples available, an empirical distribution $P(C_l)$ is computed, and for any query value $c_{lk}$, the probability can be found by integrating from 0 to $c_{lk}$.

To compute $P(l_a \mid c_l)$, each $C_l$ distribution was divided into quartiles, and then (again using the 9K samples) empirical gene expression distributions within each quartile were fit. This expresses the probability that, given an observed cell quantity (and thus a quartile), a randomly selected gene expression value (for gene $l_a$) would be lower than what is observed in sample k.

We define "edge weights" as the probability shown in Eq. (1). To compute edge weights, each TCGA sample was represented as a column vector of gene expression and a column vector of cell quantities (or enrichments). For each edge in the scaffold (cell-ligand-receptor-cell), the data were used to look up probabilities using the defined empirical distributions, and the product of these probabilities gave the edge weight. This leads to over 9K tumor-specific weighted networks, one for each TCGA participant.

Probability distributions were precomputed using the R language empirical cumulative distribution function (ecdf). For example, fitting P(CD8 T cells) is done by taking all available estimates across the Pan-Cancer samples and computing the ecdf. Then, for a sample k, we find $P(C_l < c_{lk})$ using the ecdf. The same technique is used to find the conditional probability functions, where for each gene, the expression values are selected after binning samples using the R function 'quantile', and then used to compute the ecdf.
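The lookup described above, an empirical CDF for the cell quantity and a quartile-specific empirical CDF for gene expression, can be illustrated with a small Python sketch. The authors report using R's ecdf and quantile; the version below is a stand-in under stated assumptions, with hypothetical function names, a simple rank-based quartile rule, and made-up toy values.

```python
from bisect import bisect_left, bisect_right


def ecdf(values):
    """Return F where F(x) is the fraction of `values` <= x.

    This is the convention of R's ecdf; the text writes the strict form
    P(C < c), which differs only when x ties with an observed value.
    """
    ordered = sorted(values)
    return lambda x: bisect_right(ordered, x) / len(ordered)


def quartile_index(ordered_scores, score):
    """Quartile (0 to 3) of `score` among the sorted reference scores."""
    rank = bisect_left(ordered_scores, score)
    return min(4 * rank // len(ordered_scores), 3)


def fit_cell_gene_model(cell_scores, expression):
    """Fit P(cell) and one P(gene | cell quartile) ECDF per quartile."""
    ordered = sorted(cell_scores)
    levels_by_quartile = {}
    for score, level in zip(cell_scores, expression):
        quartile = quartile_index(ordered, score)
        levels_by_quartile.setdefault(quartile, []).append(level)
    conditional = {q: ecdf(v) for q, v in levels_by_quartile.items()}
    return ordered, ecdf(cell_scores), conditional


def joint_probability(model, cell_score, level):
    """P(gene, cell) = P(gene | cell) * P(cell), following Eq. 2."""
    ordered, cell_cdf, conditional = model
    quartile = quartile_index(ordered, cell_score)
    return conditional[quartile](level) * cell_cdf(cell_score)


def edge_weight(ligand_joint, receptor_joint):
    """P(edge) as the product of both joint probabilities, following Eq. 1."""
    return ligand_joint * receptor_joint


# Toy cohort of eight samples (hypothetical values).
cell_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ligand_expr = [5, 1, 2, 8, 3, 9, 4, 10]
model = fit_cell_gene_model(cell_scores, ligand_expr)

# Sample k: cell score 0.7 (top quartile) and ligand expression 4.
ligand_joint = joint_probability(model, 0.7, 4)
```

In the toy cohort the top quartile holds expression values 4 and 10, so the conditional term is 0.5 and the cell term is 7/8, giving a joint probability of 0.4375. A receptor-side value computed the same way would be multiplied in with `edge_weight` to obtain the weight of one edge in one sample.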
With all distributions precomputed, 9.8 billion joint probability functions were computed using an HPC environment, then transferred to a Google BigQuery table where analysis proceeded. This table of network weights was structured so that each row contained one weight from one edge and one tumor sample. Although this is a large table of 9.8 billion rows, taking nearly 500GB, BigQuery allows for fast analytical queries that can produce statistics using a selection of standard mathematical functions.

Association of network features and survival-based phenotypes

The S1 statistic is a robust measure based on the difference of medians (Yahaya et al., 2004; Ahad et al., 2016; Babu et al., 1999; Hubert et al., 2012), in this case the median of edge weights for a defined phenotypic group. S1 statistics were computed using the NCI cancer research data commons cloud resource, the ISB-CGC, per tissue type.

$S_1 = \dfrac{\mathrm{median}(X) - \mathrm{median}(Y)}{\sqrt{1.4826\,\mathrm{MAD}(X) + 1.4826\,\mathrm{MAD}(Y)}}$

This statistic allowed for cell-cell interactions to be ranked within a defined context. The results were again saved to BigQuery tables to allow for further cloud-based analysis and integration with underlying data.

To judge the magnitude of the statistic with respect to a random context (Figure 3), an ensemble of three edge-weight sample-pools was generated, each with 100K weights. Then, for each member of the ensemble, 1 million S1 statistics were generated using sample sizes that match the analyzed data.
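As a rough illustration of the statistic and the resampled null, the following Python sketch computes S1 for two groups and draws a small null distribution from a pool of weights. It rests on assumptions that the text does not settle: the square root is taken over the whole denominator, each group is drawn from the pool without replacement, and the pool size, group sizes, and number of draws are toy values far smaller than the 100K weights and 1 million statistics reported.

```python
import random
from math import sqrt
from statistics import median


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def s1_statistic(x, y):
    """Difference of medians scaled by a MAD-based spread.

    Assumption: the square root covers the full denominator,
    sqrt(1.4826 * MAD(x) + 1.4826 * MAD(y)). The value is undefined
    (division by zero) when both MADs are zero.
    """
    spread = sqrt(1.4826 * mad(x) + 1.4826 * mad(y))
    return (median(x) - median(y)) / spread


def null_s1(pool, n_x, n_y, draws, seed=0):
    """S1 values for random group pairs drawn from one pool of weights."""
    rng = random.Random(seed)
    return [
        s1_statistic(rng.sample(pool, n_x), rng.sample(pool, n_y))
        for _ in range(draws)
    ]


# Toy pool standing in for one edge-weight sample-pool.
pool_rng = random.Random(1)
toy_pool = [pool_rng.random() for _ in range(1000)]
toy_null = null_s1(toy_pool, n_x=20, n_y=30, draws=200)

# Toy observed groups: medians 3 and 4, both with MAD 1.
observed = s1_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

An observed statistic can then be compared with the tail of `toy_null`. Using group sizes that match the real contrast matters because the spread of the null narrows as the groups grow.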
These random S1 statistic distributions were used to compare to the observed results (i.e., a resampling procedure).

As an initial examination of the interplay of cell communication and disease, two proxies of disease severity were investigated: progression-free interval (PFI) and tumor stage (Liu et al., 2018). The staging variable used the AJCC pathologic tumor stage. The PFI feature was computed using days until a progression event. The staging variable was binarized by binning stages I-II together (“early stage”), and III-IV together (“late stage”). A binary PFI variable was created by computing the median PFI on non-censored samples and then applying the split to all samples. Both clinical features were computed by tissue type (TCGA Study). As Liu et al. write, "The event time is the shortest period from the date of initial diagnosis to the date of an event. The censored time is from the date of initial diagnosis to the date of last contact or the date of death without disease."

For example, in LUSC, the median time to PFI event was 420 days (14.0 months) and in the censored group, 649 days (21.1 months). After splitting samples at 420 days (14 months), the short PFI group was composed of 67 uncensored samples and 128 censored samples. The long PFI group was composed of 68 uncensored samples and 234 censored samples.

Null distributions, using these same sample sizes (e.g., one group of 68 and another group of 234), were generated by repeatedly drawing from the previously described ensemble of three sample-pools. The distributions, while heavy tailed, were close to Normal (Supplemental Figure 1). The S1 statistics scale
with the difference in median values (Supplemental Figure 4). After combining resampled statistics across the ensemble, an edge was selected as a high edge weight if it was in the top 1 millionth percentile when compared to the null. Each tissue and contrast generates a weighted subgraph of the starting scaffold, which is retained for further analysis (e.g., a LUSC-PFI network).

To identify informative cell-cell edges that relate to disease progression, machine learning models were trained on binarized clinical data as described. With clinical features such as progression free interval (PFI) and tumor stage for each sample, a matrix of patient-specific edge weights was constructed representing each tissue and contrast. Classification of samples was performed with XGBoost classifiers (Chen and Guestrin, 2016), which are composed of an ensemble of tree classifiers. To avoid overfitting the models, the tree depth was set at a maximum of 2 and the early-stopping parameter was set at 2 rounds (training was stopped after the classification error did not improve on a test set for two rounds). XGBoost provides methods for determining the information gain of each feature in the model and was used to rank edges that are most informative for classification.

Gene ontology (GO) term enrichment was performed using the GONet tool (Pomaznoy et al., 2018). The set of 1,175 genes in the cell-cell scaffold was used as the enrichment background.
GONet builds on the "goenrich" software package, which maps genes onto terms and propagates them up the GO graph, performs Fisher's exact tests, and moderates results with FDR. To compare the results, random collections of genes were generated from the cell-cell scaffold and produced no significant results.

Results

The scaffold network graph is heterogeneous, containing nodes representing cells, ligands (e.g. cytokines), and receptors. Edges are directed, following communication routes from cell to cell. To simplify the graph, the chain in which a cell produces a ligand that binds a receptor found on another cell type is treated as a single edge, "LCell-Ligand-Receptor-RCell". In total, there were 1,062,718 cell-cell edges in the network.

The number of edges for ligand-producing cells varies from 32,910 for osteoblasts to 6,587 for Multi-Potent Progenitors (MPPs). For receptor-producing cells, the range spans from 30,225 for platelets to 5,763 edges for MPPs.

Applying the proposed probabilistic framework allowed for the creation of 9,234 weighted networks. The edge weight distributions generally follow an approximately exponentially decreasing function (Supplemental Figure 1). There are few edges with strong weights and many with low (near zero) weights.

We first sought to find communication edges that were most characteristic of an individual tumor type. The S1 statistics comparing one tumor type to all other tissue types were computed, with a high score indicating a substantial difference in edge weights between the two groups. Edges were found that clearly delineated tissues (Figure 4). For example, in SKCM (skin cutaneous melanoma), the top scoring edge is between melanocytes, the cell of origin for cutaneous melanoma (Melanocytes-MIA-CDH19-Melanocytes, S1 score 2.5, median edge weight 0.86 higher than in other tumor types).
Normal tissue differences can contribute to differences in edge weights, though in this case the central role of melanocytes in melanomas implies that the high scores are likely due to cancerous cell activity. The study with the most similar edge weights is uveal melanoma (UVM), which arises from melanocytes resident in the uveal tract (Robertson et al., 2017) (Fig. 4A). Additionally, we observed that when a cell type is highly prevalent in a particular tissue, and the scaffold has an autocrine loop, interactions between cells of that type tend to have high weights. If we exclude cell types communicating with self-types, then for SKCM, osteoblasts, natural killer T cells, and mesenchymal stem cells (MSCs) interact with melanocytes in the top 10 scoring edges, consistent with the emerging role of these cell types in melanoma. An important role for osteoblasts is now coming to light for melanoma (Ferguson et al., 2020). Natural killer T cells are being investigated for their applicability in immunotherapy of cancers such as melanoma (Wolf et al., 2018). MSCs appear to interact with melanoma cells, as work by Zhang et al. (2017) showed that the proliferation of A375 cells (a melanoma cell line) was inhibited and the cell cycle of A375 was arrested by MSCs, and cell-cell signaling related to NF-κB was down-regulated.
Overall, the number of high weight edges in each tumor type did not associate with the number of samples, as might be expected (Supplemental Figure 2).

To identify which elements of cellular communication networks might be associated with clinical progression of particular tumor types, we identified edges associated with disease. Disease progression and severity were examined using dichotomous values of tumor stage and progression free interval (PFI) as described in the methods. Statistical scores were calculated comparing edge weight distributions between the two clinical groups using S1. Results were carried forward if larger than the threshold set by the millionth percentile of resampled statistics (Supplemental Figure 3-5), yielding differentially weighted edges (DWEs).

Most tumor types showed DWEs for PFI, and fewer for the early to late tumor stage comparison (Supplemental Figure 5). For example, STAD (gastric adenocarcinoma) had several hundred edges for both comparisons, while PAAD (pancreatic adenocarcinoma) showed fewer DWEs, and only for PFI. Figure 5 shows median edge weights between the two groups for the selected studies. Some tumor types, like SKCM, show much stronger deviations between the medians compared to other studies like STAD, ESCA, and LUSC, which may be an indication of a stronger immune response. According to CRI-iAtlas (Eddy et al., 2020), among our example studies, SKCM has the highest estimated level of CD8 T cells and generally has a robust immune response.

Tumor stage comparison showed DWEs in 17 of 32 studies and ranged widely from 4 edges for MESO (mesothelioma) to over 63K edges for BLCA (bladder urothelial carcinoma). The PFI comparison showed results in 28 of 32 studies and ranged from 4 edges in READ to over 21K in LIHC. See Table 1 for edge counts from selected studies.
The studies with larger numbers of samples had permuted S1 distributions that were narrow compared to studies with few samples (Supp. Fig. 3), but there was not a strong association between DWE counts and sample sizes. The variation thus more likely has to do with clinical factors.

Within a tumor type and clinical response variable, the set of high scoring edges was usually dominated by a small number of cell types, ligands, or receptors (Figure 6, Supplemental Figure 6A,6B). For SKCM, in the tumor stage contrast, the majority of ligand-producing cells are GMP cells, osteoblasts, MSCs, and melanocytes, in order of prevalence. The number of edges starting with these four cells accounts for 53% of DWEs. Certainly, melanocytes are well known in melanoma, and mesenchymal stem cells are drawn to inflammation; the role of osteoblasts is less well documented, but they have still been associated with melanoma progression (Ferguson et al., 2020).

In the PFI contrast with gastroesophageal cancers, megakaryocytes are the most common cell type in STAD DWEs (40 edges out of 142), and the second most common in ESCA (49 edges out of 137, following CD8+ Tcm interactions).
The megakaryocyte DWEs include ligands and receptors that represent both interleukins and ECM-associated molecules such as integrins and collagen, but also NOTCH1 and PF4 (platelet factor 4). For STAD, most edge weights are lower with longer PFI. Put another way, the shorter PFI intervals (adverse outcome) were associated with increased megakaryocyte-involved edge weights (Supplemental Figure 7).

However, the opposite is observed in ESCA, where higher edge weights were generally associated with longer PFI (negative S1 score). In ESCA, edges that show high weights for short PFI include Neutrophils-HMGB1-SDC1-Sebocytes (0.17). Although ESCA has a much lower xCell mean megakaryocyte score than STAD (38% lower), the cell score trends from xCell run in opposite directions, with STAD decreasing with longer PFI and ESCA increasing with PFI. STAD is among the tissues with the highest megakaryocyte scores (ranks 59 and 56 out of 64 for PFI 1 and 2, respectively), while ESCA is at a respectable rank of 49 and 44 out of 64 for short-PFI, respectively.

In COAD (colorectal adenocarcinoma), for ligand-producing cells, the DWEs were dominated by astrocytes, MSCs, megakaryocytes, and sebocytes, while receptor-producing cells included astrocytes, chondrocytes, and MSCs in order of counts of DWEs. By summarizing DWEs we can possibly categorize cancer types based on which cells are taking part in potentially active interactions.

The above-described edge dominance is related to cells (graph nodes) with high degree. In the language of graphs, the degree is the count of edges connected to a given node or vertex. In STAD the cell types with highest degree are megakaryocytes (degree 50), followed by neutrophils (31), CLP cells (26), and erythrocytes (23) (Supplemental Figure 6A,B).
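Degree counts of this kind, and their split into incoming and outgoing edges, can be tallied directly from a list of cell-ligand-receptor-cell edges. The Python sketch below is a minimal illustration; the function name and the toy edges (with placeholder ligand and receptor labels L1 to L4 and R1 to R4) are hypothetical and do not come from the STAD results.

```python
from collections import Counter


def cell_degrees(edges):
    """Count in-, out-, and total degree for each cell type.

    Each edge is (ligand_cell, ligand, receptor, receptor_cell) and is
    directed from the ligand-producing cell to the receptor-bearing cell.
    An autocrine edge adds one to both the in- and out-degree of its cell.
    """
    out_degree = Counter(edge[0] for edge in edges)
    in_degree = Counter(edge[3] for edge in edges)
    return {
        cell: {
            "in": in_degree[cell],
            "out": out_degree[cell],
            "total": in_degree[cell] + out_degree[cell],
        }
        for cell in set(out_degree) | set(in_degree)
    }


# Hypothetical edges with placeholder ligand and receptor names.
toy_edges = [
    ("Megakaryocytes", "L1", "R1", "Neutrophils"),
    ("Neutrophils", "L2", "R2", "Megakaryocytes"),
    ("B cells", "L3", "R3", "Th1"),
    ("Megakaryocytes", "L4", "R4", "Th1"),
]
toy_degrees = cell_degrees(toy_edges)
```

In this toy network, B cells only send and Th1 cells only receive, which is the kind of asymmetry that total degree alone would hide.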
However, if we look at the directionality of the directed graph, we see that while megakaryocyte edges are split nearly evenly between incoming and outgoing, cells like Th1 have 5 edges in and only a single edge out, whereas B cells have zero edges in and 4 edges out. The network directionality should be considered in activities such as the modeling of dynamical systems.

Within the tumor microenvironment, communication between the multitude of cells happens simultaneously through many ligand-receptor axes. By considering a set of differentially weighted edges within a tissue type, we can construct connected networks that potentially represent dynamic communication. DWEs derived by comparing edge weights between clinical groups may indicate which parts of the cell-cell communication network shift together with disease severity.

We sought to identify which aspects of intercellular communication could relate to tumor staging or disease severity. The edges making up the differential networks were used to model clinical states for individual tumors. XGBoost models (Chen and Guestrin, 2016) were fit on each clinical feature, using edge weights as predictive variables, to infer which edges carried the most information in classification (Figure 8).

The purpose of the modeling was within-data inference rather than classification outside of the TCGA pan-cancer data set. After fitting, it is possible to examine which model features (edges) are most useful in classification. Because the XGBoost classifiers are regularized models, not all features will be used, and often only a small subset of features is retained in the final model. We assess the relative usefulness of a feature by comparing the feature gain, that is, the improvement in accuracy when a feature is added to a tree.
All classification models had an accuracy between 91% (SKCM, PFI) and 99% (COAD, Stage). As mentioned above, there can be a high degree of correlation between edge values in a data set. While features are selected first based on improving prediction, the machine learning model accounts for correlated features by selecting the one that has the best predictive power, leaving out other correlated features. The number of features selected by the model is therefore related to the correlation structure. In a set of uncorrelated features where all features add to the predictive power, all features will be selected, whereas for correlated features, only a small number will be selected. This is seen in the results here in terms of differences in the numbers of features compared with the starting network.

In the COAD-PFI case, the number of features was reduced by approximately 20%, keeping 50 edges in the model. The STAD-PFI features were reduced by approximately 45%. Other examples are LUSC-PFI at 60% reduction, ESCA-PFI at 74%, and SKCM-PFI at 95% (12 edges selected), indicating a high degree of internal feature correlation.

A similar pattern was observed in the tumor stage contrasts, where SKCM-stage had a 96% reduction in features, STAD-stage 52%, and READ-stage 47%. For COAD-stage, feature reduction was 95%, but attributable to the large number of starting edges (1851) compared to the 84 edges selected.
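As a worked check of how these percentages can be read, assume that "reduction" means the fraction of starting edges not retained in the fitted model (the text does not define it explicitly). With the COAD-stage counts given above:

$$\text{reduction} = 1 - \frac{n_{\text{selected}}}{n_{\text{starting}}} = 1 - \frac{84}{1851} \approx 0.955,$$

which is consistent with the reported figure of about 95%.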
A collection of the most predictive edges is given in Table 2.

The collection of genes from each differential network was used for gene ontology (GO) term enrichment using the GONet tool (Pomaznoy et al., 2018). All tissue-contrast combinations with differentially weighted edges produced enriched GO terms (FDR < 0.05, within tissue contrasts) except the SKCM-stage group, which, although it contained 77 genes in the differential network, produced no enriched terms.

Common themes included structural GO terms such as "extracellular structure organization" (for SKCM), cell-substrate adhesion (ESCA, LUSC), cell-cell adhesion (STAD), and ECM / extracellular matrix organization (LUSC, COAD, READ, STAD). Cell migration was also a common theme with "cell migration" (STAD), "epithelial cell migration" (SKCM), and "regulation of cell migration" (LUSC, COAD/READ). Among immune related themes, GO terms included "IFNG signaling" and "antigen processing and presentation" (SKCM), "regulation of immune processes" and "IL2" (STAD), and "viral host response" (COAD / READ). See Table 3 for a summary and Supplemental Table 3 for complete results.

Discussion

Patient outcome or response to therapy is not necessarily well predicted by tumor stage alone (Kirilovsky et al., 2016). As Fridman et al. wrote, "different types of infiltrating immune cells have different effects on tumour progression, which can vary according to cancer type" (Fridman et al., 2012). This idea has been developed further with the creation of the 'Immunoscore', a prognostic based on the presence and density of particular immune cells in the TME context, expanded to include the peripheral margin as well as tumor core. For example, the Immunoscore in colorectal cancer depends on the density of both CD3+ lymphocytes (any T cell) and specifically, CD8+ cytotoxic T cells in the tumor core and invasive margin (Pagès et al., 2018).
The differences in factors that relate to stage and survival are reflected in the current work in the identification of different cell-cell interactions of importance for each.

Previous studies have shown that cellular interactions within the tumor microenvironment have an impact on patient survival, drug response, and tumor growth. Zhou et al. (2017) described alterations in ligand-receptor pair associations in cancer compared to normal tissue, the cell-cell communication structures thereby becoming a generalized phenotype for malignancy. Using the same foundational database of possible interactions as this work, ligand-receptor pair expression correlation was compared between tumor and normal tissue. Their "aggregate analysis revealed that … tumors of most cancer types generally had reduced (ligand-receptor) correlation compared with the normal tissues." The ligand-receptor pairs that commonly showed such differences across the ten tissue types studied included PLAU-ITGA5, LIPH-LPAR2, SEM14G-PLXNB2, SEMABD-TYROBP, CCL2-CCR5, CCL3-CCR5, and CGN-TYROBP.

Like the Zhou et al.
work, we found the collection of associated edges enriched for related biological processes, especially ECM organization and cell adhesion, possibly related to the progression towards dysplasia. For example, in Zhou et al., the ligand-receptor pairs COL11A1-ITGA2, COL7A1-ITGA2, MDK-GPC2 and MMP1-ITGA2 were found to be positively correlated in cancer but not in normal tissue. In the current work, integrins and laminins generally have elevated edge weights in late tumor stage. In the PFI contrasts, except for ESCA, such edges have higher weights in shorter PFIs, corresponding to more severe progression. Regarding SEMA7A, found in the PFI STAD results as a predictive feature, previous findings report that the collagen gene COL1A1 has been associated with metastasis, and SEMA7A is known to play an important role in integrin-mediated signaling and functions both in regulating cell migration and immune responses. Cancers such as esophageal, gastric, and colorectal all show transitions to metaplasia and dysplasia, a process that breaks down the structural order of a tissue, replacing it with disorder and cell transdifferentiation.

In our model, a host response is reflected in a change in S1 score, negative if the edge weight is higher with longer PFI times. In the PFI results, Th1 cells appeared in 13 high scoring edges in SKCM, all with negative S1 values. Also, for SKCM and COAD, ligand producing (pro-inflammatory) M1 macrophage edges are present but show both positive and negative S1 scores. Inflammation cytokines IL1B and IL18 are both present in the results of ESCA and STAD (Figure 9). In the tumor stage contrasts, we see Th2 and NK cells with inflammation cytokines IL1A, IL1B, IL4, TNF in STAD and COAD.
So, while certain inflammatory signatures are observed, the absence of well-known canonical edges such as Th1-IL12-IL12RB1-M1 macrophages may be due to essentially no difference, or undetectable differences, in the quantity of Th1 cells or IL12A expression between PFI groups (4.9 vs 3.3 TCGA Pan-Cancer RSEM for short vs long PFI). These observations point to possible mechanisms of action for immune cells known to be important for cancer immune response, the CD4+ T helper 1 cells and M1 macrophages, in relation to tumor progression.

In tissues susceptible to dysplasia, such as the tissues explored here, unexpected cell types may be detected. For example, the '...disruption of tissue organization appears to trigger a profound change in cellular commitment, which leads to hepatocyte differentiation in the “oval cells” in … the epithelial cells lining the small pancreatic ductules' (Reddy et al., 1991). As another example, pancreatic cancer is known to have desmoplastic stroma, the source of which may include MSCs, which are defined by their ability to differentiate into osteoblasts, chondrocytes, and adipocytes (Mathew et al., 2016). In line with that finding, it has been observed that "...stromal cells isolated from the neoplastic pancreas can differentiate into osteoblasts, chondrocytes, and adipocytes" (Mathew et al., 2016).
It has been reported (Yáñez et al., 2017) that "granulocyte-monocyte progenitors (GMPs) and monocyte-dendritic cell progenitors (MDPs) produce monocytes during homeostasis and in response to increased demand during infection," or as in Weston et al. (2018), "Granulocyte-monocyte progenitor (GMP) cells play a vital role in the immune system by maturing into a variety of white blood cells, including neutrophils and macrophages, depending on exposure to cytokines such as various types of colony stimulating factors (CSF)."

In our results for SKCM and COAD, GMPs had negative S1 statistics, meaning the late-stage cases had edges with higher weights. The GMP cells most often interacted with (as receptor bearing cells) MSCs, melanocytes, both M1 and M2 macrophages, and CD8+ Tem (T effector-memory cells). The presence of GMP related edges may be indicative of the commonly observed 'myeloid dysfunction', which "can promote tumor progression through immune suppression, tissue remodeling, angiogenesis or combinations of these mechanisms" (Messmer et al., 2015). Also, "tumors secrete a variety of factors such as G-CSF that act in a systemic way to reduce IRF-8 within progenitor cells, releasing myelopoiesis from IRF-8 control such that the granulocytic lineage (blue cell) undergoes hyperplasia, leading to increased immature suppressive cells to promote tumor growth." This is in line with our observations.

Megakaryocytes, which derive from multipotent hematopoietic stem cells, typically reside in the bone marrow and produce platelets. Megakaryocytes are also produced in the liver, kidney, and spleen.
Additionally, megakaryocytes have been observed in the lung and circulating blood, where they were useful as a biomarker in prostate cancer. Case reports exist showing megakaryocytes in the metaplasia of gastric cancer patients (Chatelain et al., 2004). Megakaryocytes respond to a variety of cytokines such as IL-3, IL-6, IL-11, CXCL5, CXCL7, and CCL5. A majority of interacting cells are leukocytes. In both esophageal and gastric cancers, "...thrombocytosis has been reported in general to be associated with adverse clinical outcomes" (Voutsadakis, 2014). Additionally, there are reports of 'tumor educated platelets' that can be useful as part of a liquid biopsy (Best et al., 2018; Haemmerle et al., 2018).

Among the rich literature regarding oncological cytokine networks, there is a strong emphasis on the cancer cell as a central actor. Many review articles and much of the research focus on cancer cell interactions in the TME. For example, an overabundance of IL6 or IL10 produced by cancer cells has been associated with poor prognosis (Burkholder et al., 2014; Fisher et al., 2014; Lippitz and Harris, 2016).

However, in this work, the focus has been on the environment rather than on the cancer cell itself. This is largely because, in performing cell deconvolution on gene expression data to determine the presence and quantity of different cell types in a mixed sample, reliable signatures for cancer cells are not readily available. In carcinomas, a cancer cell derives from the epithelium and in many ways remains similar to epithelial cells. Even in single-cell RNA-seq studies, it is often difficult to determine which cells are cancerous, and picking this signature out of a mixed expression dataset remains an open problem.

This work is based upon gene expression, rather than protein expression, cell-surface expression or secretion measurements.
Also, the base expression data is taken from sorted cells, rather than cells in tissue, with an assumption that we cannot get "new/non-scaffold edges" in a tissue/cancer context. However, new data types and methods including scRNA-seq and PIC-seq will provide ways of determining new cell-cell interactions that are context specific (Giladi et al., 2020).

Importantly, the physical and biochemical processes of secretion, binding and activation cannot be identified with the current data and method. However, identifying edge constituents that are more prevalent in particular tumor microenvironments than in others indicates where communication with activation is more likely to take place, as the presence of those constituents is a prerequisite.

With the data and results publicly available in a Google BigQuery table (Supplemental Figure 8), this resource is open to researchers to explore and ask questions. It is a low-cost way (with free options) to achieve compute cluster performance for quickly answering such questions. The table is easily joined to clinical and molecular annotations and can be worked with from R and Python notebooks.
With the addition of resources like GTEx, it should begin to be possible to tease aberrant, cancer-specific interactions apart.

In terms of future work, it could be important to examine communication networks given the immune subtypes of Thorsson et al. (2018) and communication differences between TCGA tumor molecular subtypes. New data types can be applied to enhance the scaffold with knowledge gained from (for example) single-cell RNA-seq.

In this work, we have introduced a method and identified lines of communication between cells that may play a role in disease. These lines include both established/recognized cells in the context of cancer, as well as others that should be explored further with targeted methods.

Acknowledgments

The authors would like to thank Samuel Danziger, David Reiss, Mark McConnell, Andrew Dervan, Matthew Trotter, Douglas Bassett, Robert Hershberg, the Shmulevich Lab and the Institute for Systems Biology for engaging and informative discussions. This study was supported by Celgene, a wholly owned subsidiary of Bristol-Myers Squibb, in part through a Sponsored Research Award to D.L.G., B.A. and I.S., and by the Cancer Research Institute (D.L.G., V.T., I.S.). We thank the ISB-CGC for their ongoing support. ISB-CGC has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Task Order No. 17X148 under Contract No. 75N91019D00024. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

Competing Interests

D.L.G., B.A., V.T. and I.S. declare no competing interests. A.V.R.: Bristol-Myers Squibb: Employment, Equity Ownership.

Author Contributions

D.L.G., B.A., V.T., A.R., I.S. conceived of the idea. D.L.G. developed the method, wrote the code, and performed the computations. D.L.G. wrote the manuscript with contributions from B.A., V.T., A.R., I.S. and A.R. supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Tables

Table 1. Counts of differentially weighted edges compared to the number of samples in each study.

Study  N samples  PFI short/long    PFI DWEs    Selected Feat.  Model Accuracy  GO results?
ESCA   170        73/97             137         36              94.7            y
STAD   409        155/231           142         78              95.1            y
PAAD   178        68/83             8           -               -               y
COAD   281        96/183            63          50              97.1            y
READ   91         16/71             4           -               -               y
SKCM   102        27/75             249         12              91.1            y
LUSC   494        193/285           304         119             98.7            y

Study  N samples  Stage early/late  Stage DWEs  Selected Feat.  Model Accuracy  GO results?
ESCA   170        86/63             0           -               -               -
STAD   409        167/198           241         114             99.7            y
PAAD   178        142/7             0           -               -               -
COAD   281        151/118           1851        84              99.6            y
READ   91         36/44             34          18              97.5            y
SKCM   102        68/29             221         8               99              n
LUSC   494        390/89            0           -               -               -

Study: tissue type, N samples: number of samples used, PFI short/long: number of samples in each group, PFI DWEs: number of differentially weighted edges, Model Accuracy: accuracy of predicting group, GO results?: if yes, significant GO enrichments.

Table 2. Top 5 most predictive edges from XGBoost models.

Contrast  Study  EdgeID   LCell             Ligand   Receptor    RCell                 S1      Median Diff  Information Gain
PFI       COAD   586640   Megakaryocytes    BMP10    ENG         Epithelial cells      0.169   0.082        0.109
PFI       COAD   50871    astrocytes        TNC      ITGA5       mv Endothelial cells  0.168   0.061        0.069
PFI       COAD   406871   Hepatocytes       GDF2     ENG         Epithelial cells      0.168   0.082        0.067
PFI       COAD   49669    astrocytes        EFNB1    EPHB4       Mesangial cells       0.199   0.117        0.066
PFI       COAD   632560   MEP               TIMP2    ITGB1       MEP                   0.167   0.095        0.051
Stage     COAD   406579   Hepatocytes       CGN      TGFBR2      Eosinophils           -0.165  -0.077       0.043
Stage     COAD   330377   Eosinophils       LAMB3    ITGB1       Eosinophils           -0.144  -0.048       0.038
Stage     COAD   616033   Memory B-cells    BMP15    BMPR2       Epithelial cells      -0.150  -0.060       0.037
Stage     COAD   784400   NK cells          TNFSF10  TNFRSF10B   CD4+ memory T-cells   0.137   0.043        0.037
Stage     COAD   630108   MEP               B2M      KIR2DL1     iDC                   0.138   0.055        0.037
PFI       ESCA   457801   Keratinocytes     GS       ADCY7       CD4+ Tcm              0.167   0.078        0.078
PFI       ESCA   182483   CD8+ Tcm          RBP3     NOTCH1      pDC                   -0.184  -0.073       0.071
PFI       ESCA   1051114  Th2 cells         CALM1    GP6         naive B-cells         0.171   0.085        0.070
PFI       ESCA   658080   Mesangial cells   SPP1     CD44        Tregs                 0.184   0.080        0.064
PFI       ESCA   397215   GMP               HMGB1    THBD        MEP                   0.184   0.060        0.059
PFI       LUSC   879775   Plasma cells      VEGFA    ITGB1       GMP                   0.120   0.047        0.041
PFI       LUSC   451902   iDC               VEGFA    ITGB1       Plasma cells          0.137   0.067        0.038
PFI       LUSC   398971   GMP               ADAM17   ITGB1       Plasma cells          0.120   0.059        0.030
PFI       LUSC   340857   Epithelial cells  COL4A6   ITGB1       CD8+ naive T-cells    0.124   0.054        0.026
PFI       LUSC   471558   Keratinocytes     THBS1    ITGA6       Plasma cells          0.120   0.068        0.025
Stage     READ   632552   MEP               TGFB3    TGFBR2      MEP                   -0.267  -0.134       0.127
Stage     READ   795527   NKT               GZMB     PGRMC1      CD4+ memory T-cells   0.343   0.144        0.115
Stage     READ   402754   Hepatocytes       CGN      TGFBR2      CD4+ Tem              -0.274  -0.134       0.108
Stage     READ   808308   NKT               GZMB     IGF2R       Plasma cells          0.261   0.101        0.103
Stage     READ   800747   NKT               IL7      IL2RG       GMP                   0.264   0.136        0.095
PFI       SKCM   1008243  Smooth muscle     SEMA7A   PLXNC1      pro B-cells           0.438   0.259        0.242
PFI       SKCM   517677   Macrophages       UBA52    NOTCH1      Osteoblast            -0.284  -0.145       0.200
PFI       SKCM   80934    Basophils         VIM      CD44        NKT                   -0.383  -0.254       0.103
PFI       SKCM   1007915  Smooth muscle     PSAP     SORT1       Preadipocytes         0.311   0.175        0.082
PFI       SKCM   84049    Basophils         CALM1    PTPRA       Th1 cells             -0.285  -0.151       0.080
Stage     SKCM   275306   CLP               GI2      CXCR1       Osteoblast            0.353   0.176        0.207
Stage     SKCM   399084   GMP               TIMP1    CD63        Plasma cells          -0.302  -0.147       0.206
Stage     SKCM   273727   CLP               GI2      F2R         MEP                   0.290   0.123        0.182
Stage     SKCM   182981   CD8+ Tcm          GI2      TBXA2R      Plasma cells          -0.283  -0.095       0.123
Stage     SKCM   397545   GMP               BST1     CAV1        MSC                   -0.337  -0.194       0.109
PFI       STAD   461765   Keratinocytes     CALM3    KCNQ1       Eosinophils           -0.136  -0.067       0.062
PFI       STAD   644724   Mesangial cells   TGFB2    ACVR1       Erythrocytes          0.149   0.061        0.054
PFI       STAD   105991   CD4+ T-cells      IL1B     IL1R2       Megakaryocytes        0.134   0.081        0.047
PFI       STAD   269013   CLP               ADAM28   ITGA4       CD4+ T-cells          0.145   0.075        0.046
PFI       STAD   343620   Epithelial cells  VCAN     TLR1        CLP                   0.134   0.051        0.033
Stage     STAD   128412   CD4+ Tem          CALM1    KCNQ1       Macrophages           0.140   0.058        0.057
Stage     STAD   43832    astrocytes        FBN1     ITGB6       Epithelial cells      -0.146  -0.058       0.036
Stage     STAD   346120   Epithelial cells  LAMB1    ITGAV       Hepatocytes           -0.139  -0.066       0.035
Stage     STAD   403540   Hepatocytes       SHH      PTCH1       CD8+ T-cells          -0.138  -0.069       0.034
Stage     STAD   648983   Mesangial cells   FGB      ITGAV       Megakaryocytes        -0.140  -0.060       0.031

Contrast: the groupwise test performed, Study: tissue type, EdgeID: BigQuery table lookup ID, LCell: ligand-producing cell, Ligand: ligand gene symbol, Receptor: receptor gene symbol, RCell: receptor-producing cell, S1: between-group S1 statistic, Median Diff: difference in edge weights between groups, Information Gain: XGBoost information gain after adding feature to model.

Table 3. Enriched GO terms.

Tissue       Contrast  Num GOs  ECM                                      Migration                                Immune                               Immune2
SKCM         PFI       34       extracellular structure organization     epithelium cell migration                IFNG signaling                       antigen processing and presentation
ESCA         PFI       3        cell-substrate adhesion                  -                                        -                                    -
STAD         PFI       59       cell-cell adhesion mediated by integrin  cell migration                           regulation of immune system process  IL2
LUSC         PFI       39       extracellular matrix organization        positive regulation of cell migration    -                                    -
COAD / READ  stage     85       ECM                                      regulation of epithelial cell migration  viral host response                  -
STAD         stage     28       ECM / adhesion                           cell migration                           -                                    -

Tissue: TCGA study, Contrast: the groupwise test performed, Num GOs: number of gene ontology terms found significantly enriched, ECM: GO categories involving ECM, Migration: GO terms involving cell migration, Immune: GO terms involving immune response, Immune2: additional GO terms involving immune response.

References

Ahad, N. A., Yahaya, S. S. S., and Yin, L. P. (2016). Robustness of S1 statistic with Hodges-Lehmann for skewed distributions. AIP Conf. Proc.
1782, 050002.
Aran, D., Hu, Z., and Butte, A. J. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220.
Babu, G. J., Padmanabhan, A. R., and Puri, M. L. (1999). Robust one-way ANOVA under possibly non-regular conditions. Biometrical Journal: Journal of Mathematical Methods in Biosciences 41, 321–339.
Behar, M., Barken, D., Werner, S. L., and Hoffmann, A. (2013). The dynamics of signaling as a pharmacological target. Cell 155, 448–461.
Best, M. G., Wesseling, P., and Wurdinger, T. (2018). Tumor-Educated Platelets as a Noninvasive Biomarker Source for Cancer Detection and Progression Monitoring. Cancer Res. 78, 3407–3412.
Burkholder, B., Huang, R.-Y., Burgess, R., Luo, S., Jones, V. S., Zhang, W., et al. (2014). Tumor-induced perturbations of cytokines and immune cell networks. Biochim. Biophys. Acta 1845, 182–201.
Cameron, M. J., and Kelvin, D. J. (2013). Cytokines, Chemokines and Their Receptors. Landes Bioscience.
Cancer Genome Atlas Network (2015). Genomic Classification of Cutaneous Melanoma. Cell 161, 1681–1696.
Chatelain, D., Devendeville, A., Rudelli, A., Bruniau, A., Geslin, G., and Sevestre, H. (2004). Gastric myeloid metaplasia: a case report and review of the literature. Arch. Pathol. Lab. Med. 128, 568–570.
Chen, T., and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD ’16. (New York, NY, USA: ACM), 785–794.
Cohen, D. J., and Nelson, W. J. (2018). Secret handshakes: cell-cell interactions and cellular mimics. Curr. Opin. Cell Biol. 50, 14–19.
Eddy, J. A., Thorsson, V., Lamb, A. E., Gibbs, D. L., Heimann, C., Yu, J. X., et al. (2020).
CRI iAtlas: an interactive portal for immuno-oncology research. F1000Res. 9, 1028.
Efremova, M., Vento-Tormo, M., Teichmann, S. A., and Vento-Tormo, R. (2019). CellPhoneDB v2.0: Inferring cell-cell communication from combined expression of multi-subunit receptor-ligand complexes. bioRxiv. doi:10.1101/680926.
Ferguson, J., Wilcock, D. J., McEntegart, S., Badrock, A. P., Levesque, M., Dummer, R., et al. (2020). Osteoblasts contribute to a protective niche that supports melanoma cell proliferation and survival. Pigment Cell Melanoma Res. 33, 74–85.
Fisher, D. T., Appenheimer, M. M., and Evans, S. S. (2014). The two faces of IL-6 in the tumor microenvironment. Semin. Immunol. 26, 38–47.
Frankenstein, Z., Alon, U., and Cohen, I. R. (2006). The immune-body cytokine network defines a social architecture of cell interactions. Biol. Direct 1, 32.
Fridman, W. H., Pagès, F., Sautès-Fridman, C., and Galon, J. (2012). The immune contexture in human tumours: impact on clinical outcome. Nat. Rev. Cancer 12, 298–306.
Giladi, A., Cohen, M., Medaglia, C., Baran, Y., Li, B., Zada, M., et al. (2020). Dissecting cellular crosstalk by sequencing physically interacting cells. Nat. Biotechnol. 38, 629–637.
Haass, N. K., and Herlyn, M. (2005). Normal human melanocyte homeostasis as a paradigm for understanding melanoma. J. Investig. Dermatol. Symp. Proc. 10, 153–163.
Haemmerle, M., Stone, R. L., Menter, D. G., Afshar-Kharghan, V., and Sood, A. K. (2018). The Platelet Lifeline to Cancer: Challenges and Opportunities. Cancer Cell 33, 965–983.
Heldin, C.-H., Lu, B., Evans, R., and Gutkind, J. S. (2016). Signals and Receptors. Cold Spring Harb. Perspect. Biol. 8, a005900.
Hubert, M., Pison, G., Struyf, A., and Van Aelst, S., eds. (2012). Theory and Applications of Recent Robust Methods. Birkhäuser.
Jin, S., Guerrero-Juarez, C. F., Zhang, L., Chang, I., Myung, P., Plikus, M. V., et al. (2020).
Inference and analysis of cell-cell communication using CellChat. Cold Spring Harbor Laboratory,
2020.07.21.214387. doi:10.1101/2020.07.21.214387.
Kirilovsky, A., Marliot, F., El Sissy, C., Haicheur, N., Galon, J., and Pagès, F. (2016). Rational bases for the use of the Immunoscore in routine clinical settings as a prognostic and predictive biomarker in cancer patients. Int. Immunol. 28, 373–382.
Kumar, M. P., Du, J., Lagoudas, G., Jiao, Y., Sawyer, A., Drummond, D. C., et al. (2018). Analysis of Single-Cell RNA-Seq Identifies Cell-Cell Communication Associated with Tumor Characteristics. Cell Rep. 25, 1458–1468.e4.
Lippitz, B. E., and Harris, R. A. (2016). Cytokine patterns in cancer patients: A review of the correlation between interleukin 6 and prognosis. Oncoimmunology 5, e1093722.
Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173, 400–416.e11.
Mathew, E., Brannon, A. L., Del Vecchio, A., Garcia, P. E., Penny, M. K., Kane, K. T., et al. (2016).
Mesenchymal Stem Cells Promote Pancreatic Tumor Growth by Inducing Alternative Polarization of Macrophages. Neoplasia 18, 142–151.
Messmer, M. N., Netherby, C. S., Banik, D., and Abrams, S. I. (2015). Tumor-induced myeloid dysfunction and its implications for cancer immunotherapy. Cancer Immunol. Immunother. 64, 1–13.
Morel, P. A., Lee, R. E. C., and Faeder, J. R. (2017). Demystifying the cytokine network: Mathematical models point the way. Cytokine 98, 115–123.
Nath, A., and Leier, A. (2020). Improved cytokine-receptor interaction prediction by exploiting the negative sample space. BMC Bioinformatics 21, 493.
Pagès, F., Mlecnik, B., Marliot, F., Bindea, G., Ou, F.-S., Bifulco, C., et al. (2018). International validation of the consensus Immunoscore for the classification of colon cancer: a prognostic and accuracy study. Lancet 391, 2128–2139.
Pomaznoy, M., Ha, B., and Peters, B. (2018). GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinformatics 19, 470.
Ramilowski, J. A., Goldberg, T., Harshbarger, J., Kloppmann, E., Lizio, M., Satagopam, V. P., et al. (2015). A draft network of ligand–receptor-mediated multicellular signalling in human. Nat. Commun. 6, 7866.
Reddy, J. K., Rao, M. S., Yeldandi, A. V., Tan, X. D., and Dwivedi, R. S. (1991). Pancreatic hepatocytes. An in vivo model for cell lineage in pancreas of adult rat. Dig. Dis. Sci. 36, 502–509.
Robertson, A. G., Shih, J., Yau, C., Gibb, E. A., Oba, J., Mungall, K. L., et al. (2017). Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma. Cancer Cell 32, 204–220.e15.
Shao, X., Liao, J., Li, C., Lu, X., Cheng, J., and Fan, X. (2020). CellTalkDB: a manually curated database of ligand-receptor interactions in humans and mice. Brief. Bioinform. doi:10.1093/bib/bbaa269.
Song, D., Yang, D., Powell, C. A., and Wang, X. (2019). Cell-cell communication: old mystery and new opportunity. Cell Biol. Toxicol. 35, 89–93.
Theory and Applications of Recent Robust Methods | Mia Hubert | Springer. Available at: https://www.springer.com/gp/book/9783764370602 [Accessed June 19, 2020].
Thorsson, V., Gibbs, D. L., Brown, S. D., Wolf, D., Bortone, D. S., Ou Yang, T.-H., et al. (2018). The Immune Landscape of Cancer. Immunity 48, 812–830.e14.
Trosko, J. E., and Ruch, R. J. (1998). Cell-cell communication in carcinogenesis. Front. Biosci. 3, d208–36.
Voutsadakis, I. A. (2014). Thrombocytosis as a prognostic marker in gastrointestinal cancers. World J. Gastrointest. Oncol. 6, 34–40.
Wei, C.-J., Xu, X., and Lo, C. W. (2004). Connexins and cell signaling in development and disease. Annu. Rev. Cell Dev. Biol. 20, 811–838.
West, J., and Newton, P. K. (2019). Cellular interactions constrain tumor growth. Proc. Natl. Acad. Sci. U. S. A. 116, 1918–1923.
Weston, B. R., Li, L., and Tyson, J. J. (2018).
Mathematical Analysis of Cytokine-Induced Differentiation of Granulocyte-Monocyte Progenitor Cells. Front. Immunol. 9, 2048.
Wilson, M. R., Close, T. W., and Trosko, J. E. (2000). Cell Population Dynamics (Apoptosis, Mitosis, and Cell–Cell Communication) during Disruption of Homeostasis. Exp. Cell Res. 254, 257–268.
Wolf, B. J., Choi, J. E., and Exley, M. A. (2018). Novel Approaches to Exploiting Invariant NKT Cells in Cancer Immunotherapy. Front. Immunol. 9, 384.
Yahaya, S. S. S., Othman, A. R., and Keselman, H. J. (2004). Testing the Equality of Location Parameters for Skewed Distributions Using S1 with High Breakdown Robust Scale Estimators. in Theory and Applications of Recent Robust Methods (Birkhäuser Basel), 319–328.
Yáñez, A., Coetzee, S. G., Olsson, A., Muench, D. E., Berman, B. P., Hazelett, D. J., et al. (2017). Granulocyte-Monocyte Progenitors and Monocyte-Dendritic Cell Progenitors Independently Produce Functionally Distinct Monocytes. Immunity 47, 890–902.e4.
Zhang, J., Hou, L., Zhao, D., Pan, M., Wang, Z., Hu, H., et al. (2017). Inhibitory effect and mechanism of mesenchymal stem cells on melanoma cells. Clin. Transl. Oncol. 19, 1358–1374.
Zhou, J. X., Taramelli, R., Pedrini, E., Knijnenburg, T., and Huang, S. (2017). Extracting Intercellular Signaling Network of Cancer Tissues using Ligand-Receptor Expression Patterns from Whole-tumor and Single-cell Transcriptomes. Sci. Rep. 7, 8815.
Figure Legends
Figure 1. Overview of workflow showing the transition from data sources to results.
Figure 2. Illustration of the probabilistic model and edge weight computations. (A) For a given cell-cell communication edge, (B) per-patient values are used to 'look up' probabilities from the distributions learned from all TCGA data. Those probabilities are then used to compute an edge weight.
Figure 3. Diagram of how differentially weighted edges were determined. Three samples of edge weights were taken from the pool by tissue source. Then, matching the sample proportions in the clinical features, permutations were sampled and used for computing randomized S1 statistics. Each sample was used to produce 1 million permuted statistics, and taken together, the millionth percentile was used as a cutoff in determining important edges.
Figure 4. Top edges (by S1 scores) that can distinguish tissue types. Each point represents a tumor sample and each panel represents one edge. (A) EdgeID 605551, Melanocytes-MIA-CDH19-Melanocyte, SKCM red, UVM blue, BRCA purple, PAAD orange. (B) EdgeID 687457, MSC-TFPI-F3-MSC, PAAD red. (C) EdgeID 968128, Sebocytes-WNT5A-FZD6-Sebocytes, LUSC red, LUAD blue, HNSC purple. (D) EdgeID 1049823, Th2 cells-IL4-IL2RG-Megakaryocytes, STAD red, READ blue, COAD purple, ESCA orange.
Figure 5. (A) Median values for each differentially weighted cell-cell edge (DWE) for the PFI categories (PFI categories in rows, DWEs in columns). (B) Examples of differentially weighted edges.
Figure 6. Edge member dominance in DWEs shown by log10 counts of cell types.
Figure 7. High probability edges (DWEs) from PFI contrasts form predictive connected subnetworks. Color indicates the magnitude and direction of S1 statistics (+ / -).
Figure 8. Informative edges selected by XGBoost models for prediction within study. Color indicates information gain.
Figure 9. Cell-cell interaction diagram demonstrating complexity in communication with three cell types that produce the IL1B ligand that have two possible binding partners on the same receptor-bearing cell. Edge weight violin plots are shown for two STAD PFI groups, short (left) and long (right) PFI.
10_1101-2021_02_09_430036 ---- A comparative study of genomic adaptations to low nitrogen availability in Genlisea aurea Thibaut Goldsborough (University of St Andrews, tg76@st-andrews.ac.uk) Abstract: Genlisea aurea is a carnivorous plant that grows on nitrogen-poor waterlogged sandstone plateaus and is thought to have evolved carnivory as an adaptation to very low nitrogen levels in its habitat.
The carnivorous plant is also unusual for having one of the smallest genomes among flowering plants. Genomic DNA is known to have a high nitrogen content and yet, to the author's knowledge, no published study has linked nitrogen starvation of G. aurea with genome size reduction. This comparative study of the carnivorous plant G. aurea, the model organism Arabidopsis thaliana (Brassicaceae) and the nitrogen-fixing Trifolium pratense (Fabaceae) investigates whether the genome, transcriptome and proteome of G. aurea show evidence of adaptations to low nitrogen availability. It was found that although G. aurea's genome, CDS and non-coding DNA contain far fewer nitrogen atoms than those of T. pratense and A. thaliana, this was solely due to the length of the genome, CDS and non-coding sequences rather than the composition of these sequences.
Introduction: Genlisea aurea (Lentibulariaceae) is a carnivorous plant found in Brazil that grows on waterlogged sandstone plateaus. It is thought to have evolved carnivory as an adaptation to very low nitrogen levels in its habitat (Müller K. et al. 2004). G. aurea is also unusual for having one of the smallest genomes among flowering plants, with a genome length of just 43.4 Mb, resulting from a process called genome reduction in which intergenic regions and duplicated genes are removed (Leushkin, E.V. et al. 2013). The tiny genome of another carnivorous plant, Utricularia gibba (Lentibulariaceae), shows that G. aurea is not the only carnivorous plant growing in nitrogen-poor habitats to have undergone genome size reduction (Ibarra-Laclette, E. et al. 2013). Despite DNA having a high nitrogen content, to the author's knowledge, no published study has linked nitrogen starvation of G. aurea with genome size reduction. This project investigates whether the genome, transcriptome and proteome of G. aurea show evidence of adaptations to low nitrogen availability. In this study, the genome of G.
aurea is compared to the model organism Arabidopsis thaliana (Brassicaceae) and to the nitrogen-fixing Trifolium pratense (Fabaceae). T. pratense is known to fix nitrogen gas (N2) from the atmosphere with the help of nitrogen-fixing bacteria found in its roots (Davey A.G. et al. 1989). For this reason, T. pratense was taken as a control for a plant that is not nitrogen deprived. Reduction of nitrogen usage in proteomes has already been recorded when comparing plant proteins and animal proteins. Plants are generally regarded as nitrogen limited in comparison with animals, and one study found a 7.1% reduction in nitrogen use in amino acid side chains of plant proteins compared to animal proteins (Acquisti, C., Kumar, S. and Elser, J.J. 2009). Another study found that parasitic microorganisms showed altered codon usage and genome composition as a response to nitrogen limitations (Seward, E.A. and Kelly, S. 2016).
bioRxiv preprint, https://doi.org/10.1101/2021.02.09.430036; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
Methods: All genomic data were obtained from the NCBI genome database GenBank (available at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant). For all three species, the genome FASTA file, the genome annotation GFF file and the complete CDS were obtained.
1) Determination of genomic nitrogen content
The genomic nitrogen content of each species was calculated by counting the number of occurrences of each nucleotide and multiplying by the corresponding number of nitrogen atoms using a Python script and the genome FASTA file. While a guanine-cytosine pair has eight nitrogen atoms, an adenine-thymine pair only has seven.
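The counting step described above can be sketched in Python as follows. This is a minimal illustration and not the author's script: the function names, the toy sequence and the decision to skip ambiguous bases such as N are assumptions made here for the example.

```python
from collections import Counter

# Nitrogen atoms per base pair, keyed by the base seen on the reference strand.
# An A or T implies an adenine-thymine pair (7 N); a G or C implies a
# guanine-cytosine pair (8 N), as stated in the text.
NITROGEN_PER_BASE_PAIR = {"A": 7, "T": 7, "G": 8, "C": 8}


def read_fasta_sequences(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Upper-case so soft-masked (lower-case) bases are counted too.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def genome_nitrogen(lines):
    """Return (total nitrogen atoms, number of counted base pairs)."""
    counts = Counter()
    for _, sequence in read_fasta_sequences(lines):
        counts.update(sequence)
    # Assumption: ambiguous symbols (e.g. N) are ignored rather than guessed.
    total = sum(n * counts[base] for base, n in NITROGEN_PER_BASE_PAIR.items())
    n_pairs = sum(counts[base] for base in NITROGEN_PER_BASE_PAIR)
    return total, n_pairs


# Toy input (hypothetical, not taken from any of the three genomes).
example = [">toy_contig", "ATGCGC", "NNAT"]
total, n_pairs = genome_nitrogen(example)
print(total, n_pairs, total / n_pairs)
```

For a real genome, the same function could be called on an open file handle, for instance `genome_nitrogen(open("genome.fna"))`, where the file name is a placeholder.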
The nitrogen content of the entire genome was determined first, then the nitrogen content of the CDS and non-CDS regions was also calculated using the CDS file.
2) Determination of transcriptomic nitrogen content and codon usage bias
The nitrogen content of all the pre-mRNA was determined by transcribing the regions annotated by 'gene' in the GFF file using the Biopython (Bio.Seq) library. Adenine has 5 nitrogen atoms, uracil has 2, guanine has 5 and cytosine has 3. The nitrogen content of the introns and exons was also determined separately. Finally, a codon usage table was obtained to examine preferential codon usage.
3) Determination of proteome nitrogen content
The nitrogen content of all the proteins encoded by the CDS regions of each species was determined by counting the occurrences of each amino acid and multiplying by the corresponding number of nitrogen atoms. The CDS sequences were converted to amino acid sequences using the Biopython library.
4) Determination of transfer-RNA nitrogen content and usage
tRNA genes were identified in each genome using the tRNAscan-SE software by Lowe, T.M. and Chan, P.P. (2016). tRNAscan-SE uses an advanced methodology for tRNA gene detection and functional prediction (determination of tRNA anticodon). Combining the results from tRNAscan-SE and a Python script, the nitrogen content of the tRNAs was determined. Using the codon usage tables obtained in part 2), it was possible to link codon biases with the corresponding tRNA nitrogen content. The aim was to examine whether tRNAs that had low nitrogen content had their corresponding codons more frequently represented than codons that were associated with higher nitrogen content tRNAs (among codons that code for the same amino acid).
Results and Discussion: Investigating the nitrogen content of the genome of the three species reveals that G.
aurea has a considerably lower number of nitrogen atoms in its genome than the two other plant species. A comparison with T. pratense shows the vast difference in nitrogen content can mostly be explained by the reduction of the number of nitrogen atoms in the non-coding DNA sequences of G. aurea (Fig. 1). Although T. pratense has 3 times more nitrogen in its CDS than G. aurea, the carnivorous plant has 10 times less nitrogen in its non-coding DNA sequences. At first glance, this observation supports the theory that genome reduction of G. aurea was motivated by nitrogen starvation. However, it might not come as a surprise to the reader that a vast reduction in genome size is accompanied by a vast reduction in genomic nitrogen content. This observation alone does not explain whether G. aurea has preferential usage of nitrogen-poor nucleotides (A-T base pairs).
Figure 1: Number of nitrogen atoms in the entire genomic DNA, CDS and non-coding DNA of G. aurea (red), A. thaliana (blue) and T. pratense (green).
In this report, the term molecular unit refers to a DNA base-pair, an RNA nucleotide or a protein amino acid. Relative nitrogen content refers to the average number of nitrogen atoms per molecular unit. Upon examination of the relative nitrogen content of DNA, RNA and protein of the three plant species, an unexpected pattern emerges (Fig. 2). The nitrogen-starved carnivorous plant has higher nitrogen counts per molecular unit in genomic DNA, CDS, Non-Coding DNA, protein, mRNA, exons and introns.
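As a worked illustration of the relative nitrogen content defined above, the per-unit averages follow directly from the nitrogen counts given in the Methods (8 per G-C pair, 7 per A-T pair; 5, 2, 5 and 3 for A, U, G and C in RNA). The symbols below are notation introduced here for the example, not the author's: $n_{GC}$ and $n_{AT}$ are the numbers of G-C and A-T base pairs, and $f_X$ is the fraction of units of type $X$.

```latex
\bar{n}_{\mathrm{DNA}}
  = \frac{8\,n_{GC} + 7\,n_{AT}}{n_{GC} + n_{AT}}
  = 7 + f_{GC},
\qquad
f_{GC} = \frac{n_{GC}}{n_{GC} + n_{AT}}

\bar{n}_{\mathrm{RNA}} = 5 f_{A} + 2 f_{U} + 5 f_{G} + 3 f_{C},
\qquad
f_{A} + f_{U} + f_{G} + f_{C} = 1
```

Under these definitions, the relative nitrogen content of double-stranded DNA is simply 7 plus the GC fraction, so a value of 7.35 nitrogen atoms per base pair, for example, corresponds to a GC content of 35%. Comparing species on this measure is therefore equivalent to comparing their GC content.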
The data in Fig. 2 do not support the hypothesis that nitrogen starvation has caused preferential usage of molecular units that are lower in nitrogen. Inter-species variations aside, CDS DNA was found to be higher in nitrogen than non-coding DNA, and similarly exons were found to be higher in nitrogen than introns. Interestingly, in all 7 plots, A. thaliana, which has an intermediate genome length compared to G. aurea and T. pratense, was also found to have an intermediate nitrogen usage.
[Figure 1 panels: Genomic DNA, CDS, Non-Coding DNA. x-axis: G. aurea, A. thaliana, T. pratense. y-axis: nitrogen atoms, 0 to 2.0e+09.]
The first explanation of why G. aurea has a higher nitrogen usage in its DNA, RNA and proteins could be that there is not enough selective pressure on each molecular unit, because each molecular change saves only a small number of nitrogen atoms. For example, a single substitution of a GC base-pair to an AT base-pair only lowers nitrogen usage by one nitrogen atom. In practice, it may be easier to remove whole non-coding or repeated DNA sequences to optimize nitrogen usage. Some RNA transcripts may only be expressed for very short periods of time in the plant's life cycle, further reducing selective pressure on nitrogen optimization in these transcripts.
However, this cannot explain why longer genomes are associated with lower nitrogen usage per molecular unit, at least for the three species considered in this project. It is possible that the tiny genome of G. aurea, combined with additional nitrogen captured from carnivory, gives the species more leeway in using nitrogen-rich amino acids and nucleotides. A final hypothesis is that G. aurea uses its transcriptome and proteome as a nitrogen bank. The high nitrogen content and the ubiquitous recycling of RNA and proteins in cells could make nitrogen storage in proteomes and transcriptomes possible.
Figure 2: Average number of nitrogen atoms per molecular unit in genomic DNA, CDS, Non-Coding DNA, protein, mRNA, exons, introns and tRNA of G. aurea (red), A. thaliana (blue) and T. pratense (green). Error bars correspond to 95% confidence intervals. In this figure, mRNA refers to the pre-mRNA that has not undergone removal of introns by splicing. Three different scales have been set for DNA sequences (A-C), protein sequences (D) and RNA sequences (E-H).
[Figure 2 panels: (A-C) Genomic DNA, CDS, Non-Coding DNA, y-axis: nitrogen atoms per base pair, 7.30 to 7.50; (D) Protein, y-axis: nitrogen atoms per amino acid, 1.360 to 1.370; (E-H) mRNA, Exons, Introns, tRNA, y-axis: nitrogen atoms per nucleotide, 3.65 to 3.80. x-axis in every panel: G. aurea, A. thaliana, T. pratense.]
Interestingly, the relative nitrogen content of tRNA may be lower in G. aurea than in the two other species (Fig. 2, plot H). RNA sequencing data from Westermann, A., Gorski, S. and Vogel, J. (2012) show that in eukaryotic cells there is about 3 times more tRNA than mRNA (by weight). The paper also states that cells contain almost 5 times more rRNA than tRNA; however, due to time constraints and the generally poorly annotated rRNA genes, rRNA nitrogen content was not determined. The fact that G. aurea had lower nitrogen content in tRNA sequences but not in other types of RNA or DNA sequences supports the hypothesis that there is not enough selective pressure on each molecular unit of DNA and mRNA to motivate nucleotide substitutions.
Figure 3: Bar graph representing the codon usage bias and tRNA nitrogen content in G. aurea. For each amino acid, the codon usage bias was determined and the relative proportion of each codon is represented. Codons that are complementary to tRNAs that are low in nitrogen are lighter in colour than codons complementary to tRNAs that are rich in nitrogen. When no tRNA sequences were found, the codon is represented in grey. When multiple tRNA sequences were found for a single codon, the average nitrogen content of the tRNAs is represented. The colour scale bar ranges from 220 N atoms (pure white) to 320 N atoms (pure red).
Note that tryptophan (W) was removed for aesthetic reasons; no tRNA gene was found by tRNAscan-SE for tryptophan (grey).
[Figure 3: axis 0 to 100; colour scale 220 N to 320 N; grey: no data.]
Studying the codon usage bias of G. aurea (Figure 3) revealed that the carnivorous plant uses the entire genetic code. This preliminary attempt to link codon usage bias with tRNA nitrogen content cannot establish whether codons complementary to nitrogen-rich tRNAs are less represented than codons complementary to nitrogen-poor tRNAs. For the majority of codons, multiple tRNA sequences were found to be complementary to that codon; taking the mean of the nitrogen content of these sequences assumes that all the tRNAs are equally expressed in G. aurea. This is extremely unlikely, so sequencing the tRNA transcriptome of G. aurea is the only way to make these data more accurate.
Conclusion: This comparative study of the carnivorous plant Genlisea aurea, the model organism Arabidopsis thaliana (Brassicaceae) and the nitrogen-fixing Trifolium pratense (Fabaceae) attempted to investigate whether the genome, transcriptome and proteome of G. aurea showed evidence of adaptations to low nitrogen availability. It was found that although G. aurea's genome, CDS and non-coding DNA contain far fewer nitrogen atoms than those of T. pratense and A. thaliana, this was solely due to the length of the genome, CDS and non-coding sequences rather than the composition of these sequences.
In fact, in the genomic DNA, CDS, non-coding DNA, mRNA, exons, introns and proteins of G. aurea, the relative nitrogen content was found to be greater than in the two other species, suggesting that nitrogen starvation might not put enough selective pressure on each molecular unit to motivate nucleotide substitutions. It was found that in tRNA sequences, which are about 3 times more abundant than mRNA in eukaryotes (by weight), G. aurea may have lower relative nitrogen. Finally, an attempt to link codon usage bias with the nitrogen content of complementary tRNAs proved inconclusive, possibly because multiple tRNAs can be complementary to a single codon. Future studies should determine the relative nitrogen content of ribosomal RNAs and perform transcriptome sequencing to determine the nitrogen content of the three species' transcriptomes.
References:
Acquisti, C., Kumar, S. and Elser, J.J. (2009) Signatures of nitrogen limitation in the elemental composition of the proteins involved in the metabolic apparatus. Royal Society, Biological Sciences 276, 2605–10.
Carlsson, G. and Huss-Danell, K. (2003) Nitrogen fixation in perennial forage legumes in the field. Plant and Soil 253, 353–372. https://doi.org/10.1023/A:1024847017371
Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G. et al. (2013) Architecture and evolution of a minute plant genome. Nature 498, 94–98. https://doi.org/10.1038/nature12132
Leushkin, E.V., Sutormin, R.A., Nabieva, E.R. et al. (2013) The miniature genome of a carnivorous plant Genlisea aurea contains a low number of genes and short non-coding sequences. BMC Genomics 14, 476. https://doi.org/10.1186/1471-2164-14-476
Lowe, T.M. and Chan, P.P. (2016) tRNAscan-SE On-line: Search and Contextual Analysis of Transfer RNA Genes. Nucleic Acids Research 44, W54–57.
Müller, K., Borsch, T., Legendre, L., Porembski, S., Theisen, I. and Barthlott, W. (2004) Evolution of carnivory in Lentibulariaceae and the Lamiales. Plant Biology 6(4), 477–90. doi: 10.1055/s-2004-817909. PMID: 15248131.
Seward, E.A. and Kelly, S. (2016) Dietary nitrogen alters codon bias and genome composition in parasitic microorganisms. Genome Biology 17, 226. https://doi.org/10.1186/s13059-016-1087-9
Westermann, A., Gorski, S. and Vogel, J. (2012) Dual RNA-seq of pathogen and host. Nature Reviews Microbiology 10, 618–630. https://doi.org/10.1038/nrmicro2852
10_1101-2021_02_09_430363 ---- Accommodating site variation in neuroimaging data using hierarchical and Bayesian models
A PREPRINT
Johanna M. M. Bayer, Orygen Centre for Youth Mental Health, Melbourne, Australia; The University of Melbourne, Melbourne, Australia; bayerj@student.unimelb.edu.au
Richard Dinga, Donders Institute, Radboud University, Nijmegen, the Netherlands; Radboud University Medical Centre, Nijmegen, the Netherlands
Seyed Mostafa Kia, Donders Institute, Radboud University, Nijmegen, the Netherlands; Radboud University Medical Centre, Nijmegen, the Netherlands
Akhil R.
Kottaram, Orygen Centre for Youth Mental Health, Melbourne, Australia
Thomas Wolfers, Radboud University Medical Centre, Nijmegen, the Netherlands; Department of Psychology, University of Oslo, Norway
Jinglei Lv, School of Biomedical Engineering, Brain and Mind Center, University of Sydney, Sydney, Australia
Andrew Zalesky, Melbourne Neuropsychiatry Centre, The University of Melbourne and Melbourne Health, Melbourne, Australia; Department of Biomedical Engineering, The University of Melbourne, Australia
Lianne Schmaal ∗, Orygen Centre for Youth Mental Health, Melbourne, Australia; The University of Melbourne, Melbourne, Australia
Andre Marquand ∗, Donders Institute, Radboud University, Nijmegen, the Netherlands; Radboud University Medical Centre, Nijmegen, the Netherlands; Institute of Psychiatry, King's College London, London, UK
February 9, 2021
∗ shared last author
bioRxiv preprint, https://doi.org/10.1101/2021.02.09.430363; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
ABSTRACT
The potential of normative modeling to make individualized predictions has led to structural neuroimaging results that go beyond the case-control approach. However, site effects, often confounded with variables of interest in a complex manner, induce a bias in estimates of normative models, which has impeded the application of normative models to large multi-site neuroimaging data sets. In this study, we suggest accommodating these site effects by including them as random effects in a hierarchical Bayesian model. We compare the performance of a linear and a non-linear hierarchical Bayesian model in modeling the effect of age on cortical thickness.
We used data of 570 healthy individuals from the ABIDE (autism brain imaging data exchange, http://preprocessed-connectomes-project.org/abide/) data set in our experiments. We compare the proposed method to several harmonization techniques commonly used to deal with additive and multiplicative site effects, including regressing out site and harmonizing for site with ComBat, both with and without explicitly preserving variance related to age and sex as biological variation of interest. In addition, we make predictions from raw data, in which site has not been accommodated. The proposed hierarchical Bayesian method shows the best performance according to multiple metrics. Performance is particularly poor for the regression model and the ComBat model when age and sex are not explicitly modeled. In addition, the predictions of those models are noticeably poorly calibrated, suffering from a loss of more than 90% of the original variance. From these results we conclude that harmonization techniques like regressing out site and ComBat do not sufficiently accommodate multi-site effects in pooled neuroimaging data sets. Our results show that the complex interaction between site and variables of interest is likely to be underestimated by those tools. One consequence is that harmonization techniques removed too much variance, which is undesirable and may have unpredictable consequences for subsequent analysis. Our results also show that this can be mostly avoided by explicitly modeling site as part of a hierarchical Bayesian model. We discuss the potential of z-scores derived from normative models to be used as site-corrected variables and of our method as a site correction tool.
Keywords: neuroimaging · normative modeling · site effects · Hierarchical Bayesian Modeling
1 Introduction
The most prominent paradigm in clinical neuroimaging research has long been the case-control approach, which compares averages of groups of individuals on brain imaging measures.
Case-control inferences can be clinically meaningful under some circumstances when the group mean is a good representation of each individual in the group. However, this pre-condition has been challenged recently by work demonstrating that the biological heterogeneity within clinical groups can be substantial [Marquand et al., 2016]. For example, the structure and morphology of the brain have been found to vary between individuals in dynamic phases like adolescence [Foulkes and Blakemore, 2018] and within clinical groups, such as bipolar disorder and schizophrenia [Wolfers et al., 2018a] and attention deficit disorder [Wolfers et al., 2019]. In addition, inter-individual differences have been shown not necessarily to be in line with results obtained via the group comparison approach [Wolfers et al., 2019]. Such heterogeneity has been considered a potential cause for the lack of differences between clinical groups and controls within the standard group comparison approach [Feczko et al., 2019] and the failure to replicate findings between studies [Fried, 2017]. As a consequence, there has been a shift in focus towards taking into account variation at the individual level [Marquand et al., 2019]. This is in line with a trend towards personalized medicine or "precision medicine" [Mirnezami et al., 2012], where characteristics of the individual are used to guide the treatment of mental disorders. This shift has been accompanied by a trend towards approaches that go beyond comparing averages of distinctly labeled groups [Insel et al., 2010, Insel, 2014] (for an overview of methods see [Marquand et al., 2016]). Among them, normative modeling has been successfully used to capture inter-individual variability and make predictions at the individual level.
The strength of normative modeling lies in its ability to map variation along one dimension (e.g., brain volume) onto a second co-varying variable (e.g., age), redefining the variation in the first dimension as explained by this new covariate of interest. This concept makes it possible to describe the normative variation, i.e. the range containing e.g. 95 % of all individuals, as a function of the covariate, and to consider each individual’s score in relation to the variation in the reference group defined by the covariate score. The concept is similar to the use of growth charts in pediatric medicine, in which height and weight are expressed as a function of age. Hence, in this setting, an individual’s height or weight is not considered by its absolute value, but expressed as a percentile score of deviation fluctuating with age, with the median line corresponding to the 50th percentile and defining the norm, or average height.
1 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint http://preprocessed-connectomes-project.org/abide/ https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021
In neuroimaging, normative models have been applied to clinical and non-clinical problems using various covariates, statistical modeling approaches (for an overview see [Marquand et al., 2019]) and targeting a variety of response variables. In general, any variable can be used as a covariate in a normative model targeting neuroimaging measures, as long as the variation along the co-varying dimension is not zero. However, normative models with age and sex as covariates and brain volume as response variable are currently more frequently found in the literature [Wolfers et al., 2020, Wolfers et al., 2018b, Zabihi et al., 2019, Kessler et al., 2016].
These implement the growth charting idea applied to high dimensional brain imaging data. For example, a normative model of a brain structure can be created based on the variation of individuals in population-based cohorts. The estimated norm can be used to infer where individuals with clinical symptoms can be placed with respect to the reference defined by the normative model. This has been the recipe of many recently published studies using the normative modeling framework [Wolfers et al., 2020, Bethlehem et al., 2018, Wolfers et al., 2019, Lv et al., 2020]. Underlying this approach is the assumption that the individually derived patterns of deviation uncover associations to clinical/behavioral variables that would be obscured by averaging across groups of individuals. However, the amount of data necessary to create normative models poses a challenge to normative modeling in neuroimaging, as the cost and time associated with neuroimaging data impede the collection of large neuroimaging samples in a harmonized way. One exceptional example where large scale data collection succeeded and included both harmonized scanners and scanning protocols is the UK Biobank initiative, which, when launched in 2006, aimed to scan 100,000 individuals at four different scanning locations [https://www.ukbiobank.ac.uk/explore-your-participation/contribute-further/imaging-study][Miller et al., 2016]. Other neuroimaging initiatives have also taken on the challenge to collect neuroimaging data in large scale quantities and have relied on harmonized scanning protocols, but did not collect the data using harmonized scanners (e.g., ADNI [Mueller et al., 2005], ABCD study [Volkow et al., 2018]). Nonetheless, the restricted age ranges (e.g., 40-69 years in UK Biobank [Miller et al., 2016]), or focus on a particular (clinical) cohort (e.g.
Alzheimer’s in ADNI, [Mueller et al., 2005]) limit their utility for estimating normative models mapping the normative association between, for example, age and brain structure or function. An alternative way to obtain large neuroimaging data sets and assess data from a large number of subjects is by pooling or sharing data that has already been collected. One example is the Enhancing NeuroImaging and Genetics through Meta-Analysis (ENIGMA) consortium [Thompson et al., 2020]. ENIGMA succeeded in pooling neuroimaging and genetics data of thousands of individuals, including healthy individuals and individuals with psychiatric or neurological disorders. The strategy of data sharing initiatives like ENIGMA is to gather already collected data from different cohorts and different scanning sites and harmonize preprocessing and statistical analysis with standardized protocols. However, a major disadvantage is the presence of confounding "scanner effects" [Fortin et al., 2018] (e.g., differences in field strength, scanner manufacturer etc. [Han et al., 2006]). These confounding effects present as site-correlated biases that cannot be explained by biological heterogeneity between samples. An example of those effects on derived measures of cortical thickness can be found in Fig. 1a. They result from a complex interaction between site and variables of interest, manifesting in biases on lower and higher order properties of the distribution of interest, such as differences in mean and standard deviations, skewness and spatial biases (Fig. 1a, 1b), and cannot be explained by e.g., differences in age or sex (Fig. 1c). As the origin of these effects might not only be related to the scanner per se, but extend to various factors related to a single acquisition site [Gronenschild et al., 2012], we will refer to them as site effects from here on.
As outlined in the previous paragraph, the effort to create large samples to capture between-subject variability often induces site-driven variability. This issue of site-driven variability in shared neuroimaging data has been acknowledged and has led to the development of harmonization methods at a statistical level. A common approach to deal with site effects is through "harmonizing" by, e.g., confound regression. One example of this approach is a set of algorithms summarized under the name "ComBat" [Fortin et al., 2017]. The method had originally been developed by [Johnson et al., 2007], who used empirical Bayes to estimate "batch effects", referring to non-biological variation added due to the handling of petri-dishes in micro-array experiments on the results of gene expression data. Fortin and colleagues adapted the framework to apply to neuroimaging data [Fortin et al., 2017]. In ComBat, additive and multiplicative site effects on a particular target unit (e.g., a particular brain voxel for one participant) are estimated using empirical Bayes and by placing a prior distribution over estimates for these units. The estimate of the scanner effect is then used to adjust the prediction. Newer versions also allow preserving variance of interest in the model, for example for age, sex or diagnosis [Fortin et al., 2017, Fortin et al., 2018]. ComBat has been applied to several types of neuroimaging data, including diffusion tensor imaging data (DTI, [Fortin et al., 2017]) and structural magnetic resonance imaging data [Fortin et al., 2018]. However, the reliability of harmonization strategies is grounded on the condition that site effects are orthogonal to the effect of interest and uncorrelated with other covariates in the model [Chen et al., 2014]. In reality, however, data pooled from several sites is often confounded with co-linear effects. Many individual neuroimaging samples, for example, are restricted to a specific age range, leading to age being correlated with site effects. In this scenario, removing an estimate of the scanner effect can lead to excluding (biological) variation that would be of interest.
With this paper we suggest an alternative approach to deal with site effects in neuroimaging, which is fairly generally applicable. However, here we focus in particular on normative modeling. We propose a hierarchical Bayesian approach in which we include site as a covariate in the model, avoiding the exclusion of meaningful variance correlated with site by predicting site effects as part of the model instead of removing them from the data. This approach is similar to the approach by [Kia et al., 2020], who used hierarchical Bayesian regression (HBR) in a similar way for multi-site modeling in a pooled neuroimaging data set, which contained 7499 participants that were scanned with 33 different scanners. [Kia et al., 2020]’s estimate of site variation is based on a partial pooling approach, in which the variation between site-specific parameters is bound by a shared prior. The approach showed better performance when evaluated with respect to metrics accounting for the quality of the predictive mean and variance compared to a complete pooling of site parameters and to ComBat harmonization, and similar performance to a no-pooling approach, with the benefit of reduced risk of over-fitting due to the shared site variance.
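The partial pooling idea described above can be illustrated with a deliberately simplified sketch. It is not the model of [Kia et al., 2020] or of this paper: it assumes Gaussian data, treats the noise variance `sigma2` and the between-site variance `tau2` as known, and uses the grand mean as the shared prior mean, whereas a hierarchical Bayesian model estimates these quantities. The function name and arguments are hypothetical.

```python
import numpy as np

def partial_pooling_site_means(y, site, sigma2, tau2):
    """Shrink each site's mean toward the grand mean (illustrative only).

    sigma2: assumed known within-site noise variance.
    tau2:   assumed known variance of site means around the shared mean.
    """
    y = np.asarray(y, dtype=float)
    site = np.asarray(site)
    grand_mean = y.mean()  # stand-in for the shared prior mean
    estimates = {}
    for s in np.unique(site):
        ys = y[site == s]
        n = ys.size
        # Precision-weighted compromise between no pooling (site mean)
        # and complete pooling (grand mean).
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
        key = s.item() if hasattr(s, "item") else s
        estimates[key] = w * ys.mean() + (1.0 - w) * grand_mean
    return estimates

# Usage: a site with few observations is pulled more strongly to the grand mean.
est = partial_pooling_site_means([1, 1, 1, 1, 3], [0, 0, 0, 0, 1], sigma2=1.0, tau2=1.0)
```

In this toy example the single observation from the second site is shrunk substantially toward the shared mean, while the better-sampled first site barely moves, which is the behavior that reduces over-fitting for small sites.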
Moreover, [Kia et al., 2020] showed that the posterior distribution of site parameters from the training set can also be used as an informed prior to make predictions in an unseen, new test set, outperforming predictions from complete pooling and uninformed priors, and overcoming a weakness of ComBat. The method was also able to display heterogeneity between individuals with varying clinical diagnoses in associated brain regions of 1017 clinical patients of the study. The present paper is a replication and extension of the approach by [Kia et al., 2020]. Based on several successful attempts at using Gaussian Process Regression to map non-linearity in normative models [Kia and Marquand, 2018, Marquand et al., 2016, Marquand et al., 2014], we extend the normative model with the capacity to account for site effects by adding a Gaussian process to model non-linear effects between age and the brain structure. In addition, our model is fully Bayesian and entails a hierarchical structure, including priors and hyper priors for each parameter. We use data from the ABIDE (autism brain imaging data exchange, http://preprocessed-connectomes-project.org/abide/) data set to compare a non-linear, Gaussian version of the model to a linear hierarchical Bayesian version accounting for site effects that does not include the Gaussian Process term. We show that the hierarchical Bayesian models including a site parameter perform better than existing methods for dealing with additive and multiplicative site effects, including ComBat and regressing out site. We discuss the normative hierarchical Bayesian methods with regard to their implications for neuroimaging data-sharing initiatives and their use as a general technique to correct for site effects.
2 Methods

In this section we will introduce the data used in this study and the pre-processing steps applied, followed by a conceptual and mathematical description of our approach to include site as predictor in a normative hierarchical Bayesian model. We will also describe the other methods for handling site effects (other than including site as predictor) against which we validate our approach. Lastly, we will outline which measures will be used for model comparison.

2.1 Data

The following sub-section aims to give a description of the ABIDE data set, including a study on the scope of site effects in the data.

2.1.1 ABIDE data set

The ABIDE consortium (http://preprocessed-connectomes-project.org/abide/) was founded to facilitate research and collaboration on autism spectrum disorders by data aggregation and sharing. The consortium provides a publicly available structural magnetic resonance imaging (MRI) data set and corresponding phenotypic information of 539 individuals with autism spectrum disorder and 573 age-matched typical controls. For this study, only data from healthy individuals were included. As those healthy controls are meant to be complementary to the autism branch in the data set, 403 out of 539 subjects in this study were male. The data was processed using a standardized protocol [Craddock et al., 2013] of the FreeSurfer standard pipeline (Desikan-Killiany atlas) as part of the Preprocessed Connectomes Project [Craddock et al., 2013] and has been made available for download on the preprocessed section of the ABIDE initiative. For the current study we focused on cortical thickness measures of the 34 bilateral regions of the Desikan-Killiany atlas parcellation [Desikan et al., 2006] as a part of the FreeSurfer [Fischl et al., 2004] output and the average cortical thickness across all 34 regions.
We chose to include cortical thickness measures since they show a strong (negative) association with age (unlike measures of surface area, which remain more stable across the life span [Storsve et al., 2014]).

(a) Distribution of average cortical thickness measures of 573 individuals, grouped by the 20 acquisition sites the data were collected at (each boxplot describes the distribution of one site).
(b) Average cortical thickness of 573 individuals regressed onto age, grouped by site (each regression line describes one site).
(c) Thickness measures of all 34 cortical regions average cortical thickness grouped by individual, colored by site, sorted by age (each boxplot represents one individual). Displayed are 4 out of 20 sites from the ABIDE data set.
(d) Distribution of all 34 cortical regions average cortical thickness per individual, summarized as boxplot (each boxplot represents one individual). Boxplots are coloured by site and ordered by age within site.
Figure 1: Site effects in 573 healthy individuals from the ABIDE data set.

2.1.2 Site effects in the ABIDE data set

The ABIDE data set has been obtained by aggregating data from 20 independent samples collected at 17 different scanning locations [Di Martino et al., 2014]. Although all data has been collected with 3 Tesla scanners and preprocessed in a harmonized way [Craddock et al., 2013], sequence parameters for anatomical and functional data, as well as type of scanner, varied across sites [Di Martino et al., 2014].
In addition, sites differ in distribution of age and sex and in sample size. An overview of site-specific data is provided in Table 1 and in [Di Martino et al., 2014]. The ABIDE data set is affected by site-specific effects that are unlikely to be explained by biological variation. They manifest as linear and non-linear interactions between scanning site, covariates (for example age and sex), and cortical measures. Similar to batch effects in genomics [Leek et al., 2010], those effects lead to a clustering of the data caused by external factors related to the scanning and analysis process. With the aim of estimating to which extent the ABIDE data set is affected by site effects, we calculated an ANCOVA with age as covariate. It revealed that average cortical thickness differed between sites (main effect site: $F(19, 516) = 4.4$, $p < 0.1 \times 10^{-8}$, sum contrast). In addition we tested for differences in variance between sites. Bartlett’s sphericity test [Bartlett, 1937] showed a difference in variance between sites even after regressing out variance that could be explained by age and sex (p < 0.001). The site effects in the ABIDE data set are visualized in Fig. 1.

2.2 Splitting the ABIDE data set into training and test sets

To evaluate the performance of the models, we split the data into a training set (70% of data) and a test set (30% of data) using the R packages caret and splitstackshape, while the distribution of age, sex and site was preserved between sets. Thus, training and test sets contained individuals from the same sites ("within-site-split"). An overview of the distribution of age and sex for the training and test sets can be found in Fig. 2. Subsequently, the training and test sets were standardized region-wise based on location and scale parameters of the training set. For the model estimation process, only complete pairs of observations (per region) were used.

Figure 2: Overview of phenotypic information in the ABIDE data set. Age male subjects: M = 17.5, SD = 8.3. Age female subjects: M = 15.6, SD = 7.0. Range = 6.5-40.

2.3 Site as a predictor in a Hierarchical Bayesian Model

With the aim of creating reliable normative models in multi-site neuroimaging data, we developed and compared two versions of a hierarchical Bayesian model that include site as a predictor. In a hierarchical linear version of the model, site is modeled hierarchically, resulting in a random effect for site (Hierarchical Bayesian Linear Model, HBLM). In a non-linear version of the model, a Gaussian Process for age is added to test whether performance is increased if the model is also able to capture non-linear effects between age and thickness of the cortical region (Hierarchical Bayesian Gaussian Process Model, HBGPM). Both hierarchical Bayesian models were trained and tested in a within-site split (see section 2.2 on splitting the multi-site ABIDE data set).

2.4 Comparison models

To get a better understanding of the performance of our approach, we performed a second analysis, comparing the hierarchical Bayesian approach with site as predictor to predictions made from data to which other methods for handling site effects had been applied. In the following, those alternative models will be summarized under the term comparison models. Of note, the approach used to account for site effects in the comparison models is fundamentally different from the approach used in the hierarchical Bayesian models.
In the hierarchical Bayesian approach, multi-level modeling is used to account for site variance without removing it, whereas in the comparison models approach different methods of harmonization are applied to the data to remove variance related to site. In detail, the comparison model approach entailed a two-step procedure, in which site effects are first harmonized by three different common models of site harmonization, and then a simple Bayesian linear algorithm, with an additive term for age and sex but without site as a predictor, is used to make predictions in Stan [Stan Development Team, 2020b]. The harmonization procedures include i) regressing out the site effect from the cortical thickness measures using linear regression and using the residuals as input to the simple Bayesian linear model (thus, removing additive variance components of site), ii) using ComBat [Johnson et al., 2007, Fortin et al., 2017] to clear the data from site effects (thus, harmonizing for additive and multiplicative effects of site), and iii) using ComBat as above, but explicitly preserving the variance associated with sex and age, an approach which will be referred to as modified ComBat in the following. Predictions made from raw data (thus, without any treatment of site effects) were used as a baseline model. An overview of all pipelines for all models can be found in Fig. 3.

Figure 3: Pipelines for hierarchical Bayesian and comparison models

2.5 Performance measures

2.5.1 Measures of model performance

Model performance is assessed using several common performance metrics. The Pearson’s correlation coefficient ρ indicates the linear association between true and predicted value of cortical thickness measures. However, correlations are not a sensitive error measure and cannot capture the "miss" between true and predicted value. Hence, we also calculate the standardized version of the root mean squared error (SRMSE) and the point-wise log-likelihood at each data point in the test set as a metric indicating deviance from the true value. However, these measures only take into account the estimate of the mean, and do not account for variations in the estimate of the variance. Thus we also compute the proportion of variance explained (EV) by the predicted values and a standardized version of the log-loss (mean standardized log-loss, MSLL [Rasmussen and Williams, 2006]). The latter not only takes into account the variance of the test set, but also standardizes it by the variance of the training set, making a comparison between the models possible. This step is necessary as various methods of correcting for site might also have an impact on the variance remaining in the data.

2.5.2 Measures of goodness of the simulation in Stan

Parameters indicating the goodness of the model simulation process in Stan itself, like convergence, effective sample size, and trace plots can be found in the supplementary material.

2.6 Model specification

In this section we show how normative models describing the association between age, sex, and cortical thickness measures can be modeled on data comprising site effects using a hierarchical Bayesian linear mixed model with a Gaussian Process term, which allows modeling non-linear associations between age and cortical thickness measures.

Following the notations of [Gelman, 2008, Rasmussen and Williams, 2006], we model a target vector $y \in \mathbb{R}^{n \times 1}$ containing the individual responses $y_i$ for each subject $i = 1, \ldots, n$ and each region, using a latent function $f = f(x)$. $f_i = f(x_i)$ is the evaluation of the latent function for an input vector $x_i$ containing all $p$ input variables of subject $i$, and is considered to differ from the true response variables by additive noise $\epsilon_i$ with the variance $\eta_i$ and $\mathcal{N}(0, \sigma^2)$ along the diagonal, with $I$ being an $n \times n$ identity matrix:

$y \sim \mathcal{N}(f, \sigma^2 I)$, (1)

or, for the individual case:

$y_i = f(x_i) + \epsilon_i$, (2)

with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. The ability of the model to deal with site effects is obtained by introducing a random effect for site $s = 1, 2, \ldots, q$ so that the prediction for the $i$th subject is a combination of fixed and varying effects:

$f = X\beta + Zu + \gamma$, (3)

where $\gamma$ is an additional non-linear component (defined in (5) below) and the estimate for one particular subject $i$ is calculated as follows:

$f_i = \sum_{j=1}^{p} x_{ij}\beta_j + \sum_{s=1}^{q} z_{is}u_s + \gamma_i$ (4)

with $\beta \sim \mathcal{N}(0, \Sigma_j)$ and $u \sim \mathcal{N}(0, \Sigma_s)$. Here, $\beta$ is a $1 \times p$ vector containing the fixed regression weights corresponding to an $n \times p$ input matrix $X$ with columns $j = 1, \ldots, p$. In case of non-centered data, one column of ones for an intercept offset has to be added. Similarly, $u$ is a $1 \times q$ vector containing the weights for random effects across subjects, corresponding to a dummy coded $n \times q$ matrix $Z$ modeling site. For all linear models, in (3) we assume $\gamma_i = 0$. For the non-linear models we assume $\gamma$ is a Gaussian Process with mean function $m(x)$ and covariance function $k(x, x')$ to allow for non-linear dependencies between the predictors and the target variable:

$\gamma \sim \mathcal{GP}(m(x), k(x, x'))$.
(5) In our case, we set $m(x) = 0$ and define $k(x, x')$ as the additional non-linear component in the following squared-exponential form:

$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2 l_j^2}(x - x')^2\right)$, (6)

with free parameters for the signal variance term $\sigma_f^2$ and the length scale $l$. Note that this allows specifying two sources of variance: the signal variance $\sigma_f^2$ and the noise variance $\sigma^2$ as modeled in (1). From a hierarchical Bayesian point of view, random effects are equivalent to a hierarchical structure of sources of variation. For modeling site effects, introducing a hierarchical structure has the benefit that it allows including structural dependencies between sites via partial pooling. Thus, instead of modeling site effects as an effect shared between sites or independently from each other, a semi-independent association between sites can be obtained by assuming that all site parameters originate from a shared first-order prior distribution. This concept has been used elsewhere [Kia et al., 2020, Gelman et al., 2013, Mathys et al., 2012].

We hence introduce shared priors and hyper priors $\theta_0$ for site $s$, i.e. $\forall s, u_s \sim \mathrm{Inv}\Gamma(2, 2)$, and a uniform prior for the length scale $l \sim U(1, 8)$. We use Stan [Carpenter et al., 2017, Stan Development Team, 2020b] to estimate all free parameters $\theta = (\beta^T, u^T, l^T, \sigma, \sigma_f)$ by performing Bayesian inference:

$p(\theta \mid X, y, \theta_0) = \frac{p(\theta, X, y, \theta_0)}{p(X, y, \theta_0)} = \frac{p(\theta)\, p(X, y, \theta_0 \mid \theta)}{p(X, y, \theta_0)}$ (7)

where $p(X, y, \theta_0) = \int p(\theta)\, p(X, y, \theta_0 \mid \theta)\, d\theta$.
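As a rough numerical illustration of equations (3) and (6), the following sketch evaluates the linear part of the predictor with a dummy-coded site matrix and builds a squared-exponential covariance matrix for a one-dimensional input such as age. It is not the authors' Stan implementation; the function names, the integer site coding, and the use of a single scalar length scale are assumptions made for the example.

```python
import numpy as np

def linear_predictor(X, beta, site_index, u):
    """f = X beta + Z u, i.e. equation (3) with gamma = 0.

    site_index holds one integer site label (0 .. q-1) per subject,
    from which the dummy-coded n x q matrix Z is built.
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    u = np.asarray(u, dtype=float)
    Z = np.eye(u.size)[np.asarray(site_index)]  # n x q dummy coding of site
    return X @ beta + Z @ u

def squared_exponential(x1, x2, sigma_f, length_scale):
    """k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2)) for 1-D inputs."""
    x1 = np.asarray(x1, dtype=float)[:, None]
    x2 = np.asarray(x2, dtype=float)[None, :]
    return sigma_f**2 * np.exp(-0.5 * (x1 - x2) ** 2 / length_scale**2)

# Usage: two subjects from two sites, intercept plus one covariate.
f = linear_predictor([[1, 0], [1, 1]], beta=[2.0, 3.0], site_index=[0, 1], u=[0.5, -0.5])
K = squared_exponential([0.0, 1.0], [0.0, 1.0], sigma_f=2.0, length_scale=1.0)
```

The diagonal of `K` equals the signal variance, and the off-diagonal entries decay with the squared distance between inputs, which is what lets the Gaussian Process term capture smooth non-linear age effects.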
2.6.1 Posterior predictive distribution

We obtain the posterior predictive distribution $y_*$ for a new sample $x_*$ via:

$p(y_* \mid y) = \int p(y_*, \theta \mid x_*, X, y, \theta_0)\, d\theta = \int p(y_* \mid \theta, x_*, X, y, \theta_0)\, p(\theta \mid X, y, \theta_0)\, d\theta = \int p(y_* \mid \theta)\, p(\theta \mid X, y, \theta_0)\, d\theta$ (8)

as $y$ and $y_*$ are considered to be conditionally independent given $\theta$ [Gelman et al., 2013]. Further, the predictive distribution can be computed exactly, writing the joint distribution of the known data $y$, $X$ and the new sample $x_*$, with the variance being determined by sample variance $\sigma^2$ and the Gaussian kernel $k(x, x')$:

$k(x, x') = \begin{bmatrix} K + \sigma^2 I & k_* \\ k_*^T & k_{**} \end{bmatrix}$ (9)

Here, $K$ is an $n \times n$ covariance matrix of training data, $k_{**}$ denotes the variance at the test sample points and $k_*$ is the covariance between $y_*$ and the known data.

2.6.2 Comparison models

We compare the hierarchical Bayesian approach to normative modeling to commonly used harmonization techniques in which site is controlled for by subtracting an estimate of the site effect from the data prior to fitting the normative model. These methods included: i) removing additive effects of site, by regressing out site effects via linear regression and using the residuals as input for the simple Bayesian linear model to obtain the normative scores, ii) harmonizing for additive and multiplicative effects of site using ComBat [Johnson et al., 2007, Fortin et al., 2017], iii) modified ComBat, thus, using ComBat as before, but preserving biological variance of interest, i.e., sex and age. All these methods involve removing site effects prior to estimating the normative scores, in contrast to our method in which we explicitly model site within the normative modeling framework. These harmonized data, obtained as output from the harmonization techniques, are subsequently used for normative modeling in a simple Bayesian linear model that takes into account neither site effects nor non-linear dependencies between age and measures of cortical thickness. Thus, equation (3) is reduced to $f = X\beta$ with $\beta \sim \mathcal{N}(0, \Sigma_j)$.
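To make the difference between removing additive and removing additive plus multiplicative site effects concrete, here is a small sketch of the two ideas. The first function regresses out site with ordinary least squares on dummy-coded site indicators, as in method i). The second is a plain location-scale adjustment and only a crude stand-in for ComBat: it omits the empirical Bayes shrinkage and the preservation of covariates, and is not the algorithm of [Johnson et al., 2007]. Function names and the integer site coding are assumptions.

```python
import numpy as np

def regress_out_site(y, site_index, n_sites):
    """Residuals after least-squares regression of y on dummy-coded site."""
    y = np.asarray(y, dtype=float)
    Z = np.eye(n_sites)[np.asarray(site_index)]   # n x q site indicators
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)  # one additive offset per site
    return y - Z @ coef

def location_scale_adjust(y, site_index, n_sites):
    """Align site means and standard deviations (no empirical Bayes step).

    Requires at least two observations and non-zero spread per site.
    """
    y = np.asarray(y, dtype=float)
    site_index = np.asarray(site_index)
    residuals = regress_out_site(y, site_index, n_sites)
    pooled_sd = np.sqrt(np.sum(residuals**2) / (y.size - n_sites))
    adjusted = np.empty_like(y)
    for s in range(n_sites):
        mask = site_index == s
        site_sd = residuals[mask].std(ddof=1)
        # Remove the site-specific scale, then restore a common mean and scale.
        adjusted[mask] = y.mean() + residuals[mask] / site_sd * pooled_sd
    return adjusted

# Usage: two sites that differ in both mean and spread.
res = regress_out_site([1, 3, 10, 14], [0, 0, 1, 1], n_sites=2)
adj = location_scale_adjust([1, 3, 10, 14], [0, 0, 1, 1], n_sites=2)
```

Note that neither function knows about age or sex: if site is correlated with age, the age-related variance is removed along with the site offset, which is the failure mode discussed in the text.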
In addition we use this simple Bayesian linear model to make one set of predictions for each region from data that was not in any way harmonized for site (raw data model). R [R Core Team, 2020] was used for preprocessing of all data and to create the data set where site was regressed out, and for preprocessing the data with ComBat [Johnson et al., 2007, Fortin et al., 2017].

2.6.3 Implementation: Normative modeling in Stan

Both the hierarchical Bayesian and the comparison model version of the normative models were implemented in Stan [Carpenter et al., 2017, Stan Development Team, 2020b], a probabilistic C++ based programming language to perform Bayesian inference, and analyzed in R [R Core Team, 2020] using the package rstan [Stan Development Team, 2020a]. Stan allows direct computation of the log posterior density of a model given the known variables x and y. It uses the No-U-Turn Sampler (NUTS) [Hoffman and Gelman, 2014], a variation of Hamiltonian Monte Carlo Sampling [Duane et al., 1987, Neal et al., 2011, Neal, 1994], to generate representative samples from the posterior distribution of parameters and hyper parameters θ, each of which has the marginal distribution p(θ | y,x). This is achieved by first approximating the distribution of the data to a defined threshold in a warm-up period and then randomly sampling from the model, generating new draws of parameters for each iteration and calculating the response of the model. This approach of sampling instead of fitting allows for the simulation of complex models for which the derivation of an analytical solution of the posterior is computationally costly or not possible.
The Bayesian framework provides access to the full posterior distribution and to the distribution of all parameters. This makes it possible to deduce a variance estimate of each parameter, leading to a parameter estimate that is not only described by its mean, but also by the (un)certainty around the mean estimation, providing information on its accuracy and reliability. Moreover, we can use the posterior distribution of each site-specific parameter from the training set as prior for the test set, which makes it possible to make predictions for unfamiliar sites. The Stan code for the HBLM, the HBGPM and the simple Bayesian linear model without site as predictor can be found at https://github.com/likeajumprope/Bayesian_normative_models.

2.7 Model simulation process in Stan

Parameters indicating the goodness of the model simulation process in Stan [Carpenter et al., 2017, Stan Development Team, 2020b] itself, like convergence, effective sample size and trace plots can be found in the supplementary material.

3 Results

Both the HBLM and the HBGPM outperformed all other comparison models with respect to all performance measures considered in this study. In detail, the HBLM and the HBGPM showed higher average values of the Pearson’s correlation coefficient ρ (Table 2), lower average SRMSEs (Table 3), smaller average LL (Table 4) and higher average proportions of EV (Table 5) than all comparison models (p < 0.001 for all comparisons). For none of these comparisons did the non-linear HBGPM outperform the linear HBLM. In addition to the mean comparisons reported in Tables 2-5, the distribution of all performance measures across all 34 regions and for average cortical thickness across the entire cortex per model can be found in Fig. 4. A detailed comparison of all models with respect to ρ, SRMSE, EV and LL can be found in the supplementary material.
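The performance measures named in section 2.5.1 and reported here can be sketched as follows. The exact standardizations used in the paper are not spelled out in this section, so the definitions below are assumptions: SRMSE is taken as the RMSE divided by the standard deviation of the test targets, EV as one minus the variance of the prediction errors over the variance of the targets, and MSLL follows [Rasmussen and Williams, 2006], comparing the Gaussian negative log predictive density of the model with that of a trivial model predicting the training mean and variance. The function name is hypothetical.

```python
import numpy as np

def performance_measures(y_true, y_pred, var_pred, y_train):
    """Return rho, SRMSE, EV and MSLL for Gaussian predictive distributions."""
    y_true, y_pred, var_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, var_pred, y_train)
    )
    errors = y_true - y_pred
    rho = np.corrcoef(y_true, y_pred)[0, 1]
    srmse = np.sqrt(np.mean(errors**2)) / np.std(y_true)  # assumed standardization
    ev = 1.0 - np.var(errors) / np.var(y_true)            # assumed definition
    # Negative log predictive density of the model at each test point.
    nll_model = 0.5 * np.log(2 * np.pi * var_pred) + errors**2 / (2 * var_pred)
    # Same loss for a trivial model using the training mean and variance.
    mu0, var0 = y_train.mean(), y_train.var()
    nll_trivial = 0.5 * np.log(2 * np.pi * var0) + (y_true - mu0) ** 2 / (2 * var0)
    msll = np.mean(nll_model - nll_trivial)  # negative values beat the trivial model
    return {"rho": rho, "srmse": srmse, "ev": ev, "msll": msll}

# Usage: perfect mean predictions with a small predictive variance.
m = performance_measures(
    y_true=[0.0, 1.0, 2.0, 3.0],
    y_pred=[0.0, 1.0, 2.0, 3.0],
    var_pred=[0.1, 0.1, 0.1, 0.1],
    y_train=[0.0, 1.0, 2.0, 3.0],
)
```

Because MSLL depends on the predictive variance as well as the mean, a model can have a reasonable correlation and still obtain a large positive MSLL when its predictive variance is badly calibrated.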
3.1 Mean standardized log loss

To also account for the second-order statistics of the posterior distributions created by each model, we calculated the mean standardized log loss (MSLL). This measure can only be calculated for the test set, as it is the log loss standardized by the mean loss of the training data set [Rasmussen and Williams, 2006]. Hence, the MSLL indicates whether a model predicts the data better than the mean of the training set (more negative values are better). An overview of the MSLL for all cortical thickness measures of all regions for all models is given in Fig. 5a. The only models that perform better than the mean of the training data set for most regions are the hierarchical Bayesian models (MSLL_HBGPM < 0 for all regions; MSLL_HBLM < 0 for all but one region). In contrast, none of the predictions from the residuals model and the ComBat model perform better than the mean of the training data set (MSLL_residuals > 0 for all regions; MSLL_ComBat > 0 for all regions, see Fig. 5a). The MSLL for the modified ComBat model and the raw data model was region-dependent, with 45% of regions (16 out of 35) for the modified ComBat model and 17% of regions (six out of 35) for the raw data model performing better than predictions from the mean of the training set. It should also be mentioned that for some individual regions the comparison models performed very poorly (max MSLL_ComBat = 356, max MSLL_mod.ComBat = 138, max MSLL_raw = 1252, max MSLL_residuals = 517), with values that exceed the plotted range of Fig. 5a. In contrast, the maximum MSLL for the hierarchical Bayesian models was -0.056 for the HBGPM and 0.08 for the HBLM.

3.2 Predictive Variance

We also observed that the models differ in the variance of predicted values, as visualized in Fig. 5b for average cortical thickness.
For the ComBat, the raw data and the residuals model, the range of predicted values was severely restricted (range of predicted values, raw data, test set: [2.60 - 3.03]; residuals, test set: [2.64 - 3.00]; ComBat, test set: [2.73 - 2.97]). These intervals cover 9.2%, 7.9% and 8.0% of the original test set variance, respectively. The modified ComBat model retained 29.0% of the original test set variance (range of predicted values, modified ComBat: [2.55 - 3.01]). In other words, all harmonization techniques had a reduced predictive variance and were instead biased toward predicting the mean, sometimes severely. In contrast, this bias was substantially reduced in the hierarchical Bayesian models, which retained 57.0% (HBLM) and 65.0% (HBGPM) of the original test variance (range of predicted values, HBLM, test set: [2.43 - 3.23]; HBGPM, test set: [2.38 - 3.28]).

| Site (Abbreviation) | Manufacturer | Platform | Voxel size [mm] | TR [ms] | TE [ms] | n | Males [proportion] | Age range [years] |
|---|---|---|---|---|---|---|---|---|
| California I. of Technology (Caltech) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.0 | 1590 | 2.73 | 15 | 0.73 | 17-39 |
| Carnegie Mellon U. (CMU) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.0 | 1870 | 2.48 | 13 | 0.77 | 20-40 |
| Kennedy Krieger I. (KKI) | Philips | Achieva | 1.0 × 1.0 × 1.0 | 8 | 3.7 | 33 | 0.73 | 8-12 |
| Ludwig Maximilians U. Munich (LMU) | SIEMENS | VERIO | 1.0 × 1.0 × 1.0 | 1800 | 3.06 | 15 | 1 | 18-29 |
| NYU Langone Medical Center (NYU) | SIEMENS | ALLEGRA | 1.3 × 1.0 × 1.3 | 2530 | 3.25 | 20 | 0.75 | 12-16 |
| Olin I. of Living at Hartford Hospital (OLIN) | SIEMENS | ALLEGRA | 1.0 × 1.0 × 1.0 | 2500 | 2.74 | 30 | 0.9 | 7-35 |
| Oregon Health and Science U. (OHSU) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.0 | 2300 | 3.58 | 105 | 0.75 | 6-31 |
| San Diego State U. (SDSU) | GE | MR750 | 1.0 × 1.0 × 1.0 | NA | NA | 15 | 1 | 8-12 |
| Social Brain Lab (SBL) | Philips | INTERA | 1.0 × 1.0 × 1.0 | 9 | 3.5 | 16 | 0.875 | 10-23 |
| Stanford U. (STANFORD) | GE | SIGNA | 0.859 × 1.500 × 0.859 | 8.4 | 1.8 | 27 | 0.85 | 9-33 |
| Trinity Centre for Health Sciences (TRINITY) | Philips | Achieva | 1.0 × 1.0 × 1.0 | 3.9 | 8.5 | 13 | 1 | 20-39 |
| U. of California Los Angeles 1 (UCLA1) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.2 | 2300 | 2.84 | 22 | 0.73 | 8-16 |
| U. of California Los Angeles 2 (UCLA2) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.2 | 2300 | 2.84 | 20 | 0.8 | 7-12 |
| U. of Leuven 1 (LEUVEN1) | Philips | Achieva | 0.98 × 0.98 × 1.20 | 9.6 | 4.6 | 25 | 1 | 12-25 |
| U. of Leuven 2 (LEUVEN2) | Philips | Achieva | 0.98 × 0.98 × 1.20 | 9.6 | 4.6 | 32 | 0.88 | 9-17 |
| U. of Michigan 1 (UM1) | GE | SIGNA | NA | 250 | 5.7 | 13 | 0.85 | 9-13 |
| U. of Michigan 2 (UM2) | GE | SIGNA | NA | 250 | 5.7 | 54 | 0.69 | 8-19 |
| U. of Pittsburgh School of Medicine (PITT) | SIEMENS | ALLEGRA | 1.1 × 1.1 × 1.1 | 2100 | 3.93 | 21 | 0.95 | 13-28 |
| U. of Utah School of Medicine (USM) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.2 | 2300 | 2.91 | 43 | 1 | 8-39 |
| Yale Child Study Center (YALE) | SIEMENS | TIM TRIO | 1.0 × 1.0 × 1.2 | 1230 | 1.73 | 28 | 0.71 | 7-17 |

Table 1: The scanner parameters and sample specifications of the ABIDE data set.

| Model | Mean ρ (STD), training set | Mean ρ (STD), test set | HBLM | HBGPM | mod. ComBat | ComBat | residuals | raw data |
|---|---|---|---|---|---|---|---|---|
| HBLM | 0.734 (0.06) | 0.694 (0.06) | - | ns | *** | *** | *** | *** |
| HBGPM | 0.752 (0.05) | 0.705 (0.06) | ns | - | *** | *** | *** | *** |
| mod. ComBat | 0.541 (0.15) | 0.568 (0.16) | *** | *** | - | *** | *** | *** |
| ComBat | 0.289 (0.09) | 0.343 (0.11) | *** | *** | *** | - | ns | *** |
| residuals | 0.267 (0.08) | 0.329 (0.12) | *** | *** | *** | ns | - | *** |
| raw data | 0.435 (0.14) | 0.435 (0.16) | *** | *** | *** | * | ** | - |

Table 2: Post-hoc tests of correlations between true and predicted values. Cell values indicate post-hoc comparison significance values (adjusted by the Tukey method for comparing a family of 6 estimates). Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1; ns: not significant. Blue: test set. Yellow: training set.

| Model | Mean SRMSE (STD), training set | Mean SRMSE (STD), test set | HBLM | HBGPM | mod. ComBat | ComBat | residuals | raw data |
|---|---|---|---|---|---|---|---|---|
| HBLM | 0.0608 (0.006) | 0.066 (0.005) | - | ns | *** | *** | *** | *** |
| HBGPM | 0.0587 (0.006) | 0.064 (0.006) | ns | - | *** | *** | *** | *** |
| mod. ComBat | 0.0763 (0.007) | 0.075 (0.008) | *** | *** | - | *** | ns | ns |
| ComBat | 0.0872 (0.003) | 0.085 (0.005) | *** | *** | *** | - | *** | *** |
| residuals | 0.0865 (0.003) | 0.085 (0.004) | *** | *** | ns | *** | - | ns |
| raw data | 0.0808 (0.006) | 0.085 (0.008) | *** | *** | *** | *** | *** | - |

Table 3: Post-hoc tests of SRMSE between true and predicted values. Cell values indicate post-hoc comparison significance values (adjusted by the Tukey method for comparing a family of 6 estimates). Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1; ns: not significant. Blue: test set. Yellow: training set.

| LL | training set | test set |
|---|---|---|
| HBLM | −1.050 | −1.121 |
| HBGPM | −1.020 | −1.109 |
| ComBat mod. | −1.225 | −1.193 |
| ComBat | −1.374 | −1.336 |
| residuals | −1.381 | −1.394 |
| raw | −1.299 | −1.335 |

Table 4: Averaged log loss for training and test set.

| EV | training set | test set |
|---|---|---|
| HBLM | 0.5674 | 0.5 |
| HBGPM | 0.5397 | 0.485 |
| ComBat mod. | 0.3146 | 0.338 |
| ComBat | 0.0918 | 0.122 |
| residuals | 0.0778 | 0.114 |
| raw | 0.2091 | 0.208 |

Table 5: Averaged explained variance for training and test set.

Figure 4: Performance measures. (a) Distribution of Pearson's correlation coefficient ρ for 35 cortical regions, indicating the correlation between true and predicted values, training and test set. (b) SRMSE for 35 cortical regions, indicating the deviation between true and predicted values of six different models for the training and the test set. (c) Explained variance for 35 cortical regions, training and test set. (d) Log likelihood distribution for 35 cortical regions, training and test set.

4 Discussion

In this work, we aim to provide a method that allows the application of normative modeling to neuroimaging data sets that are affected by site effects resulting from pooling data between sites. In contrast to other methods that harmonize additive and multiplicative site effects in the data prior to the normative modeling (e.g., regressing out site effects, harmonization with ComBat), our approach is based on modeling site as a predictor within the normative modeling framework. The benefit of this approach is that it does not entail removing variance and thus cannot lead to an overestimation of site variance and accidental removal of meaningful variation in case the latter is confounded with site variation.

Using a hierarchical Bayesian approach, we propose two versions of normative models that were able to control for site effects. In both versions, site is modeled via a random intercept offset, but one version only models linear effects of age on cortical thickness (Hierarchical Bayesian Linear Model, HBLM), whereas the other version also includes a Gaussian process term in order to allow for potential non-linear relationships between age and cortical thickness measures (Hierarchical Bayesian Gaussian Process Model, HBGPM). The normative models are trained on a training set consisting of healthy individuals from the ABIDE data set (70% of the data from 20 different sites, within-site split, preserving the distribution of age and sex across training and test set), and we present results from generalization to a test set (the remaining 30% of the data from the same sites).

We compare the performance of our hierarchical Bayesian normative models, which explicitly model site effects and are applied to cortical thickness measures derived from FreeSurfer [Fischl et al., 2004], to other commonly used methods for dealing with site effects. These alternative methods included: i) regressing out site via linear regression and using the residuals, which removes additive site variation; ii) applying ComBat [Fortin et al., 2017, Fortin et al., 2018] to harmonize additive and multiplicative site effects in the data; and iii) modified ComBat, that is, applying ComBat while preserving age and sex effects in the data. Cortical thickness measures cleared of site effects using these alternative methods are used as dependent variables in a normative model with age and sex as predictors but excluding site. For comparison, we also include a fourth model in which we made predictions from raw data uncorrected for any site effects.
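The first alternative, regressing out site and continuing with the residuals, can be illustrated with a toy example in Python. All numbers below are made up for illustration and are not ABIDE data: two hypothetical sites scan different age ranges, so age is confounded with site. Regressing thickness on site indicators alone is equivalent to subtracting each site's mean, and the sketch shows how much variance and how much of the age effect survive that step. The function and variable names are assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


# Hypothetical toy data: site A scans children, site B scans adults.
ages = [8, 9, 10, 11, 12, 30, 32, 34, 36, 38]
sites = ["A"] * 5 + ["B"] * 5
site_offset = {"A": 0.05, "B": -0.05}  # assumed additive scanner effect
true_age_slope = -0.01                 # assumed thinning per year
thickness = [3.2 + true_age_slope * a + site_offset[s]
             for a, s in zip(ages, sites)]

# Regressing out site (site indicators only) = subtracting each site's mean.
site_mean = {
    s: mean([t for t, s2 in zip(thickness, sites) if s2 == s])
    for s in set(sites)
}
residuals = [t - site_mean[s] for t, s in zip(thickness, sites)]

variance_retained = variance(residuals) / variance(thickness)
slope_before = ols_slope(ages, thickness)
slope_after = ols_slope(ages, residuals)
print("fraction of variance retained:", variance_retained)
print("pooled age slope before / after:", slope_before, slope_after)
```

In this constructed case almost all of the variance, and with it most of the pooled age effect, is removed together with the site means, because the age difference between the sites is indistinguishable from a site offset. This is the mechanism that modeling site as a predictor is meant to avoid.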
Figure 5: Mean standardized log loss and predicted variance for 35 cortical regions. (a) MSLL distribution for 35 cortical regions, test set. (b) Predicted variance vs. actual variance for average cortical thickness for each model, derived from predictions of 573 individuals.

We report three main findings: (1) our normative hierarchical Bayesian models (both the linear HBLM and the non-linear HBGPM version), which explicitly model site effects within the normative modeling framework, outperform all alternative harmonization models with respect to model fit, including correlations between true and predicted values (ρ), standardized root mean square error (SRMSE), explained variance scores (EV), log-likelihood (LL) and the mean standardized log loss (MSLL); (2) the non-linear model did not significantly improve prediction of cortical thickness based on age, sex and site compared to the linear model; (3) all methods, but in particular the harmonization methods, lead to an undesirable shrinking of the variance in the predictions.

We showed that when using neuroimaging structural data sets pooled across different sites and scanners for estimating normative models, better predictive performance can be achieved by including site as a predictor than by using a two-step approach of first harmonizing the data with respect to site and subsequently creating a normative model using these “cleared” data. This conclusion is based on results showing that the hierarchical Bayesian models outperformed the harmonizing comparison models on all of the performance metrics we examined.
This includes the predictions derived from data that were cleared of site effects by a version of ComBat [Fortin et al., 2017, Fortin et al., 2018] in which variation associated with age and sex was preserved, which was the best-performing method among all harmonizing models. We observed a higher correlation between true and predicted values and LL values closer to zero for our hierarchical Bayesian models explicitly modeling site effects with a random intercept offset, indicating better model fit. As a key feature of normative models is that they not only estimate the predictive mean, but also give an estimate of the predictive variance and variation around the mean [Marquand et al., 2019, Marquand et al., 2016], we also included explained variance scores and the MSLL as performance metrics. Our HBLM and HBGPM models showed higher explained variance than the alternative models. In addition, the HBLM and HBGPM showed a negative MSLL in the test set. This metric compares the log loss between the true and predicted values with the loss that would be achieved using the mean and the variance of the training set [Rasmussen and Williams, 2006], thus capturing differences in variance in the data sets. This benefit in performance for the hierarchical Bayesian models is in line with previous literature using a similar paradigm [Kia et al., 2020]. [Kia et al., 2020] showed that a hierarchical Bayesian regression approach using site as a batch effect led to better performance than complete pooling, no pooling and ComBat. In detail, our findings match the findings of [Kia et al., 2020] with respect to the comparison between a normative model created from hierarchical Bayesian regression (HBR) and a modified ComBat version in a data set with the same sites in training and test set.
Their findings are in line with ours with respect to ρ ([Kia et al., 2020]: HBR range: 0.4 to 0.9, modified ComBat range: 0.2 to 0.8), SMSE ([Kia et al., 2020]: HBR range: 0.2 to 0.9, modified ComBat range: 0.4 to 1.0) and MSLL ([Kia et al., 2020]: HBR range: -0.7 to -1.0, modified ComBat range: -0.04 to 0.0), except that the MSLL for the modified ComBat model was worse in our study (see Figs. 4a, 4b, 5a). Therefore, our findings replicate the findings of [Kia et al., 2020] using an independent data set and a separate implementation, and extend that method to model non-linear functions using a Gaussian process term.

We anticipated that the non-linear version of the normative model, which included a Gaussian process for age, would perform better than the linear version, as studies have shown that the association between age and regional cortical thickness can be non-linear, especially for older age ranges [Storsve et al., 2014]. However, our results showed similar performance in predicting cortical thickness based on age, sex and site for both linear and non-linear models. This might be due to the fact that the age range in our sample was restricted to 6-40 years, thus likely capturing an age range where the association between age and cortical thickness is still mostly linear [Wierenga et al., 2014]. As a consequence, the non-linear version of the model was not able to improve the overall performance.
Nonetheless, since other structural brain measures, including sub-cortical volumes and cortical surface area [Wierenga et al., 2014, Raznahan et al., 2011], have shown stronger non-linear associations with age, non-linear normative models may outperform a linear model for other types of structural brain imaging measures.

Despite the overall good performance of our models, it should also be mentioned that the performance varied substantially between regions, as reflected in the variation in ρ, SRMSE, EV, LL and MSLL within models. We assume that this is due to the fact that, although average cortical thickness shows a strong association with age, different cortical brain regions differ in their association with age, and the magnitude of this correlation also changes across the life span [Storsve et al., 2014].

All models, but in particular the comparison models, have a significant shrinkage effect on the variance of the predicted values, indicating that harmonization techniques remove variance that is useful in predicting the response variable. This is most extreme for regressing out site effects and leads to poor performance across all performance metrics. We also observe that the performance of the residuals model is similar to that of the ComBat model without the preservation of age and sex, which is particularly reflected in the similarities of predicted variance in Fig. 5b and in the SRMSE. Both models suffer a loss of more than 90% of their original test variance. In contrast, the performance improves when variables like age and sex are preserved, as demonstrated by an increase in performance measures when using the version of ComBat in which variation associated with age and sex was preserved.
We argue that the similarity in performance between ComBat and the residuals model is an indicator of the same underlying process, showing a weakness of the harmonization approach: merely regressing out site effects leads to the removal of meaningful variation correlated with the predictors of interest (in this case age and sex), especially when these predictors are correlated with the site effects, which subsequently leads to worse predictions of cortical thickness based on age and sex. This can be partially prevented by preserving important sources of variation when regressing out site effects, as shown for the modified ComBat model, where specified sources of variance were preserved when regressing out site effects. However, our results show two additional flaws of the harmonization approach: 1) as already pointed out by [Kia et al., 2020], in order to specify sources of variance that should be retained, all those sources of variance have to be known, which is not always the case; 2) even with age and sex preserved, the modified ComBat model only retains 40% of the original variance. In contrast, our hierarchical Bayesian models, including the prediction-based approach, preserve known and unknown interactions between site and biological covariates by specifically modeling site, thus overcoming this requirement. The result is reflected in larger proportions of variance retained (see Fig. 5b).

The advantage of the hierarchical Bayesian approach becomes particularly clear when considering that the scores derived from normative models are relative scores describing the deviation from a predicted normative mean. Thus, the normative deviation score is not affected by the absolute value of the predicted mean, and the number of predictors in the model does not influence the normative score.
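The relative nature of such a score can be sketched in a few lines of Python. The formula used below, the observed value minus the predicted mean, divided by the predictive standard deviation, follows a convention that is common in the normative modeling literature; it is an assumption here, not a restatement of the exact score computed in this work. The conversion to a centile additionally assumes a Gaussian predictive distribution. The function names and numbers are hypothetical.

```python
import math


def deviation_score(observed, predicted_mean, predictive_sd):
    """Assumed z-type normative deviation: distance from the predicted mean in predictive SDs."""
    return (observed - predicted_mean) / predictive_sd


def centile(z):
    """Centile of a z-score under an assumed Gaussian predictive distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Two hypothetical individuals whose predicted means differ (e.g. different age or site)
# but who lie equally far below their own predicted mean.
z_first = deviation_score(observed=2.50, predicted_mean=2.80, predictive_sd=0.15)
z_second = deviation_score(observed=3.00, predicted_mean=3.30, predictive_sd=0.15)

print(z_first, z_second, centile(z_first))
```

Both individuals receive the same deviation score although their absolute thickness values and predicted means differ, which is the sense in which the score is independent of the absolute value of the predicted mean.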
Previous attempts to estimate the centiles of normative models have included polynomial regression [Kessler et al., 2016], support vector regression [Erus et al., 2015], quantile regression [Huizinga et al., 2018, Lv et al., 2020] and Gaussian process regression [Wolfers et al., 2018b], which differ in their ability to separate sources of variance and to make individual predictions (for an overview see [Marquand et al., 2019]). We chose a hierarchical Bayesian framework for the implementation of our normative model as it has several advantages. The distribution-based structure based on posteriors allows for the separation and integration of different sources of variance, including epistemic (uncertainty in the model parameters), aleatoric (inherent variability in the data) and prior variation, which are all considered when predicting cortical thickness based on age, sex and site. This allows both for the integration of already known information into the predictions in the form of priors, and for an adjustment of the precision of the estimate based on the uncertainty at each data point. In addition, the Bayesian framework, as implemented in software packages like Stan [Carpenter et al., 2017, Stan Development Team, 2020b], makes it possible to draw samples from the full posterior distribution at the level of individual participants, which leads to an exact estimate of all parameters instead of an approximation. In particular in comparison to quantile regression, the distributional assumption entailed in the hierarchical Bayesian approach also yields more precise estimates of the underlying centiles, particularly of the outer centiles, which are usually of primary interest and where the data are sparsest.

The proposed Bayesian framework also offers an elegant way to integrate site effects into normative models.
Site effects can be modeled via a hierarchical random effect structure, in which different sites are modeled semi-independently, sharing variation via a combined prior of higher order. This approach, also known as partial pooling, allows for including site-specific variance in the prediction for each site, while at the same time constraining the amount of between-site variation to a maximum.

Whilst the primary aim of this study was to develop a novel method for dealing with site effects specifically within a normative modeling framework, the method can be used as a general approach to clear neuroimaging data of site, age and sex effects. This is because a normative score describes an individual's cortical thickness in relation to the variance explained by the predictor variables in the normative model (age, sex and site). Hence, normative scores can be seen as “cleaned” cortical thickness measures that can be the basis for further analysis, for example to establish the association between cortical thickness measures and clinical or demographic information. Another potential clinical use is that a normative model based on healthy controls, once established, can be used to derive individualized deviation scores for individuals with a psychiatric or neurological disorder. Their individual deviation scores can be considered as the degree of deviation from the normative variation and be used for further analysis, for example to predict clinically useful information.

Our proposed method has two potential disadvantages.
The first one is related to the computational cost associated with estimating the covariance matrix within the Gaussian process for the non-linear models, which in our analysis amounted to 25 hours per model per region and could only be handled via parallel processing on a cluster. This is because the non-linear Gaussian process term becomes very expensive in time and memory with growing n (O(n^2)). Thus, in cases in which the relationship between the predictor and the outcome is estimated to be close to linear, the need for the more complex non-linear model should be carefully considered. Secondly, the between-site split and the model in its current state only allow generalizations to a test set that includes individuals from the same sites as the training set, that is, where the site variation is known. However, especially in clinical settings, generalizing the model and making predictions for data from new sites is an important additional goal. Although we cannot use the posterior distribution of one particular site as a prior when applying the model to a new, unknown site, the hierarchical Bayesian framework still allows using the posterior parameter distributions of all sites, as derived from the training data set, as priors for site parameters when applying the model to a new site. This approach has already been successfully demonstrated in [Kia et al., 2020], where the posterior parameter distribution of site derived from the training data was used as an informative prior for the site predictor in a normative model applied to test data consisting of new (unknown) sites. This use of so-called informed priors leads to more accurate and precise predictions than the broad, unspecific prior that would have to be used in cases where the distribution of the data is unknown [Kia et al., 2020].
Thus, despite some loss in precision, the Bayesian framework can, in contrast to all other methods examined in this paper, be adapted to make predictions for new, unknown sites.

5 Conclusion

We proposed an extended version of a normative modeling approach that is able to accommodate site effects in neuroimaging data. The method is superior to previous approaches, including regressing out site and versions of ComBat [Fortin et al., 2017, Johnson et al., 2007], and facilitates the estimation of normative models based on neuroimaging data pooled across many different scan sites. A further extension of the model to make generalizations to new sites and the application to clinical data will be the objective of future work.

6 Online material

The supplementary material and the Stan code for the HBLM, HBGPM and simple Bayesian linear model can be found at https://github.com/likeajumprope/Bayesian_normative_models.

References

[Bartlett, 1937] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences, 160(901):268–282. [Bethlehem et al., 2018] Bethlehem, R., Seidlitz, J., Romero-Garcia, R., and Lombardo, M. (2018). Using normative age modelling to isolate subsets of individuals with autism expressing highly age-atypical cortical thickness features. bioRxiv, page 252593. [Carpenter et al., 2017] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language.
Journal of statistical software, 76(1). [Chen et al., 2014] Chen, J., Liu, J., Calhoun, V. D., Vasquez, A. A., Zwiers, M. P., Gupta, C. N., Frannke, B., and Turner, J. A. (2014). Exploration of scanning effects in multi-site structural MRI studies. Journal of Neuroscience Methods, 23(15):37–50. [Craddock et al., 2013] Craddock, C., Benhajali, Y., Chu, C., Chouinard, F., Evans, A., Jakab, A., Khundrakpam, B. S., Lewis, J. D., Li, Q., Milham, M., et al. (2013). The neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. Neuroinformatics, 4. [Desikan et al., 2006] Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., Albert, M. S., and Killiany, R. J. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968–980. [Di Martino et al., 2014] Di Martino, A., Yan, C. G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., Anderson, J. S., Assaf, M., Bookheimer, S. Y., Dapretto, M., Deen, B., Delmonte, S., Dinstein, I., Ertl-Wagner, B., Fair, D. A., Gallagher, L., Kennedy, D. P., Keown, C. L., Keysers, C., Lainhart, J. E., Lord, C., Luna, B., Menon, V., Minshew, N. J., Monk, C. S., Mueller, S., Müller, R. A., Nebel, M. B., Nigg, J. T., O’Hearn, K., Pelphrey, K. A., Peltier, S. J., Rudie, J. D., Sunaert, S., Thioux, M., Tyszka, J. M., Uddin, L. Q., Verhoeven, J. S., Wenderoth, N., Wiggins, J. L., Mostofsky, S. H., and Milham, M. P. (2014). The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6):659–667. [Duane et al., 1987] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid monte carlo. Physics letters B, 195(2):216–222. [Erus et al., 2015] Erus, G., Battapady, H., Satterthwaite, T. D., Hakonarson, H., Gur, R. 
E., Davatzikos, C., and Gur, R. C. (2015). Imaging Patterns of Brain Development and their Relationship to Cognition. Cerebral Cortex, 25(6):1676–1684. [Feczko et al., 2019] Feczko, E., Miranda-Dominguez, O., Marr, M., Graham, A. M., Nigg, J. T., and Fair, D. A. (2019). The Heterogeneity Problem: Approaches to Identify Psychiatric Subtypes. Trends in Cognitive Sciences, 23(7):584–601. [Fischl et al., 2004] Fischl, B., Van Der Kouwe, A., Destrieux, C., Halgren, E., Ségonne, F., Salat, D. H., Busa, E., Seidman, L. J., Goldstein, J., Kennedy, D., Caviness, V., Makris, N., Rosen, B., and Dale, A. M. (2004). Automatically Parcellating the Human Cerebral Cortex. Cerebral Cortex, 14(1):11–22. [Fortin et al., 2018] Fortin, J. P., Cullen, N., Sheline, Y. I., Taylor, W. D., Aselcioglu, I., Cook, P. A., Adams, P., Cooper, C., Fava, M., McGrath, P. J., McInnis, M., Phillips, M. L., Trivedi, M. H., Weissman, M. M., and Shinohara, R. T. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167(June 2017):104–120. [Fortin et al., 2017] Fortin, J. P., Parker, D., Tunç, B., Watanabe, T., Elliott, M. A., Ruparel, K., Roalf, D. R., Sat- terthwaite, T. D., Gur, R. C., Gur, R. E., Schultz, R. T., Verma, R., and Shinohara, R. T. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161:149–170. [Foulkes and Blakemore, 2018] Foulkes, L. and Blakemore, S.-J. (2018). Studying individual differences in human adolescent brain development. Nature neuroscience, 21(3):315–323. [Fried, 2017] Fried, E. (2017). Moving forward: how depression heterogeneity hinders progress in treatment and research. Expert Review of Neurotherapeutics, 17(5):423–425. [Gelman, 2008] Gelman, A. (2008). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. [Gelman et al., 2013] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. CRC press. 
[Gronenschild et al., 2012] Gronenschild, E. H. B. M., Habets, P., Jacobs, H. I. L., Mengelers, R., Rozendaal, N., van Os, J., and Marcelis, M. (2012). The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements. PLoS ONE, 7(6):e38234. [Han et al., 2006] Han, X., Jovicich, J., Salat, D., van der Kouwe, A., Quinn, B., Czanner, S., Busa, E., Pacheco, J., Albert, M., Killiany, R., Maguire, P., Rosas, D., Makris, N., Dale, A., Dickerson, B., and Fischl, B. (2006). Reliability of MRI-derived measurements of human cerebral cortical thickness: The effects of field strength, scanner upgrade and manufacturer. NeuroImage, 32(1):180–194. [Hoffman and Gelman, 2014] Hoffman, M. D. and Gelman, A. (2014). The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593–1623. [Huizinga et al., 2018] Huizinga, W., Poot, D., Vernooij, M., Roshchupkin, G., Bron, E., Ikram, M., Rueckert, D., Niessen, W., and Klein, S. (2018). A spatio-temporal reference model of the aging brain. NeuroImage, 169:11–22. [Insel et al., 2010] Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., and Wang, P. (2010). Research domain criteria (rdoc): toward a new classification framework for research on mental disorders. [Insel, 2014] Insel, T. R. (2014). The NIMH Research Domain Criteria (RDoC) Project: Precision Medicine for Psychiatry. American Journal of Psychiatry, 171(4):395–397. [Johnson et al., 2007] Johnson, W. E., Li, C., and Rabinovic, A. (2007).
Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127. [Kessler et al., 2016] Kessler, D., Angstadt, M., and Sripada, C. (2016). Growth charting of brain connectivity networks and the identification of attention impairment in youth. JAMA psychiatry, 73(5):481–489. [Kia et al., 2020] Kia, S. M., Huijsdens, H., Dinga, R., Wolfers, T., Mennes, M., Andreassen, O. A., Westlye, L. T., Beckmann, C. F., and Marquand, A. F. (2020). Hierarchical bayesian regression for multi-site normative modeling of neuroimaging data. In Martel, A. L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M. A., Zhou, S. K., Racoceanu, D., and Joskowicz, L., editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pages 699–709, Cham. Springer International Publishing. [Kia and Marquand, 2018] Kia, S. M. and Marquand, A. (2018). Normative Modeling of Neuroimaging Data Using Scalable Multi-task Gaussian Processes. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 11072 LNCS, pages 127–135. [Leek et al., 2010] Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739. [Lv et al., 2020] Lv, J., Biase, M. D., Cash, R. F., Cocchi, L., Cropley, V., Klauser, P., Tian, Y., Bayer, J., Schmaal, L., Cetin-Karayumak, S., Rathi, Y., Pasternak, O., Bousman, C., Pantelis, C., Calamante, F., and Zalesky, A. (2020). Individual deviations from normative models of brain structure in a large cross-sectional schizophrenia cohort. bioRxiv, page 2020.01.17.911032. [Marquand et al., 2014] Marquand, A. F., Brammer, M., Williams, S. C., and Doyle, O. M. (2014). Bayesian multi-task learning for decoding multi-subject neuroimaging data. 
NeuroImage, 92:298–311. [Marquand et al., 2019] Marquand, A. F., Kia, S. M., Zabihi, M., Wolfers, T., Buitelaar, J. K., and Beckmann, C. F. (2019). Conceptualizing mental disorders as deviations from normative functioning. Molecular Psychiatry, 24(10):1415–1424. [Marquand et al., 2016] Marquand, A. F., Rezek, I., Buitelaar, J., and Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7):552–561. [Mathys et al., 2012] Mathys, C. D., Prüssmann, K., Stephan, K. E., and Behrens, T. (2012). HIERARCHICAL GAUSSIAN FILTERING Construction and variational inversion of a generic Bayesian model of individual learning under uncertainty. [Miller et al., 2016] Miller, K. L., Alfaro-Almagro, F., Bangerter, N. K., Thomas, D. L., Yacoub, E., Xu, J., Bartsch, A. J., Jbabdi, S., Sotiropoulos, S. N., Andersson, J. L., et al. (2016). Multimodal population brain imaging in the uk biobank prospective epidemiological study. Nature neuroscience, 19(11):1523. [Mirnezami et al., 2012] Mirnezami, R., Nicholson, J., and Darzi, A. (2012). Preparing for Precision Medicine. New England Journal of Medicine, 366(6):489–491. [Mueller et al., 2005] Mueller, S. G., Weiner, M. W., Thal, L. J., Petersen, R. C., Jack, C. R., Jagust, W., Trojanowski, J. Q., Toga, A. W., and Beckett, L. (2005). Ways toward an early diagnosis in alzheimer’s disease: the alzheimer’s disease neuroimaging initiative (adni). Alzheimer’s & Dementia, 1(1):55–66. 17 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 [Neal, 1994] Neal, R. M. (1994). An improved acceptance procedure for the hybrid monte carlo algorithm. 
Journal of Computational Physics, 111(1):194–203. [Neal et al., 2011] Neal, R. M. et al. (2011). Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2. [R Core Team, 2020] R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Rasmussen and Williams, 2006] Rasmussen, C. E. and Williams, C. K. I. (2006). GAUSSIAN PROCESSES FOR MACHINE LEARNING. MIT Press, Cambridge. [Raznahan et al., 2011] Raznahan, A., Shaw, P., Lalonde, F., Stockman, M., Wallace, G. L., Greenstein, D., Clasen, L., Gogtay, N., and Giedd, J. N. (2011). How Does Your Cortex Grow? Journal of Neuroscience, 31(19):7174–7177. [Stan Development Team, 2020a] Stan Development Team (2020a). RStan: the R interface to Stan. R package version 2.21.2. [Stan Development Team, 2020b] Stan Development Team (2020b). Stan modeling language users guide and reference manual, version 2.25. [Storsve et al., 2014] Storsve, A. B., Fjell, A. M., Tamnes, C. K., Westlye, L. T., Overbye, K., Aasland, H. W., and Walhovd, K. B. (2014). Differential Longitudinal Changes in Cortical Thickness, Surface Area and Volume across the Adult Life Span: Regions of Accelerating and Decelerating Change. Journal of Neuroscience, 34(25):8488–8498. [Thompson et al., 2020] Thompson, P. M., Jahanshad, N., Ching, C. R. K., Salminen, L. E., Thomopoulos, S. I., Bright, J., Baune, B. T., Bertolín, S., Bralten, J., Bruin, W. B., Bülow, R., Chen, J., Chye, Y., Dannlowski, U., de Kovel, C. G. F., Donohoe, G., Eyler, L. T., Faraone, S. V., Favre, P., Filippi, C. A., Frodl, T., Garijo, D., Gil, Y., Grabe, H. J., Grasby, K. L., Hajek, T., Han, L. K. M., Hatton, S. N., Hilbert, K., Ho, T. C., Holleran, L., Homuth, G., Hosten, N., Houenou, J., Ivanov, I., Jia, T., Kelly, S., Klein, M., Kwon, J. S., Laansma, M. A., Leerssen, J., Lueken, U., Nunes, A., Neill, J. O., Opel, N., Piras, F., Piras, F., Postema, M. 
C., Pozzi, E., Shatokhina, N., Soriano-Mas, C., Spalletta, G., Sun, D., Teumer, A., Tilot, A. K., Tozzi, L., van der Merwe, C., Van Someren, E. J. W., van Wingen, G. A., Völzke, H., Walton, E., Wang, L., Winkler, A. M., Wittfeld, K., Wright, M. J., Yun, J.-Y., Zhang, G., Zhang-James, Y., Adhikari, B. M., Agartz, I., Aghajani, M., Aleman, A., Althoff, R. R., Altmann, A., Andreassen, O. A., Baron, D. A., Bartnik-Olson, B. L., Marie Bas-Hoogendam, J., Baskin-Sommers, A. R., Bearden, C. E., Berner, L. A., Boedhoe, P. S. W., Brouwer, R. M., Buitelaar, J. K., Caeyenberghs, K., Cecil, C. A. M., Cohen, R. A., Cole, J. H., Conrod, P. J., De Brito, S. A., de Zwarte, S. M. C., Dennis, E. L., Desrivieres, S., Dima, D., Ehrlich, S., Esopenko, C., Fairchild, G., Fisher, S. E., Fouche, J.-P., Francks, C., Frangou, S., Franke, B., Garavan, H. P., Glahn, D. C., Groenewold, N. A., Gurholt, T. P., Gutman, B. A., Hahn, T., Harding, I. H., Hernaus, D., Hibar, D. P., Hillary, F. G., Hoogman, M., Hulshoff Pol, H. E., Jalbrzikowski, M., Karkashadze, G. A., Klapwijk, E. T., Knickmeyer, R. C., Kochunov, P., Koerte, I. K., Kong, X.-Z., Liew, S.-L., Lin, A. P., Logue, M. W., Luders, E., Macciardi, F., Mackey, S., Mayer, A. R., McDonald, C. R., McMahon, A. B., Medland, S. E., Modinos, G., Morey, R. A., Mueller, S. C., Mukherjee, P., Namazova-Baranova, L., Nir, T. M., Olsen, A., Paschou, P., Pine, D. S., Pizzagalli, F., Rentería, M. E., Rohrer, J. D., Sämann, P. G., Schmaal, L., Schumann, G., Shiroishi, M. S., Sisodiya, S. M., Smit, D. J. A., Sønderby, I. E., Stein, D. J., Stein, J. L., Tahmasian, M., Tate, D. F., Turner, J. A., van den Heuvel, O. A., van der Wee, N. J. A., van der Werf, Y. D., van Erp, T. G. M., van Haren, N. E. M., van Rooij, D., van Velzen, L. S., Veer, I. M., Veltman, D. J., Villalon-Reina, J. E., Walter, H., Whelan, C. D., Wilde, E. A., Zarei, M., and Zelman, V. (2020). 
ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries. Translational Psychiatry, 10(1):100. [Volkow et al., 2018] Volkow, N. D., Koob, G. F., Croyle, R. T., Bianchi, D. W., Gordon, J. A., Koroshetz, W. J., Pérez-Stable, E. J., Riley, W. T., Bloch, M. H., Conway, K., et al. (2018). The conception of the abcd study: From substance use to a broad nih collaboration. Developmental cognitive neuroscience, 32:4–7. [Wierenga et al., 2014] Wierenga, L. M., Langen, M., Oranje, B., and Durston, S. (2014). Unique developmental trajectories of cortical thickness and surface area. NeuroImage, 87:120–126. [Wolfers et al., 2019] Wolfers, T., Beckmann, C. F., Hoogman, M., Buitelaar, J. K., Franke, B., and Marquand, A. F. (2019). Individual differences v. the average patient: mapping the heterogeneity in ADHD using normative models. Psychological Medicine, pages 1–10. [Wolfers et al., 2020] Wolfers, T., Beckmann, C. F., Hoogman, M., Buitelaar, J. K., Franke, B., and Marquand, A. F. (2020). Individual differences v. the average patient: mapping the heterogeneity in adhd using normative models. Psychological Medicine, 50(2):314–323. [Wolfers et al., 2018a] Wolfers, T., Doan, N. T., Kaufmann, T., Alnæs, D., Moberget, T., Agartz, I., Buitelaar, J. K., Ueland, T., Melle, I., Franke, B., Andreassen, O. A., Beckmann, C. F., Westlye, L. T., and Marquand, A. F. (2018a). 18 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 Mapping the Heterogeneous Phenotype of Schizophrenia and Bipolar Disorder Using Normative Models. JAMA Psychiatry, 75(11):1146. [Wolfers et al., 2018b] Wolfers, T., Doan, N. 
T., Kaufmann, T., Alnæs, D., Moberget, T., Agartz, I., Buitelaar, J. K., Ueland, T., Melle, I., Franke, B., et al. (2018b). Mapping the heterogeneous phenotype of schizophrenia and bipolar disorder using normative models. JAMA psychiatry, 75(11):1146–1155. [Zabihi et al., 2019] Zabihi, M., Oldehinkel, M., Wolfers, T., Frouin, V., Goyard, D., Loth, E., Charman, T., Tillmann, J., Banaschewski, T., Dumas, G., et al. (2019). Dissecting the heterogeneous cortical anatomy of autism spectrum disorder using normative models. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 4(6):567–578. 19 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 Introduction Methods Data ABIDE data set Site effects in the ABIDE data set Splitting the ABIDE data set into training and test sets site as a predictor in a Hierarchical Bayesian Model Comparison models Performance measures Measures of model performance Measures of goodness of the simulation in Stan Model specification Posterior predictive distribution Comparison models Implementation: Normative modeling in Stan Model simulation process in Stan Results Mean standardized log loss Predictive Variance Discussion Conclusion Online material 10_1101-2021_02_09_430405 ---- In-silico Structural and Molecular Docking-Based Drug Discovery Against Viral Protein (VP35) of Marburg Virus: A potent Agent of MAVD In-silico Structural and Molecular Docking-Based Drug Discovery Against Viral Protein (VP35) of Marburg Virus: A potent Agent of MAVD Sameer Quazi1,2, Javed Malik3, Arnaud Martino Capuzzo1,4, Kamal Singh Suman1,3 , Zeshan Haider5 1. Gen-Lab BioSolutions Private Limited, Bangalore, Karnataka, India 2. 
Department of Genetics, Indian Academy Degree College, Bangalore, Karnataka, India. 3. Department of Zoology, Guru Ghasidas Vishwavidyalaya, Bilaspur, Chhattisgarh, India. 4. Department of Veterinary Sciences, University of Milan, Italy. 5. Centre of Agricultural Biochemistry and Biotechnology (CABB), University of Agriculture Faisalabad, Pakistan.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430405; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

ABSTRACT
The Marburg virus (MARV) is a highly pathogenic etiological agent of hemorrhagic fever in humans. MARV has spread across the world, including America, Australia, Europe, and several Asian countries. However, no approved vaccine against MARV exists, which, combined with the high mortality rate, makes antiviral drugs against MARV an urgent need. Viral protein 35 (VP35) is a core protein of MARV that is involved in multiple functions of the infection cycle. This research used an in-silico drug design technique to discover new drug-like small molecules that inhibit VP35. First, approximately 4260 compounds with more than 90% structural similarity to the reference compound were retrieved from the online PubChem database. Molecular docking was performed using AutoDock 4.2, and ligands were selected if their docking (S) score was lower than that of the reference CID_5477931 and their RMSD value was between 1 and 2. About 50 compounds showed stronger binding, forming hydrogen-bond, van der Waals, and polar interactions with VP35. After evaluation of binding energy and ADMET analysis, only CID_3007938 and CID_11427396 were finalized; these showed the strongest binding energy and a strong inhibitory effect on MARV VP35.
The higher binding energy, suitable ADMET profile, and drug-likeness parameters suggest that CID_3007938 and CID_11427396 have strong potential to inhibit MARV replication and could therefore contribute to the treatment of MAVD.

KEYWORDS: Marburg virus, VP35, database screening, molecular docking, ADMET profiling, in-silico drug discovery

1. INTRODUCTION
Marburg virus (MARV) is an enveloped virus and a member of the Filoviridae family. Its genome is a non-segmented, single-stranded, negative-sense RNA of 17 kb to 19 kb (Bausch et al. 2006; Carroll et al. 2013). MARV caused intermittent disease in limited numbers of people in Africa in the ten years following its first discovery in 1968. Two significant outbreaks in the Democratic Republic of the Congo showed approximately 83% mortality (Towner et al. 2006; Zhu et al. 2017). MARV causes the disease commonly known as hemorrhagic fever in humans and animals (Mehedi et al. 2011). No vaccines or therapeutics are presently approved for managing MARV infections, and work on MARV is therefore urgently required (Anthony and Bradfute 2015).

The MARV genome contains seven genes. Each gene contains an open reading frame (ORF) with flanking regions ranging from 68 to 648 nucleotides in length (Wang et al. 2001). The five structural proteins, namely the nucleoprotein (NP), the viral proteins VP35 and VP40, the glycoprotein, and the RNA-dependent RNA polymerase (L), play an essential role in the infectivity of MARV (Biacchesi et al. 2003). NP performs a pivotal function in the growth and development of MARV virions. NP combines with other viral proteins, particularly VP24, VP40, VP35, and VP30, as an essential component of the virus assembly machinery to coordinate the replication process (Bamberg et al. 2005; Becker et al. 1998). It thus acts as the scaffold on which the nucleocapsid develops into a helical tubular structure. Sequence homology reveals that NP includes a conserved N-terminal region suited to self-assembly and to single-stranded RNA (ssRNA), and a largely disordered C-terminal region containing a part necessary for the propagation of virions (Dolnik et al. 2010; Kolb et al. 2014).

Furthermore, MARV counteracts the host immune response through the multifunctional VP35, which plays an essential role in viral RNA synthesis, assembly, and virion structure. MARV VP35 interacts with many elements of innate antiviral defense, particularly the mechanisms by which the RIG-I (retinoic acid-inducible gene I)-like receptors contribute to interferon (IFN) production (Ramanan et al. 2012). FGI-103 (2-(2-(5-(aminomethyl)-1-benzofuran-2-yl) vinyl)-1H-benzimidazole-5-carboximidamide) is a small drug-like compound that has previously been reported as an effective drug against VP35 of MARV (Warren et al. 2010).
VP35 is considered a vital target for antiviral drug design because of its important role in MARV transcription. FGI-103 was used as the query to screen small molecules from the PubChem database by structure similarity-based filtering (more than 90% similarity) in order to find novel compounds. Computer-aided drug design (CADD) and virtual high-throughput screening play a critical role in drug discovery (Lyne 2002). Bioinformatics techniques, including structure-based screening of drug-like compounds from online databases, molecular docking, and molecular dynamics simulation, can be used to find compounds that block the P1 active site of VP35. The current research was designed to identify, by computational strategies, novel drug-like substances with stronger contacts, binding energy, and inhibitory effect at the P1 site of MARV VP35. The final drug-like small molecules are expected to have greater potential to stop the replication of MARV in the host, which could ultimately help in designing and developing new drugs to treat MAVD.

2. MATERIALS AND METHODS

2.1. Amino acid sequence retrieval and analysis
The amino acid sequence of the VP35 protein was retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/). NCBI is a leading public biomedical database and provides tools for analyzing genomic and molecular information in computational biology (Jenuth 2000). The primary sequence of the protein was analyzed using the online tool Expasy-ProtParam (https://web.expasy.org/protparam/). ProtParam was used to compute physical and chemical parameters of the protein, including the molecular weight, isoelectric point, atomic composition, estimated half-life, amino acid composition, aliphatic index, and grand average of hydropathicity (Garg et al. 2016).

2.2. Structure prediction, evaluation, and validation of protein
The VP35 protein sequence was used to identify a template with high sequence similarity. The sequence was searched with the Basic Local Alignment Search Tool (BLAST) against the Protein Data Bank (PDB), and the structure with the highest sequence similarity was selected. The three-dimensional (3D) structure of the VP35 protein was built using MODELLER v9.25, a desktop-based computational tool for homology-based prediction of 3D protein structures. The most favorable and accurate 3D model was selected based on the DOPE score (Eswar et al. 2006). The quality of the 3D structure of VP35 was assessed and validated using the freely available online PROCHECK tool (https://servicesn.mbi.ucla.edu/PROCHECK/). PROCHECK assesses the stereochemical quality of a protein structure (Laskowski et al. 1993).

2.3. Formation of coordinate files
The 3D structure of the VP35 protein was prepared using Discovery Studio Visualizer and AutoDock 4.2. The structure was optimized by removing water molecules, adding hydrogen atoms (including polar hydrogens), adding Kollman charges, and fixing the receptor atoms. Finally, the VP35 structure was saved as a "pdbqt" file (Haider et al. 2020; Quazi et al. 2021).

2.4. Selection of ligand and database virtual screening
The ".SDF" file of the FGI-103 antiviral drug was downloaded from the PubChem database.
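The retrieval step above can be scripted. The following is a minimal sketch, assuming PubChem's PUG REST URL scheme; the helper names are hypothetical, and using the `fastsimilarity_2d` operation with a `Threshold` option to approximate the more-than-90% similarity screen is an assumption for illustration, not the authors' documented procedure. CID 5477931 is the reference compound listed in Table 2. The endpoint patterns should be checked against the current PubChem documentation before use.

```python
from urllib.parse import urlencode

# Assumed base URL of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sdf_url(cid: int) -> str:
    """Build the URL expected to return the SDF record of one compound ID."""
    return f"{PUG_REST}/compound/cid/{cid}/SDF"


def similarity_url(cid: int, threshold: int = 90, max_records: int = 5000) -> str:
    """Build the URL for a 2D similarity search around `cid`.

    `threshold` is the minimum Tanimoto similarity in percent.
    """
    query = urlencode({"Threshold": threshold, "MaxRecords": max_records})
    return f"{PUG_REST}/compound/fastsimilarity_2d/cid/{cid}/cids/JSON?{query}"


if __name__ == "__main__":
    reference_cid = 5477931  # reference compound (FGI-103) from Table 2
    print(sdf_url(reference_cid))
    print(similarity_url(reference_cid))
```

The sketch only builds the request URLs, so it can be inspected or tested without network access; fetching the results (for example with `urllib.request.urlopen`) is left to the reader.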
The active site (P1) of VP35 was identified using the online DoGSite server (https://proteins.plus/help/dogsite). The P1 residues "ALA210,14, LYS-211, LEU-215, PHE-218, ILE-230, GLN-233, VAL-234, SER-236, LYS-237, VAL-280, PRO-282, ILE-284, and CYS-315" were selected for molecular docking with the FGI-103 antiviral drug using AutoDock 4.2. FGI-103 was then used as the query to screen other drug-like compounds from the PubChem database. Pfizer's rule of five (Lipinski's rule) was used to evaluate the drug-like properties of each compound. The rule's criteria, MW ≤ 500 Da, LogP ≤ 5, HBD ≤ 5, and HBA ≤ 10, were used to screen the drug-like small molecules (Chen et al. 2020; Lipinski 2004). Compounds that passed were taken forward for further analysis. Each selected compound was energy-minimized using AutoDock 4.2 and saved as a separate "pdbqt" file for molecular docking.

2.5. Molecular docking
The finally selected drug-like compounds were docked to the P1 site of MARV VP35 using the desktop software AutoDock 4.2. Docking was carried out on a computer running the Windows 10 (x86) operating system. AutoDock 4.2 and MGLTools 1.5.4 with Python 2.7 were used for these experiments (France, Scotti, and Scotti 2019). Protein-ligand interactions were analyzed using Discovery Studio Visualizer and PyMol (D Studio and 2008; Inwood et al. 2009).
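The rule-of-five screen described in Section 2.4 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Compound` class and the two functions are hypothetical names, the thresholds are the standard Lipinski limits (molecular weight at most 500 Da, LogP at most 5, at most 5 hydrogen bond donors and at most 10 acceptors), and the descriptor values are assumed to have been computed beforehand. The first example compound uses the values reported for CID_11427396 in Table 2; the second is invented for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    mol_weight: float  # molecular weight in Da
    logp: float        # octanol-water partition coefficient
    hbd: int           # hydrogen bond donors
    hba: int           # hydrogen bond acceptors


def lipinski_violations(c: Compound) -> int:
    """Count how many of the four rule-of-five limits a compound exceeds."""
    checks = (
        c.mol_weight > 500,
        c.logp > 5,
        c.hbd > 5,
        c.hba > 10,
    )
    return sum(checks)


def passes_rule_of_five(c: Compound, max_violations: int = 0) -> bool:
    """Return True if the compound has at most `max_violations` violations."""
    return lipinski_violations(c) <= max_violations


compounds = [
    Compound("CID_11427396", 329.36, 4.32, 3, 1),       # values from Table 2
    Compound("hypothetical_large", 612.7, 6.1, 7, 12),  # invented for contrast
]
drug_like = [c.name for c in compounds if passes_rule_of_five(c)]
print(drug_like)
```

Setting `max_violations=1` reproduces the more lenient convention in which a single violation is tolerated; the strict default matches a "zero violations" criterion.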
For molecular docking, the receptor and ligand were used after energy minimization, and both structures were saved as ".pdbqt" files. Grid maps of the interaction energies for all atom types were generated using the AutoGrid algorithm of AutoDock 4.2. For every docking run, a grid box was drawn around the P1 site of MARV VP35 using grid maps of 50 × 50 × 50 and 70 × 70 × 70 points with grid spacings of 0.38 Å and 0.44 Å, respectively. Docking was completed with the parameters described in our previous studies (Haider et al. 2020; Quazi et al. 2021). The "S" value is the docking score between the receptor and the ligand; a more negative "S" indicates a stronger binding affinity of the ligand for the receptor. The root mean square deviation (RMSD) value was used to compare the docked conformations of the ligand with the receptor. The docking score and RMSD value of each ligand were calculated using the default scoring parameters in AutoDock 4.2 (Haider et al. 2020). The "S" score and binding energy of the finalized compounds were compared with the values for FGI-103. Small molecules with binding energy similar to or greater than that of FGI-103 were selected and considered for further analysis.

2.6. ADMET profiling
The ADMET properties of the finally selected drug-like compounds were checked using the admetSAR tool (Immd.ecust.edu.cn/admetsar2). admetSAR predicts multiple toxic effects, including mutagenicity, irritation, and competitiveness. The drug-like properties in the ADMET profile help select antiviral drug candidates that are safe for human use (Fatima et al. 2020).

3. RESULTS

3.1. Amino acid sequence retrieval and analysis
The VP35 primary sequence of 329 amino acids (AA) was obtained from the NCBI database. The stability of a protein structure depends on its three-dimensional conformation. The sequence of the target protein was characterized in terms of its physical and chemical properties. The physicochemical properties estimated using Expasy-ProtParam gave a molecular weight (MW) of 36201.85, an isoelectric point (pI) of 8.84, and a grand average of hydropathicity of -0.193. All the physical and chemical properties of the VP35 protein are shown in Table 1.

3.2. Structure prediction, evaluation, and validation of protein
The 3D structure of the VP35 protein was predicted by homology modelling. The template "c4gh9A", which showed 85% sequence similarity, was downloaded from the PDB. Homology modelling was done using the MODELLER desktop software. The best of ten VP35 models was chosen based on its DOPE and GA scores (-33763.4352 and 336, respectively) (Figure 1). The geometric analysis of the predicted 3D VP35 structure was performed using the PROCHECK tool, which showed that most amino acids, approximately 319 of 329 (97.1%), were located in favorable regions. The 3D structure of VP35 was therefore considered reliable and stable enough for further study (Figure 2).

Figure 1.
The 3D structure of VP35 predicted using template "c4gh9A", visualized in PyMol.

Table 1. Physicochemical characteristics of the MARV VP35 protein

Physical and chemical properties
Property | Value
Molecular weight (MW) | 36201.85
Number of amino acids | 329
Isoelectric point | 8.84
Instability index | 40.86
Total number of negatively charged residues (Asp + Glu) | 33
Total number of positively charged residues (Arg + Lys) | 38
Aliphatic index | 90.12
Grand average of hydropathicity | -0.193
Estimated half-life | 30 h
Carbon atoms | 1611
Hydrogen atoms | 2600
Nitrogen atoms | 438
Oxygen atoms | 478
Sulphur atoms | 14
Molecular formula | C1611H2600N438O478S14
Total number of atoms | 5141

Amino acid composition
Amino acid | Count | Composition (%)
Alanine | 27 | 8.2
Arginine | 13 | 4.0
Asparagine | 11 | 3.3
Aspartic acid | 16 | 4.9
Cysteine | 5 | 1.5
Glutamine | 17 | 5.2
Glutamic acid | 17 | 5.2
Glycine | 18 | 5.5
Histidine | 7 | 2.1
Isoleucine | 18 | 5.5
Leucine | 34 | 10.3
Lysine | 25 | 7.6
Methionine | 9 | 2.7
Phenylalanine | 11 | 3.3
Proline | 21 | 6.4
Serine | 24 | 7.3
Threonine | 24 | 7.3
Tryptophan | 3 | 0.9
Tyrosine | 6 | 1.8
Valine | 23 | 7.0

Figure 2.
Assessment of the Ramachandran plot of MARV VP35, showing that 97.1% of amino acids lie in favorable regions, about 2.5% in allowed regions, and 0.4% in outlier regions.

3.3. Selection of ligand and database virtual screening
The VP35 protein was docked with the antiviral compound FGI-103. The results indicated that FGI-103 interacts with MARV VP35. The analysis showed that FGI-103 formed a complex with VP35 with an "S" score of -13.46, an RMSD of 1.53, and a binding energy of -16.70 (Table 2). The interaction analysis showed that GLN233 formed a strong contact with FGI-103 by hydrogen bonding and GLN230 made a strong contact through polar interaction, while LYS241, VAL279, ILE230, and residue 238 were involved in van der Waals interactions (Figure 3). In our research, compounds with more than 90% structural similarity to FGI-103 were chosen by virtual screening of the PubChem database. The resulting 4260 compounds were cross-checked against Pfizer's rule of five. Of these, 50 drug-like small molecules were energy-minimized and placed into a separate database for docking with the VP35 protein.

Figure 3.
(A) 3D structure of MARV VP35 (shown in blue) in complex with FGI-103 (shown in yellow). (B) 3D ligand complex between FGI-103 and the P1 site of MARV VP35. (C) 2D ligand complex between FGI-103 and the P1 site of MARV VP35.

3.4. Molecular docking
Molecular docking plays an important role in creating modern drugs against various lethal illnesses (Ursic-Bedoya et al. 2014). All of the hits were docked against the P1 site of MARV VP35 using AutoDock. Two compounds were recorded with a lower S/docking score than FGI-103. The top docked compounds with lower S scores and RMSD values were selected for further evaluation. The binding interactions of these two hit compounds with VP35 were determined using Discovery Studio tools. The best poses were ranked by the minimum binding energy in the largest cluster and by the number of hydrogen bonds formed with the amino acid residues of P1 (Table 2). This was done to ensure that the compounds were bound in the correct binding position. The successful inhibitors showed substantial interaction with the P1 site of MARV VP35 (Figure 4).

Figure 4. (A) 3D complex of VP35 (shown in blue) interacting with the novel inhibitor CID_3007938 (shown in orange). (B) 3D complex of VP35 (shown in blue) interacting with the novel inhibitor CID_11427396 (shown in white).
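The selection step described in the molecular docking results above, keeping only hits that score better than the reference ligand, can be expressed as a simple filter. The sketch below is an illustration under stated assumptions rather than the authors' code: the `DockingResult` type and `select_hits` function are hypothetical, the RMSD window of 1 to 2 is taken from the abstract, and the numbers for the three named compounds are the values reported in Table 2. The fourth entry is invented to show a rejected ligand.

```python
from typing import NamedTuple


class DockingResult(NamedTuple):
    cid: str
    s_score: float         # docking score; more negative is better
    rmsd: float
    binding_energy: float  # kcal/mol; more negative is stronger


def select_hits(results, reference, rmsd_range=(1.0, 2.0)):
    """Keep ligands whose S score is lower than the reference, whose binding
    energy is at least as strong as the reference, and whose RMSD lies inside
    the accepted window. Hits are returned best S score first."""
    low, high = rmsd_range
    hits = [
        r for r in results
        if r.s_score < reference.s_score
        and r.binding_energy <= reference.binding_energy
        and low <= r.rmsd <= high
    ]
    return sorted(hits, key=lambda r: r.s_score)


reference = DockingResult("CID_5477931", -13.46, 1.53, -16.70)  # FGI-103
candidates = [
    DockingResult("CID_3007938", -16.33, 1.85, -19.45),
    DockingResult("CID_11427396", -17.78, 1.54, -21.23),
    DockingResult("hypothetical_weak", -11.20, 1.40, -14.10),  # invented
]

for hit in select_hits(candidates, reference):
    print(hit.cid, hit.s_score, hit.binding_energy)
```

Sorting by S score puts CID_11427396 ahead of CID_3007938, which agrees with the ordering of docking scores in Table 2.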
3.2. ADMET profiling

The Molinspiration server was used to cross-check the drug-like parameters of the proposed small molecules against MARV VP35. The selected compounds showed zero violations of Pfizer's (Lipinski's) rule of five, and their drug properties, including M.W., HBD, HBA, LogP and tPSA, were recorded (Table 2). The safety properties of the drugs in a living organism were evaluated next. ADMET stands for Absorption, Distribution, Metabolism, Excretion and Toxicity. The ADMET analysis was performed using the admetSAR server. The ADME analysis of all the finally selected compounds showed zero violations for use in a living organism (Table 3).

Table 2: Predicted favorable docking results and Pfizer's properties of the finalized drug compounds against VP35.
PubChem # | Name | Docking score | RMSD | Binding energy (kcal/mol) | Pfizer's properties
CID_5477931 | 1H-Benzimidazole-6-carboximidamide, 2-(2-(5-(aminoiminomethyl)-2-benzofuranyl)ethenyl) | -13.46 | 1.53 | -16.70 | MW=347.402, LogP=-1.173, HBD=6, HBA=0, tPSA=146.290
CID_3007938 | 2-[4-[5-(4-aminophenyl)-2-furyl]phenyl]-N-isopropyl-3H-benzimidazole-5-carboxamidine; 1H-Benzimidazole-6-carboximidamide, 2-[4-[5-(4-aminophenyl)-2-furanyl]phenyl]-N-(1-methylethyl) | -16.33 | 1.85 | -19.45 | MW=435.53, LogP=5.85, HBD=3, HBA=2, tPSA=106.22
CID_11427396 | 2-Dibenzofuran-2-yl-1H-benzoimidazole-5-carboxylic acid amide | -17.78 | 1.54 | -21.23 | MW=329.36, LogP=4.32, HBD=3, HBA=1, tPSA=80.32
*HBD = hydrogen bond donor, HBA = hydrogen bond acceptor.

Table 3: ADME analysis of the finalized drug compounds against VP35.
Absorption
PubChem # | Blood-Brain Barrier | Human Intestinal Absorption | Caco-2 permeability | P-glycoprotein inhibitor | Renal Organic Cation Transporter
CID_5477931 | Positive (+) | Positive (+) | Negative (-) | Non-Inhibitor | Non-Inhibitor
CID_3007938 | Positive (+) | Positive (+) | Negative (-) | Inhibitor | Non-Inhibitor
CID_11427396 | Positive (+) | Positive (+) | Negative (-) | Non-Inhibitor | Non-Inhibitor

Metabolism
PubChem # | CYP450 1A2 | CYP450 2C9 | CYP450 2D6 | CYP450 2C19 | CYP450 3A4
CID_5477931 | Non-Inhibitor | Non-Inhibitor | Non-Inhibitor | Non-Inhibitor | Non-Inhibitor
CID_3007938 | Inhibitor | Non-Inhibitor | Non-Inhibitor | Non-Inhibitor | Non-Inhibitor
CID_11427396 | Inhibitor | Non-Inhibitor | Non-Inhibitor | Inhibitor | Non-Inhibitor

Toxicity
PubChem # | AMES analysis | Carcinogenic analysis
CID_5477931 | Non-Poisonous | Non-dangerous
CID_3007938 | Non-Poisonous | Non-dangerous
CID_11427396 | Non-Poisonous | Non-dangerous

3.3. Analysis of receptor-ligand interaction

The S/docking score measures the interaction strength between VP35 and the drug compounds; therefore, the drug-like small molecules were chosen based on their S-score and binding energy. The compounds CID_3007938 and CID_11427396 bind strongly to the P1 active site of VP35. The receptor-ligand complexes of CID_3007938 and CID_11427396 with VP35 showed that hydrogen bonding and van der Waals interactions formed a stable complex.
LYS237 formed a hydrogen bond with the CID_3007938 ligand, while other amino acids of the P1 site formed van der Waals interactions with it (Figure 5 A). VAL280 formed a hydrogen bond with the CID_11427396 ligand, while other amino acids of the P1 site formed van der Waals interactions with it (Figure 5 B). The 2D and 3D pocket configurations of these drug-like small molecules are shown in Figure 5.

Figure 5. 2D and 3D representations of the receptor-ligand binding interactions. (A) 2D interaction of ligand CID_3007938 with VP35. (B) 3D pocket of VP35 with ligand CID_3007938. (C) 2D interaction of ligand CID_11427396 with VP35. (D) 3D pocket of VP35 with ligand CID_11427396.

Figure 6. 2D molecular structures of the selected drug-like compounds. (A) 1H-Benzimidazole-6-carboximidamide, 2-(2-(5-(aminoiminomethyl)-2-benzofuranyl)ethenyl). (B) 2-[4-[5-(4-aminophenyl)-2-furyl]phenyl]-N-isopropyl-3H-benzimidazole-5-carboxamidine (1H-Benzimidazole-6-carboximidamide, 2-[4-[5-(4-aminophenyl)-2-furanyl]phenyl]-N-(1-methylethyl)). (C) 2-Dibenzofuran-2-yl-1H-benzimidazole-5-carboxylic acid amide.

4. DISCUSSION

Many studies have been conducted to discover effective therapeutic vaccines against MARV, but an effective treatment for MARV is not yet available. MARV is now considered a global problem, and a less expensive and effective antiviral drug against MARV is still needed (Brown et al. 2014). Traditional drug development approaches are costly and unproductive for addressing evolving public health challenges (Velmurugan, Mythily & Rao 2014). Therefore, approaches that can cope with this situation should be adopted. In silico drug design strategies are becoming popular in the pharmaceutical industry because they are fast, less expensive, and time-saving for identifying new drugs (Geisbert, Bausch, and Feldmann 2010). MARV's VP35 viral protein is a promising candidate for vaccine design against MARV infection. For these reasons, the current study proposes small drug-like molecules that are predicted to inhibit MARV replication by binding firmly to the P1 site of MARV VP35 and could be considered as pharmacological compounds. The three small drug-like molecules were also analyzed for ADMET properties with the admetSAR server. All the selected compounds passed the ADMET criteria. The blood-brain barrier is formed by endothelial cells that act as a barrier and limit the uptake of medicines into the brain. Therefore, the blood-brain barrier is considered an integral feature in drug design (Alavijeh et al. 2005; Cheng et al. 2012; Stamatovic, Keep and Andjelkovic 2008).
Oral bioavailability is a significant factor for the drug-likeness of an active compound as a therapeutic agent (Varma et al. 2010). The ADMET properties of the drug-like small molecules gave favorable results for parameters relevant to effective treatment, such as P-glycoprotein substrate (inhibitor/non-inhibitor), blood-brain barrier penetration (positive/negative), human intestinal absorption (positive/negative), renal organic cation transporter (inhibitor/non-inhibitor) and Caco-2 permeability (positive/negative). Cytochrome P450 (CYP) comprises a family of isoenzymes that are active in the catabolism of many chemicals, including hormones, medicines, bile acids, and carcinogens. The ADMET test is useful and efficient for screening drug compounds and consisted of the following parameters: (1) blood-brain barrier penetration, (2) human intestinal absorption, (3) Caco-2 permeability, (4) non-toxic, (5) non-carcinogenic, and (6) non-inhibitor of the CYP enzymes. These ADMET criteria were met by the two compounds CID_3007938 and CID_11427396 (Table 3).

5. CONCLUSION

The current research focused on structure-based virtual screening using the PubChem online database, Pfizer/Lipinski analysis, molecular docking, ADMET analysis, and evaluation of the interactions between the ligands and the P1 site of MARV VP35. The drug-like compounds CID_3007938 and CID_11427396 showed strong binding to the P1 active site of MARV VP35 through hydrogen bonds, van der Waals and polar interactions.
The results suggest that they could hypothetically be applied as drugs against MARV. These compounds may act as novel, fundamentally distinct and potentially active pharmaceutical compounds against MARV VP35. The molecular structures of the three drug-like compounds are shown in Figure 6. Our in silico research found that two small molecules have the potential to be developed as therapeutic drugs against MARV by targeting the P1 site of VP35. Consequently, the two small drug-like molecules CID_3007938 and CID_11427396 require further investigation in future in vitro and in vivo experiments before possible validation by the competent authorities.

REFERENCES

Anthony, Scott M, and Steven B Bradfute. 2015. "Filoviruses: One of These Things Is (Not) like the Other." Viruses 7(10): 5172–90.
Bamberg, Sandra et al. 2005. "VP24 of Marburg Virus Influences Formation of Infectious Particles." Journal of Virology 79(21): 13421–33.
Bausch, Daniel G et al. 2006. "Marburg Hemorrhagic Fever Associated with Multiple Genetic Lineages of Virus." New England Journal of Medicine 355(9): 909–19.
Becker, S et al. 1998. "Interactions of Marburg Virus Nucleocapsid Proteins." Virology 249(2): 406–17.
Biacchesi, Stéphane et al. 2003. "Genetic Diversity between Human Metapneumovirus Subgroups." Virology 315(1): 1–9.
Carroll, Serena A et al. 2013. "Molecular Evolution of Viruses of the Family Filoviridae Based on 97 Whole-Genome Sequences." Journal of Virology 87(5): 2608–16.
Chen, Xiaoxia et al. 2020. "Analysis of the Physicochemical Properties of Acaricides Based on Lipinski's Rule of Five." Journal of Computational Biology 27(9): 1397–1406.
D Studio. 2008. "Discovery Studio Life Science Modeling and Simulations." Researchgate.Net: 1–8.
Dolnik, Olga, Larissa Kolesnikova, Lea Stevermann, and Stephan Becker. 2010. "Tsg101 Is Recruited by a Late Domain of the Nucleocapsid Protein to Support Budding of Marburg Virus-like Particles." Journal of Virology 84(15): 7847–56.
Eswar, Narayanan et al. 2006. "Comparative Protein Structure Modeling Using Modeller." Current Protocols in Bioinformatics 15(1): 5–6.
Fatima, Shehnaz et al. 2020. "ADMET Profiling of Geographically Diverse Phytochemical Using Chemoinformatic Tools." Future Medicinal Chemistry 12(1): 69–87.
France, Alex, Marcus Scotti, and Luciana Scotti. 2019. "Molecular Docking of Fructose-Derived Nucleoside Analogs against Reverse Transcriptase of HIV-1."
Garg, Vijay Kumar et al. 2016. "MFPPI–Multi FASTA ProtParam Interface." Bioinformation 12(2): 74.
Haider, Zeshan et al. 2020. "In-Silico Pharmacophoric and Molecular Docking-Based Drug Discovery against the Main Protease (M pro) of SARS-CoV-2, a Causative Agent COVID-19." Pak. J. Pharm. Sci 33(6): 2697–2705.
Inwood, William B et al. 2009. "Genetic Evidence for an Essential Oscillation of Transmembrane-Spanning Segment 5 in the Escherichia Coli Ammonium Channel AmtB." Genetics 183(4): 1341–55.
Jenuth, Jack P. 2000. "The NCBI." In Bioinformatics Methods and Protocols, Springer, 301–12.
Kolb, Ryan et al. 2014. "Inflammasomes in Cancer: A Double-Edged Sword." Protein & Cell 5(1): 12–20.
Laskowski, Roman A, Malcolm W MacArthur, David S Moss, and Janet M Thornton. 1993. "PROCHECK: A Program to Check the Stereochemical Quality of Protein Structures." Journal of Applied Crystallography 26(2): 283–91.
Lipinski, Christopher A. 2004. "Lead-and Drug-like Compounds: The Rule-of-Five Revolution." Drug Discovery Today: Technologies 1(4): 337–41.
Lyne, Paul D. 2002. "Structure-Based Virtual Screening: An Overview." Drug Discovery Today 7(20): 1047–55.
Mehedi, Masfique, Allison Groseth, Heinz Feldmann, and Hideki Ebihara. 2011. "Clinical Aspects of Marburg Hemorrhagic Fever." Future Virology 6(9): 1091–1106. https://pubmed.ncbi.nlm.nih.gov/22046196.
Quazi, Sameer et al. 2021. "In-Silico Structural and Molecular Docking-Based Drug Discovery Against Viral Protein (VP40) of Marburg Virus: A Causative Agent of MAVD." bioRxiv.
Ramanan, Parameshwaran et al. 2012. "Structural Basis for Marburg Virus VP35–Mediated Immune Evasion Mechanisms." Proceedings of the National Academy of Sciences 109(50): 20661–66.
Towner, Jonathan S et al. 2006. "Marburgvirus Genomics and Association with a Large Hemorrhagic Fever Outbreak in Angola." Journal of Virology 80(13): 6497–6516.
Ursic-Bedoya, Raul et al. 2014. "Protection Against Lethal Marburg Virus Infection Mediated by Lipid Encapsulated Small Interfering RNA." The Journal of Infectious Diseases 209(4): 562–70. https://doi.org/10.1093/infdis/jit465.
Wang, Lin-Fa et al. 2001. "Molecular Biology of Hendra and Nipah Viruses." Microbes and Infection 3(4): 279–87.
Warren, Travis K et al. 2010. "Antiviral Activity of a Small-Molecule Inhibitor of Filovirus Infection." Antimicrobial Agents and Chemotherapy 54(5): 2152–59.
Zhu, Tengfei et al. 2017. "Crystal Structure of the Marburg Virus Nucleoprotein Core Domain Chaperoned by a VP35 Peptide Reveals a Conserved Drug Target for Filovirus." Journal of Virology 91(18).
10_1101-2021_02_09_430460 ---- 99120235 Sequence neighborhoods enable reliable prediction of pathogenic mutations in cancer genomes

Shayantan Banerjee1,2,3, Karthik Raman1,2,3* & Balaraman Ravindran1,2,4*

1Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), Indian Institute of Technology (IIT) Madras, Chennai - 600 036
2Initiative for Biological Systems Engineering, IIT Madras, Chennai - 600 036
3Bhupat and Jyoti Mehta School of Biosciences, Department of Biotechnology, IIT Madras, Chennai - 600 036
4Department of Computer Science and Engineering, IIT Madras, Chennai - 600 036
*Corresponding author

Abstract

Identifying cancer-causing mutations from sequenced cancer genomes holds much promise for targeted therapy and precision medicine. "Driver" mutations are primarily responsible for cancer progression, while "passengers" are functionally neutral. Although several computational approaches have been developed for distinguishing between driver and passenger mutations, very few have concentrated on utilizing the raw nucleotide sequences surrounding a particular mutation as potential features for building predictive models. Using experimentally validated cancer mutation data in this study, we explored various string-based feature representation techniques to incorporate information on the neighborhood bases immediately 5' and 3' from each mutated position. Density estimation methods showed significant distributional differences between the neighborhood bases surrounding driver and passenger mutations. Binary classification models derived using repeated cross-validation experiments gave comparable performances across all window sizes.
Integrating sequence features derived from raw nucleotide sequences with other genomic, structural and evolutionary features resulted in the development of a pan-cancer mutation effect prediction tool, NBDriver, which was highly efficient in identifying pathogenic variants from five independent validation datasets. An ensemble predictor obtained by combining the predictions from NBDriver with two other commonly used driver prediction tools (CONDEL and Mutation Taster) outperformed existing pan-cancer models in prioritizing a literature-curated list of driver and passenger mutations. Using the list of true positive mutation predictions derived from NBDriver, we identified a list of 138 known driver genes with functional evidence from various sources. Overall, our study underscores the efficacy of utilizing raw nucleotide sequences as features to distinguish between driver and passenger mutations from sequenced cancer genomes.

bioRxiv preprint; this version posted February 11, 2021. doi: https://doi.org/10.1101/2021.02.09.430460. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Introduction

Cancer is caused by the accumulation of somatic mutations during an individual's lifetime [1]. These mutations arise from both endogenous factors, such as errors during DNA replication, and exogenous factors, such as substantial exposure to mutagens like tobacco smoke, UV light, and radon gas [2]–[4]. These somatic mutations can be of different types, ranging from single-nucleotide variants (SNVs) to insertions and deletions of a few nucleotides, copy-number aberrations (CNAs), and large-scale rearrangements known as structural variants (SVs) [5].
With the advent of high-throughput sequencing, the identification of somatic mutations from sequenced cancer genomes has become easier. International cancer genomics projects have resulted in the development of large mutational databases such as the Catalogue Of Somatic Mutations In Cancer (COSMIC) [6], the International Cancer Genome Consortium (ICGC) [7], and The Cancer Genome Atlas (TCGA) [8]. Several open-access resources to analyze and visualize large cancer genomics datasets, such as the cBio Cancer Genomics Portal [9] and the Database of Curated Mutations in cancer (DoCM) [10], have also been developed. These resources aggregate functionally relevant cancer variants from different studies and give researchers easy access to expert-curated lists of pathogenic somatic variants.

However, not all somatic mutations present in the cancer genome are equally responsible for developing the disease. A small fraction of somatic variants known as "driver mutations" provide a growth advantage and are positively selected for during cancer cell development [1]. On the other hand, "passenger mutations" provide no growth advantage and do not contribute to cancer progression [1]. Identifying the complete set of cancer-causing genes that harbor driver mutations, also known as driver genes, holds much promise for precision medicine, where a specific therapeutic intervention is tailored towards a patient's mutational profile [11].

Distinguishing between driver and passenger mutations from sequenced cancer genomes is a non-trivial task. Doing so solely based on the substitution type (A->T, G->C, etc.) is very difficult. Hence, several computational methods that use other factors to identify driver mutations have been developed over the years.
Recurrence-based driver prioritization tools such as MutSigCV [12] and MuSiC [13] for single-nucleotide variants, and GISTIC2 [14] for copy number aberrations, have been developed to identify variants that occur more frequently than expected by chance given the "background mutation rate". Other methods such as SIFT [15], PROVEAN [16], PolyPhen-2 [17], CHASM [18], and FATHMM [19] are based on predicting the functional impact of mutations on the protein encoded by the gene. Expert-curated databases such as the OncoKB database [20] contain information regarding the functional impact of over 3000 cancer-causing alterations belonging to over 400 genes. Pathway analysis based tools such as NetBox [21] and HotNet [22] work by identifying mutations affecting large-scale gene regulatory or protein–protein interaction networks. Machine learning-based methods have also recently been developed to predict deleterious missense mutations [23]–[28].

Genome instability, demonstrated by a higher-than-average rate of substitution, insertion, and deletion of one or more nucleotides, is a hallmark of most cancer cells. There is considerable variation in the rates of SNPs across the human genome. Sequence context plays a significant role in the variability of substitution rates, as illustrated by CpG dinucleotides, which exhibit a C->T substitution rate elevated almost 15-fold relative to the average rate observed in mammals [29]. Mutational hotspots such as CpG dinucleotides in breast and colorectal cancer [30] and TpC dinucleotides in lung cancer, melanoma, and ovarian cancer [31] are some examples of "signatures" that promote mutagenesis.
There have been several efforts to use sequence context to measure substitution rates in the human genome. Aggarwala et al. [32] used the local sequence context of SNPs to explain the observed variability in substitution rates. Zhao et al. [33] studied the neighboring nucleotide biases and their effect on the mutational and evolutionary processes for over two million SNPs. Recent studies have identified specific signatures or patterns of mutations in different cancer types that shed light on the underlying mechanisms responsible for cancer progression [34], [35]. Alexandrov et al. [34] identified 21 distinct mutational signatures in human cancers by considering the substitution class and the sequence context immediately 3' and 5' of the mutated base. Several studies have demonstrated that certain factors such as tobacco smoking, UV light, or the inactivation of tumor suppressor genes involved in DNA repair can result in the development of mutational hotspots [31], [34], [36].

To the best of our knowledge, two recently published studies have used nucleotide context to identify drivers. Dietlein et al. [61] hypothesized that driver mutations occur more frequently in "unusual" nucleotide positions than passenger mutations and built probabilistic models to identify driver genes that had mutations in those "unusual" contexts. Agajanian et al. [37] integrated classical machine learning and deep learning approaches to model raw nucleotide sequences to differentiate between driver and passenger mutations.

In this study, our overall aim is to build models utilizing machine learning and natural language processing techniques to differentiate between driver and passenger mutations solely based on the raw nucleotide context. Using missense mutation data with experimentally validated functional impacts compiled from various studies, we show that the underlying probability distributions of the neighborhoods of driver and passenger mutations are significantly different from one another. We extracted features from the neighborhood nucleotide sequences and built robust binary classification models to distinguish between the two classes of mutations. We achieved good classification performances during our repeated cross-validation experiments and against an independent hold-out set of literature-curated mutations. Integrating neighborhood features with other features such as protein physicochemical properties and evolutionary conservation scores significantly improved our algorithm's overall predictive power in identifying pathogenic variants from five separate independent test sets, and gave performance comparable with some of the existing state-of-the-art mutation effect prediction tools.
Overall, this study establishes that efficient feature representations of the neighborhood sequences of cancer-causing mutations can be leveraged to differentiate between known driver and passenger mutations with sufficient discriminative power.

Methods

Mutation Datasets for Building and Evaluating the Models

Our training data consisted of the list of missense mutations whose effects were determined from experimental assays and were compiled in the study conducted by Brown et al. [37]. In that study, missense mutations from 58 pan-cancer genes were combined from five different datasets [38], [75]–[79] (Supplementary Table 1). These mutations were presented as amino acid substitutions based on their protein coordinates (e.g., F595L, L597Q, etc.). Since we were interested in studying the effects of neighboring DNA nucleotide sequences, we mapped them to their corresponding genomic coordinates (gDNA) for further analysis. We used the publicly available TransVar web interface [80] for this purpose. The final training set was made up of 5265 single nucleotide variants (4131 passengers and 1134 drivers).

For external validation, we collected somatic mutation data from five different sources. First, we considered a literature-curated list of 140 passengers and 849 driver mutations categorized based on functional evidence published by Martelotto et al. [38] as part of the benchmarking study to rank various mutation effect prediction algorithms. Second, we used a subset of mutations published by the recently released Cancer Mutation Census. The Cancer Mutation Census (CMC) [6] is a database that integrates all coding somatic mutation data from the COSMIC database to prioritize variants driving different cancer forms. It contains functional evidence obtained using both manual curation and computational predictions from multiple sources.
For our validation experiments, we chose only single nucleotide variants classified as missense and derived from the CGC-classified list of tumor suppressor genes and oncogenes. Based on the database's various evidence criteria, we considered only mutations categorized as tier 1, 2, and 3 for our study. From this list, we further removed all mutations overlapping with our training set and derived a final set of 277 mutations for further analysis. The Catalog of Validated Oncogenic Mutations from the Cancer Genome Interpreter [35] database contains a high-confidence list of pathogenic alterations compiled from several sources such as DoCM [10], ClinVar [81], OncoKB [20], and the Cancer Biomarkers Database [35]. We extracted only missense somatic mutations flagged as "cancer" for our validation experiments. After removing all mutations overlapping with our training set, we obtained a final list of 1628 driver mutations. This constituted our third validation set. The fourth validation dataset consisted of the top 50 hotspot mutations reported in the comprehensive study by Rheinbay et al. [44].
In that study, mutation data was accumulated from the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium and involved analyzing more than 2700 cancer genomes derived from more than 2500 patients. A total of 33 coding missense mutations from five well-known cancer genes (TP53, PIK3CA, NRAS, KRAS, and IDH1) were extracted from this study. Finally, Mao et al. [27] published mutation datasets to judge the performance of the driver prediction tool CanDrA in predicting rare driver mutations. They were constructed using the following criteria:

1. GBM and OVC mutations reported in the COSMIC database only once.
2. The reported mutations had no other mutations within 3 bp of their position and were not part of either the training or test datasets used to build the machine learning model (CanDrA).

We used the same datasets to judge our model's ability to predict rare driver mutations based solely on the neighborhood sequences. After removing all mutations overlapping with the training set, we obtained 34 GBM mutations and 38 OVC mutations. A summary of all the mutational datasets used in our study is available in Table 1. All our predictions were derived using the forward strand and were based on the GRCh37 (ENSEMBL release 87) build of the human genome.

Feature Extraction

Sequence-Based Features

We used the raw nucleotide sequences surrounding a mutation as features for our analysis. Each unique mutation was represented as a triplet (Chromosome, Position, Type), where "Type" refers to one of the 12 types of point substitution (A>T, A>G, A>C, T>A, T>G, T>C, G>A, G>C, G>T, C>T, C>A, C>G). We then extracted the surrounding raw nucleotide sequences from the reference genome for a given mutation position using the bedtools getfasta command. The "window size" for a particular mutation captures the number of nucleotides upstream and downstream from the mutated position.
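The window definition above can be illustrated with a minimal sketch. It operates on an in-memory string rather than calling bedtools getfasta, and the function name `extract_window`, the 1-based coordinate convention, and the toy reference sequence are all assumptions made for illustration only.

```python
def extract_window(reference: str, position: int, window: int) -> str:
    """Return the mutated base plus `window` bases on each side.

    `position` is assumed to be 1-based, as in typical genomic coordinates.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    start = position - 1 - window  # 0-based index of the first upstream base
    end = position + window        # exclusive end index
    if start < 0 or end > len(reference):
        raise ValueError("window extends past the end of the sequence")
    return reference[start:end]


# Hypothetical toy reference; the mutated base is the 'T' at position 4.
toy_reference = "ATTTGGA"
neighborhood = extract_window(toy_reference, position=4, window=3)
```

A window size of n always yields a string of length 2n + 1, because the wild-type base at the mutated position is included.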
Hence, considering all possible window sizes between 1 and 10, and including the wild-type nucleotide at the mutated position, we obtained nucleotide strings of length 3, 5, 7, 9, 11, 13, 15, 17, 19, and 21, respectively. We also considered the chromosome number and the type of point substitution as features for our analysis. For a particular window size, to map the nucleotide strings to a numerical format, we used the following two widely used feature transformation approaches (Figure 1):

1. One-hot encoding: Each neighboring nucleotide was represented as a binary vector of size 4 containing all zero values except at the nucleotide index, which was marked as 1. Thus "A" was encoded as [1,0,0,0], "G" as [0,1,0,0], and so on. This feature representation resulted in a feature space of size $8n + 2$, where $n = 1, 2, 3, \ldots, 10$ represents the window size. We used the pandas get_dummies() function to perform this task.

2. Overlapping k-mers: In this type of feature representation, the neighboring nucleotide sequences for a given window size were represented as overlapping k-mers of lengths 2, 3, and 4. For instance, an arbitrary sequence of window size 3, {ATTTGGA}, where the fourth base 'T' is the wild-type base at the mutated position, can be decomposed into overlapping k-mers of size 2 {AT, TT, TT, TG, GG, GA}, size 3 {ATT, TTT, TTG, TGG, GGA}, and size 4 {ATTT, TTTG, TTGG, TGGA}, respectively.
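The two encodings can be sketched in a few lines of plain Python. This is an illustrative sketch, not the pandas-based implementation used in the study: the column order for "C" and "T" is an assumption (the text only fixes "A" and "G"), and excluding the central base from the one-hot vector is inferred from the stated feature-space size of 8n + 2.

```python
NUCLEOTIDES = "AGCT"  # "A" and "G" positions follow the text; "C", "T" order is assumed


def one_hot(base: str) -> list:
    """Encode one nucleotide as a length-4 binary vector."""
    vector = [0, 0, 0, 0]
    vector[NUCLEOTIDES.index(base)] = 1
    return vector


def one_hot_neighbors(sequence: str) -> list:
    """Concatenate one-hot vectors of all bases except the central (mutated) one."""
    center = len(sequence) // 2
    encoded = []
    for index, base in enumerate(sequence):
        if index != center:
            encoded.extend(one_hot(base))
    return encoded


def overlapping_kmers(sequence: str, k: int) -> list:
    """Slide a window of length k over the sequence one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


two_mers = overlapping_kmers("ATTTGGA", 2)
neighbor_features = one_hot_neighbors("ATTTGGA")
```

For a window size of 3, the neighbor encoding has 24 entries; adding the chromosome number and the substitution type gives the 8n + 2 = 26 features described above.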
To map these overlapping k-mers to a numerical format, we applied two commonly used encoding techniques known as CountVectorizer and TfidfVectorizer. The CountVectorizer returns a vector encoding whose length is equal to that of the vocabulary (the total number of unique k-mers in the dataset) and contains an integer count of the number of times a given k-mer has appeared in our dataset. A Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer assigns scores to each k-mer based on i) how often the given k-mer appears in the dataset and ii) how much information the given k-mer provides, i.e., whether it is common or rare in our dataset. Mathematically, for a given term i present in a document j, the TF-IDF score is given by

$$\mathrm{tf\text{-}idf}_{i,j} = \mathrm{freq}_{i,j} \times \log\frac{N}{d_i}$$

where $\mathrm{freq}_{i,j}$ is the number of occurrences of i in j, $d_i$ is the number of documents containing i, and N is the total number of documents. These techniques were implemented in Python using the feature_extraction module from scikit-learn. The final processed training set used to build the machine learning models was represented as a matrix of size $m \times n$, where m is the total number of coding point mutations and n is the size of the vocabulary. The matrix entries were the TF-IDF or the CountVectorizer scores. The number of one-hot encoded features, k-mers, and the size of the vocabulary possible for each window size is shown in Table 2.

Descriptive Genomic Features

In addition to the neighborhood features, a set of 27 features (Supplementary Table 2) previously used to train the cancer-specific missense mutation annotation tool CanDrA [27] were extracted from the following three data portals: CHASM's SNVBOX [18], Mutation Assessor [25], and ANNOVAR [82].
Among them were conservation scores (such as 'GERP' scores, 'HMMPHC' scores, and others), amino acid substitution features (such as 'PREDRSAE', 'PredBFactorS', and others), exon features (such as 'ExonSnpDensity', 'ExonConservation', and others), features indicative of protein domain knowledge (such as 'UniprotDOM_PostModEnz', 'UniprotREGIONS', and others), and functional impact scores computed by algorithms such as VEST [23] and CHASM [18]. A tiny fraction (0.1%) of the UniProtKB annotations were not available from the SNVBOX database for our training data. We used a k-nearest neighbors-based imputation technique to substitute the missing features with those of the same gene's nearest mutations. Our external validation datasets were free from any missing information.

Density Estimation

A kernel density estimator (KDE) takes an n-dimensional dataset as input and outputs an estimate of the underlying n-dimensional probability distribution. A Gaussian KDE uses a mixture of n-dimensional Gaussian probability distributions to represent the density being estimated. It essentially centers one Gaussian component on each data point, resulting in a non-parametric estimate of the density.
One of the hyperparameters of a kernel density estimator is the bandwidth, which controls the kernel's size at each data point, thereby affecting the "smoothness" of the resulting curve. We estimated the underlying probability distributions for the driver and passenger neighborhoods using a Gaussian kernel density estimator. The schematic workflow of the entire process for a single run of the kernel density estimation experiment is shown in Figure 2(A-F). First, for a single run of the kernel density estimation algorithm and a particular window size, we randomly selected, with replacement, an equal number (n) of driver and passenger mutations from our training data (Figure 2A). Then, we tuned the bandwidth hyperparameter for each class of mutations using a 5-fold cross-validation approach and used the best parameters to derive the kernel density estimates (Figure 2B). Finally, we used the Jensen-Shannon (JS) distance metric to calculate the similarity between the two class-wise density estimates (Figure 2C). The JS distance between two probability distributions is based on the Kullback-Leibler (KL) divergence, but unlike the KL divergence, it is bounded and symmetric. For two probability vectors p and q, it is given by

$$JS = \sqrt{\frac{D(p \,\|\, m) + D(q \,\|\, m)}{2}}$$

where $m = \frac{1}{2}(p + q)$ and D is the KL divergence.

The significance of the estimated distances between the probability estimates was calculated using a randomized bootstrapping approach. Specifically, we randomly sampled with replacement twice the number (2n) of mutations from the same training set, irrespective of the labels. We then split the dataset in half, randomly assigning each half to driver and passenger mutations, respectively (Figure 2D). This was followed by a similar process of tuning the hyperparameters and deriving the class-wise density estimates (Figure 2E). Finally, we reported the JS distance between the density estimates (Figure 2F).
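As a worked illustration of the JS distance defined above, the following minimal sketch computes it directly from the KL divergence for two discrete probability vectors. It is not the scipy routine used in the study; the function names are illustrative, the inputs are assumed to be already normalized, and the base-2 logarithm is an assumption chosen so that the distance falls between 0 and 1.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance: sqrt((D(p || m) + D(q || m)) / 2), m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    total = kl_divergence(p, m, base) + kl_divergence(q, m, base)
    return math.sqrt(max(0.0, total) / 2)  # guard against tiny negative rounding


identical = js_distance([0.5, 0.5], [0.5, 0.5])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a distance of 0, and distributions with no shared support give the maximum value, which is 1 under the base-2 assumption.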
We experimented with the following seven different neighborhood-based feature representations:

● One-hot encoding
● Count Vectorizer (k-mer sizes of 2, 3, and 4)
● TF-IDF Vectorizer (k-mer sizes of 2, 3, and 4)

The aforementioned KDE estimation experiments were repeated 30 times for all possible window sizes between 1 and 10 and all seven feature representations. Next, the best median JS distance estimate from the original experiments was reported for the given window size. The percentage of runs of the randomized experiments for which the estimated distance was greater than the original estimate was reported as the p-value. The KernelDensity() function from the scikit-learn neighbors module was used to derive the density estimates, and jensenshannon() from the scipy spatial.distance submodule was used to calculate the distance metric.

Classification Models

To build our binary classification models, we implemented three classifiers: the Random Forest classifier, the Extra Trees classifier (Extreme Random Forests), and the generative KDE classifier. The overall approach for the KDE-based classification was as follows (Figure 3A):

1. The dataset was split using the cross-validation strategy.
2. The training data was then split by label (driver/passenger).
3. For each class, we fit a generative model using the kernel density estimation method described in the previous section. This gave us the likelihoods $P(x \mid \text{driver})$ and $P(x \mid \text{passenger})$, respectively, for a particular data point x.
4. Next, the class priors $P(\text{driver})$ and $P(\text{passenger})$, given by the number of examples of each class, were calculated.
5. For a test data point x, the posterior probabilities were given by $P(\text{driver} \mid x) \propto P(x \mid \text{driver})\,P(\text{driver})$ and $P(\text{passenger} \mid x) \propto P(x \mid \text{passenger})\,P(\text{passenger})$. The label that maximized the posterior probability was the one assigned to x.

In contrast, both tree-based classifiers are discriminative. They are composed of a large collection of decision trees, where the final output is derived by combining every single tree's predictions through a majority voting scheme. The main difference between the two tree-based classifiers lies in how they select the splits or cut points used to split the individual nodes. Random Forest chooses an optimal split for each feature under consideration, whereas Extra Trees chooses it randomly. All the classification models were written using the predefined functions available in the scikit-learn (v. 0.22) [83] module.

Model Selection and Tuning

Repeated Cross-Validation Experiments

Owing to the relatively small sample size (5265 mutations) of the training set, we adopted a repeated 10-fold cross-validation approach to building our model. First, we split the dataset into ten equal subsets in a stratified fashion. Splitting the dataset in a stratified fashion maintains the same proportion of mutations in each class as observed in the original data. Nine of the ten subsets were combined into one training set (Figure 3A). In each training phase, we performed feature selection using the Extra Trees classifier, cross-validated grid search-based parameter tuning, training the classifiers using the best parameters, and
obtaining the corresponding prediction scores on the hold-out test set (Figure 3B). For a given window size, we experimented with a total of seven feature representations (one-hot encoding, Count Vectorizer with k-mer sizes of 2, 3, and 4, and TF-IDF Vectorizer with k-mer sizes of 2, 3, and 4) and three binary classifiers (Random Forests, Extra Trees, and Kernel Density Estimation). Overall, this gave 21 distinct feature-classifier pairs. We ran the 10-fold cross-validation experiments (Figure 3(A-B)) three times for each such pair, thereby obtaining 30 values for each classification metric: sensitivity, specificity, AUC, and MCC. The best overall median value, the 95% CI for each of the above metrics, and the corresponding feature-classifier pair were reported. To study the variation in classification performance with the addition of more nucleotides (an increase in window size), we repeated the Wilcoxon signed-rank test on the generated performance metrics for all 45 pairs of window sizes $(x, y)$, where $x < y$ and $x, y \in [1, 2, \ldots, 10]$. The ci() function from the gmodels package [84] in R was used to calculate the 95% CIs for the various classification metrics.

Derivation of the Binary Classification Model to Distinguish between Driver and Passenger Mutations

To derive the final machine learning model, all mutations overlapping between the training set (Brown et al.) and the validation set (Martelotto et al.) were discarded, and the classifier was retrained on the reduced training set (4549 mutations: 544 drivers and 4005 passengers). The set of 989 mutations published by Martelotto et al. [43] formed our independent test set.
Due to the inherent imbalance in the dataset, we implemented an undersampling technique known as Repeated Edited Nearest Neighbors [85] to downsize the majority class and consequently obtain a balanced dataset for subsequent training. Predictions were obtained using two separate feature sets: 1) only neighborhood features based on the raw nucleotide sequences (the neighborhood-only model) and 2) neighborhood features plus the descriptive genomic features (NBDriver). In addition to Random Forests, Extra Trees, and the KDE classifier, we also experimented with a fourth classifier, a linear kernel SVM, to obtain these predictions. Various combinations of these classifiers were implemented as ensemble models using the VotingClassifier() of the ensemble module in scikit-learn.

Feature Selection

We adopted an impurity-based feature selection technique using the Extra Trees classifier to derive a ranked list of the top predictive features for our analysis. For the repeated cross-validation experiments, the features that were within the top 30 percentile of the most important features were selected and subsequently used to train our models. However, for deriving NBDriver, we built several classification models based on the top n (n = 20, 30, 40, 50, 60) features and chose the one that gave the best overall classification performance.
The TF-IDF and CountVectorizer scores, used as features for our analysis, were computed using the feature_extraction module in scikit-learn. In both cases, a new vocabulary dictionary of all the k-mers was first learned from the training data using the fit_transform() routine, and the corresponding term-document matrix was returned. Using this vocabulary, the scores of the k-mers from the test data were obtained using the transform() routine and were subsequently used in our analysis.

Hyperparameter Tuning and Classifier Threshold Selection

Hyperparameter tuning was done using a cross-validation-based grid search over a parameter grid. The GridSearchCV() function from the model_selection module in scikit-learn was used for this purpose. To further fine-tune the classifiers, we experimented with various classification thresholds from 0 to 1 with a step size of 0.001 and chose the one that gave the best AUROC. For an imbalanced classification problem, using the default threshold of 0.5 is not a viable option and often results in the incorrect prediction of the minority class examples.

Performance Metrics

For the repeated cross-validation experiments, we assessed our classifiers' performance using four commonly used performance metrics: sensitivity, specificity, Matthews correlation coefficient (MCC), and area under the ROC curve (AUROC). The Matthews correlation coefficient is a balanced metric and is very useful in imbalanced classification problems. It is bounded between -1 and +1, with -1 representing perfect misclassification, 0 representing a prediction no better than random, and +1 representing ideal classification.
It is given by the following expression:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP stands for True Positives, TN for True Negatives, FP for False Positives, and FN for False Negatives. MCC is a more robust alternative to accuracy and F1-score, which can sometimes show overoptimistic classification performance for imbalanced data and were therefore not used for the analysis. After deriving the binary classifier, we used the additional classification performance metrics outlined by Martelotto et al. to compare our algorithm's performance with other state-of-the-art mutation effect prediction tools. They were Positive Predictive Value (PPV), Negative Predictive Value (NPV), and a composite score, defined as the sum of Sensitivity, Specificity, PPV, and NPV.

Comparison with Other Pan-Cancer Mutation Effect Predictors

Similar to the benchmarking study conducted by Martelotto et al., we compared the generated binary classifiers with nine pan-cancer mutation effect prediction tools: Mutation Taster [86], FATHMM (cancer) [19], Condel [26], FATHMM (missense) [19], PROVEAN (v1.1.3) [16], SIFT (Ensembl 66) [87], Polyphen2 [17], Mutation Assessor [25], and VEST [23], using the set of 989
literature-curated mutations. For each of these predictors, we used the prediction labels based on predefined score cutoffs published as part of the Martelotto et al. [43] study. Two new prediction algorithms, CHASMplus (pan-cancer) [24] and CanDrA+ (cancer-in-general) [27], were also added to the list, and their score cutoffs were decided in the following manner. For CHASMplus, which has no default threshold, we tested all possible thresholds between 0 and 1 with a step size of 0.01 and chose the threshold with the highest composite score. All mutations with predicted scores greater than this optimal threshold were labeled as drivers, and the rest as passengers. For CanDrA+, we used the default prediction categories [27]. Predictions for CHASMplus and CanDrA+ were obtained from the OpenCRAVAT web server [88] and the executable packages published by Mao et al. [27], respectively. Different mutation effect predictors were combined using the majority voting rule to create ensemble models with better predictive power. When comparing two algorithms, to derive the significance of the difference between any two classification metrics, we adopted the same strategy as Martelotto et al. Briefly, we derived the 95% CI for each of these classification metrics by repeated sampling with replacement with 1000 iterations.
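The resampling step described above can be sketched as follows, using MCC as the example metric. This is a minimal illustration rather than the authors' code: the percentile method for reading off the interval, the fixed seed, and the function names are assumptions, and labels are assumed to be coded as 1 (driver) and 0 (passenger).

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling mutations with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    values.sort()
    lower = values[int((alpha / 2) * n_iter)]
    upper = values[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


# Hypothetical labels and predictions for eight mutations.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
low, high = bootstrap_ci(truth, predicted, mcc)
```

The same `bootstrap_ci` helper accepts any metric with the signature `metric(y_true, y_pred)`, so sensitivity, specificity, PPV, or NPV could be passed in its place.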
If the generated CIs touched or did not overlap, the difference was considered significant ($p < 0.05$), based on the results of the analysis done by Ng et al. [89].

Results

We report a pan-cancer machine learning tool, NBDriver (Neighborhood Driver), which utilizes neighborhood sequences as features to classify missense mutations as either drivers or passengers. Our key results are three-fold. First, we use generative models to derive the distances between the underlying probability estimates of the two classes of mutations. Then, we build robust classification models using repeated cross-validation experiments to derive the median values of the metrics designed to estimate the classification performance. Finally, we demonstrate our models' ability to predict unseen coding mutations from independent test datasets derived from large mutational databases.

Neighborhood Sequences of Driver and Passenger Mutations Show Markedly Different Distributions

We estimated the underlying probability distributions of the driver and passenger neighborhood sequences using kernel density estimation. We computed the Jensen-Shannon (JS) distance metric to understand how "distinguishable" they are from one another. The JS metric is bounded between 0 (maximally similar) and 1 (maximally dissimilar). Table 3 shows the results of the KDE estimation experiments for various window sizes. We observed that, for the Brown et al. dataset [37], the maximum significant ($p < 0.05$) median JS distance between passenger and driver neighborhood distributions, calculated across 30 runs of bootstrapping experiments, was 0.275 (for a window size of 2), and the minimum was 0.211 (for window sizes
7-10). Figure 4 shows the variation in the JS distances between the original and the randomized KDE experiments for window sizes between 1 and 10. As evident from Figure 4, all window sizes except window size 1 had a significant JS distance value ($p < 0.05$). Out of the seven different feature representations, we reported the ones that gave the maximum median JS distance. From Table 3, we observed that a TF-IDF vectorizer with k-mer sizes 2, 3, and 4 was the preferred form of feature representation for six window sizes (1, 4, 6, 8, 9, and 10), whereas a count vectorizer with k-mer sizes 2 and 3 was chosen for three window sizes (3, 5, and 7). The only exception was a window size of 2, where the one-hot encoding-based feature representation gave the maximum median JS distance. These results indicated that the TF-IDF-based feature representation was the most efficient at delineating the differences in the distributions between the driver and passenger neighborhoods.

Repeated Cross-Validation Using Neighborhood Features Generates Robust Classification Models

The results of the repeated cross-validation experiments using only the neighborhood sequences as features are shown in Supplementary Table 3A. From these results, we observed that the best median sensitivity of 0.938 (95% CI 0.919-0.940) was obtained using features derived from a count vectorizer and subsequent training using a random forest classifier for window sizes 1, 5, 6, and 9.
However, the best median specificity of 0.807 (95% CI 0.791-0.811), AUC of 0.832 (95% CI 0.826-0.841), and MCC of 0.584 (95% CI 0.564-0.594) were obtained using a TF-IDF-based feature representation trained using a KDE classifier for a window size of 10. The variation in the classification performance for different window sizes obtained during the repeated cross-validation experiments using the initial training set of 5265 mutations is shown in Figure 5. This figure shows that, except for window sizes 1 and 2, a TF-IDF vectorizer gave the maximum median AUC, specificity, and MCC. However, for all window sizes, the maximum median sensitivities were obtained using the count vectorizer-based feature representation. Classification metrics such as AUC and MCC are used to measure the quality of binary classifications. Similar to our observations from the KDE estimation experiments (Table 3), the TF-IDF vectorizer performed consistently well in terms of both the overall AUC and MCC, indicating that this feature representation technique was the most efficient at separating the two classes of mutations. The variation in the classification performance with the increase in the window size is shown in Supplementary Table 3b. From this table, we observed that out of the 45 unique pairs of window sizes (Methods: Repeated Cross-Validation Experiments), 27 had a significant ($p < 0.05$; Wilcoxon signed-rank test) increase in specificity and AUC, while 31 had a significant ($p < 0.05$; Wilcoxon signed-rank test) increase in MCC with the addition of more nucleotides. However, for sensitivity, a significant increase was observed only when the window size was increased from 4 to 9 and from 7 to 9. These results indicated that adding more nucleotides to a
particular window does not always guarantee an increase in the classifier's performance in distinguishing between driver and passenger mutations.

Classification Models Give Performances Comparable with Other State-of-the-Art Mutation Effect Predictors

Using only the neighborhood nucleotide sequences as features, the best results on the independent test set [38] (Table 4a) were obtained using an Extra Trees classifier. This neighborhood-only model was trained on features extracted using the Count Vectorizer technique on a window size of 10. We trained NBDriver by combining the neighborhood features and the descriptive genomic features. Out of the various classifiers implemented, an ensemble model consisting of a linear kernel SVM and a KDE classifier gave the best results (Table 4a). Compared to the neighborhood-only model, there was a significant increase ($p < 0.05$) in accuracy (=0.891), sensitivity (=0.93), NPV (=0.608), composite score (=3.123), and MCC (=0.561). However, this was accompanied by a significant ($p < 0.05$) drop in specificity (=0.643). There was no significant change in PPV. A ranked list of the 50 features used to train NBDriver is shown in Supplementary Table 4. Out of those 50 features, 26 were neighborhood-based features, namely the TF-IDF scores of the overlapping 4-mers extracted from a window size of 10. The plot displaying the variation in the AUROC with various classification thresholds is shown in Figure 6. The best results were obtained using a threshold of 0.119. Consequently, all mutations with prediction scores above this threshold were classified as drivers, and the rest as passengers.
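The thresholding rule above amounts to a simple comparison, and the threshold search can be sketched as a scan over a grid with a step size of 0.001. The sketch below is illustrative only: the function names are hypothetical, labels are coded as 1 (driver) and 0 (passenger), and balanced accuracy is used as a stand-in selection criterion, which is an assumption and not the criterion reported in the study.

```python
def label_by_threshold(scores, threshold=0.119):
    """Scores strictly above the threshold are drivers (1); the rest are passengers (0)."""
    return [1 if score > threshold else 0 for score in scores]


def threshold_grid(step=0.001):
    """Candidate thresholds from 0 to 1 inclusive."""
    n_steps = round(1 / step)
    return [i / n_steps for i in range(n_steps + 1)]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (stand-in criterion, assumed here)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    sensitivity = tp / positives if positives else 0.0
    specificity = tn / negatives if negatives else 0.0
    return (sensitivity + specificity) / 2


def best_threshold(scores, y_true, metric=balanced_accuracy, step=0.001):
    """Return the first grid threshold that maximizes the chosen metric."""
    best_value, best_t = float("-inf"), 0.0
    for t in threshold_grid(step):
        value = metric(y_true, label_by_threshold(scores, t))
        if value > best_value:
            best_value, best_t = value, t
    return best_t


# Hypothetical prediction scores and true labels for four mutations.
example_scores = [0.05, 0.10, 0.30, 0.80]
example_labels = [0, 0, 1, 1]
chosen = best_threshold(example_scores, example_labels)
```

Any other metric with the signature `metric(y_true, y_pred)` can be passed to `best_threshold` in place of the stand-in.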
Overall, on this benchmarking dataset, NBDriver ranked fourth in terms of the composite score, fifth in terms of specificity, and second in NPV, PPV, sensitivity, and accuracy. By contrast, although the neighborhood-only model was the top-ranking tool in terms of specificity and PPV, it did not perform well in terms of the other metrics. Owing to NBDriver's superior performance, all subsequent external validations were performed using this model.

Voting Ensemble of Prediction Algorithms Gives Better Classification Performances

We also assessed the effect of combining multiple top-ranked single predictors into an ensemble model. We evaluated NBDriver's contribution to the overall ensemble by obtaining predictions without the tool. The top-performing ensemble, consisting of NBDriver, CHASMplus, FATHMM (cancer), Mutation Taster, and Condel, resulted in a composite score of 3.504, an accuracy of 0.945, and an NPV of 0.88, significantly higher ($p < 0.05$) than every single predictor evaluated in the study (Table 4b; Supplementary Table 5). The composite score and accuracy obtained using this ensemble were also the highest among all the different combinations of single predictors tested in this study (Supplementary Table 5). Removing NBDriver from the ensemble resulted in a significant decrease ($p < 0.05$) in the composite score, NPV, MCC, accuracy, and sensitivity. However, it was accompanied by a significant increase in specificity and no significant change in PPV for the smaller ensemble (Table 4b).
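The majority voting rule used to combine predictors can be sketched as below. This is a minimal illustration with hypothetical inputs: each predictor contributes one 0/1 label per mutation (1 for driver), and resolving ties in favor of the passenger class is an assumption, since tie handling is not described in the text.

```python
def majority_vote(predictions):
    """Combine per-predictor label lists into one ensemble label per mutation.

    `predictions` holds one list of 0/1 labels per predictor, all the same length.
    A mutation is called a driver only if more than half of the predictors agree.
    """
    n_predictors = len(predictions)
    return [
        1 if sum(votes) * 2 > n_predictors else 0
        for votes in zip(*predictions)
    ]


# Hypothetical labels from three predictors for three mutations.
tool_a = [1, 0, 1]
tool_b = [1, 1, 0]
tool_c = [0, 0, 1]
ensemble = majority_vote([tool_a, tool_b, tool_c])
```

Dropping one predictor from the input list gives the prediction of the smaller ensemble, which is how a single tool's contribution can be assessed.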
Another ensemble model consisting of NBDriver, Mutation Taster, and Condel gave results similar to the previous one (composite score=3.504; Table 4b; Supplementary Table 5). Compared to the previous ensemble (Table 4b), there was no significant difference in MCC, composite score, PPV, sensitivity, or accuracy. However, there was a significant increase in the NPV and a significant decrease in the specificity. The complete set of combinations of single predictors evaluated in this study is presented in Supplementary Table 5. From this table, we observed that the maximum sensitivity (=0.9941) and NPV (=0.9375) were obtained by the ensemble of Mutation Taster, FATHMM (cancer), and Condel, which did not include NBDriver. However, the maximum specificity (=0.8357) and PPV (=0.9711) were obtained using the ensemble of NBDriver, CHASMplus, Mutation Taster, and Condel.

Driver and Passenger Mutations' Features Used to Train NBDriver are Significantly Different

Our feature selection results illustrate the differences in the underlying biological processes governing driver and passenger mutations, similar to Mao et al. [27]. Using the training data used to build NBDriver, we found that driver mutations tend to occur on amino acid residues that have lower solvent accessibility and stiffer backbones, as denoted by the significantly lower (Wilcoxon test; p < .45 × 10^−10) 'PREDRSAE' probability measure (Figure 7A) and the significantly higher (Wilcoxon test; p < .12 × 10^−9) 'PredBFactorS' probability measure (Figure 7B), respectively. We also observed that a mutation is more likely to be a driver if it occurs in genomic regions that are evolutionarily conserved. The mean GERP score for driver mutations was significantly higher (Wilcoxon test; p < .22 × 10^−16) than that of passengers (Figure 7C).
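The "Wilcoxon test" comparisons above are between two unpaired groups (drivers and passengers), so the sketch below assumes the rank-sum (Mann-Whitney) form of the test. It uses a normal approximation without tie or continuity correction and invented toy values, so it would not reproduce the reported p-values; `rank_sum_test` is a hypothetical helper, and a statistics library would be the better choice in practice.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Mann-Whitney U test via the normal approximation (no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    # U counts the pairs in which the value from group_a exceeds the value from group_b;
    # equal pairs contribute one half.
    u_stat = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z_score = (u_stat - mean_u) / sd_u
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))  # two-sided tail probability
    return u_stat, z_score, p_value


# Toy conservation-like scores (hypothetical): drivers shifted towards higher values.
toy_driver_scores = [5.1, 6.0, 7.2, 8.3]
toy_passenger_scores = [1.0, 2.4, 3.3, 4.9]

u_stat, z_score, p_value = rank_sum_test(toy_driver_scores, toy_passenger_scores)
print(f"U = {u_stat}, z = {z_score:.3f}, p = {p_value:.4f}")
```

With only four values per group the normal approximation is crude; it is used here purely to show the mechanics of the comparison.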
Similarly, driver mutations were more common in genomic sites that had a significantly higher (Wilcoxon test; p < .33 × 10^−16) Positional Hidden Markov Model (HMM) conservation score (HMMPHC) than passengers (Figure 7D). We observed similar class-wise distributional differences among features indicative of protein domain knowledge. 'UniprotDOM_PostModEnz' denotes the presence or absence of a mutation at a site within an enzymatic domain responsible for post-translational modification (PTM). PTM-related mutations are often accountable for changes in protein function and alterations of regulatory pathways, eventually leading to carcinogenesis. 'UniprotREGIONS' is another binary feature that indicates whether a mutation occurred in an experimentally defined region of interest in the protein sequence, such as those associated with protein-protein interactions and regulation of biological processes. Our analysis showed that a considerable portion (31%) of driver mutations clustered around PTM sites, compared with around 0.4% of passengers (Figure 7E). Similarly, about 37% of driver mutations were located in protein domains that were experimentally defined as regions of interest, compared with around 11% of passengers (Figure 7F).

In our approach, the TF-IDF algorithm was used to weigh a k-mer and assign importance to it in the given set of neighborhood sequences.
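To make the k-mer weighting concrete, here is a minimal sketch of TF-IDF over overlapping k-mers. It uses the textbook definition (term frequency times the log of inverse document frequency), which is an assumption: library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default and would give different numbers. The sequences are short toy strings rather than real mutation neighborhoods, and the function names are hypothetical.

```python
import math
from collections import Counter


def overlapping_kmers(sequence, k=4):
    """Return all overlapping k-mers of a sequence, in order (stride of one nucleotide)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_tfidf(sequences, k=4):
    """Textbook TF-IDF per sequence: (count / total k-mers) * ln(N / sequences containing the k-mer)."""
    counts_per_sequence = [Counter(overlapping_kmers(seq, k)) for seq in sequences]
    n_sequences = len(sequences)
    # Iterating over a Counter yields each distinct k-mer once, so this counts sequences, not occurrences.
    document_frequency = Counter(
        kmer for counts in counts_per_sequence for kmer in counts
    )
    scores = []
    for counts in counts_per_sequence:
        total = sum(counts.values())
        scores.append({
            kmer: (count / total) * math.log(n_sequences / document_frequency[kmer])
            for kmer, count in counts.items()
        })
    return scores


# Toy neighborhood sequences (hypothetical).
toy_neighborhoods = ["ACGTA", "ACGTT", "TTTTT"]
toy_scores = kmer_tfidf(toy_neighborhoods, k=4)

for sequence, sequence_scores in zip(toy_neighborhoods, toy_scores):
    print(sequence, {kmer: round(score, 3) for kmer, score in sequence_scores.items()})
```

In this toy set, `ACGT` appears in two of the three sequences and therefore receives a lower weight than `CGTA`, which appears in only one.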
Also, a higher TF-IDF score indicates greater relevance of that k-mer. Our feature selection results indicated that, for the 26 neighborhood sequence-based features, the mean TF-IDF scores for drivers were significantly higher (Wilcoxon test; p < 0.05) than those of passengers (Figure 8). This result suggested that NBDriver's top neighborhood features are more specific to driver neighborhoods than to passenger neighborhoods.

Evaluation Using Previously Unseen Coding Mutation Data

To evaluate NBDriver's capability at identifying previously unseen driver mutations, we evaluated it using missense mutation data compiled from the following four databases.

Cancer Mutation Census

Based on the evidence criteria set forth by the Cancer Mutation Census (CMC) database, a mutation can be classified into tier 1, 2, or 3, with tier 1 mutations having the highest level of evidence of being a driver. From the list of missense mutations in the CMC not present in our training data, NBDriver correctly predicted all 19 tier 1, 25 out of 28 tier 2, and 179 out of 230 tier 3 mutations, achieving an overall accuracy of 81%. By comparison, the ensemble model consisting of NBDriver, Condel, and Mutation Taster correctly predicted all 19 tier 1, 27 out of 28 tier 2, and 214 out of 230 tier 3 mutations, achieving an overall accuracy of 94%. Upon further investigation, we found that NBDriver was highly successful in identifying hotspot mutations present in the CMC. Recurrent alterations at the same genomic site in cancer genes such as MET, MPL, FLT3, and KIT have been implicated in many different cancer types [39]–[43] (Supplementary Table 7a).

Cancer Genome Interpreter Database

Using pathogenic mutations compiled from various sources, we found that NBDriver correctly identified 1274 out of 1628 non-overlapping missense driver mutations, achieving an overall accuracy of 78%.
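The overall accuracies quoted above follow directly from the per-tier counts. The short check below simply re-derives them from the numbers in the text; the variable and function names are illustrative only.

```python
# Per-tier counts for the Cancer Mutation Census evaluation, as stated in the text.
cmc_total = {"tier 1": 19, "tier 2": 28, "tier 3": 230}
cmc_nbdriver_correct = {"tier 1": 19, "tier 2": 25, "tier 3": 179}
cmc_ensemble_correct = {"tier 1": 19, "tier 2": 27, "tier 3": 214}


def overall_accuracy(correct, total):
    """Fraction of correctly predicted mutations pooled over all tiers."""
    return sum(correct.values()) / sum(total.values())


cmc_nbdriver_accuracy = overall_accuracy(cmc_nbdriver_correct, cmc_total)  # 223 / 277
cmc_ensemble_accuracy = overall_accuracy(cmc_ensemble_correct, cmc_total)  # 260 / 277

# Cancer Genome Interpreter: pooled counts given in the text.
cgi_nbdriver_accuracy = 1274 / 1628

print(f"CMC, NBDriver: {cmc_nbdriver_accuracy:.1%}")
print(f"CMC, ensemble: {cmc_ensemble_accuracy:.1%}")
print(f"CGI, NBDriver: {cgi_nbdriver_accuracy:.1%}")
```

Because every mutation in these sets is a presumed driver, "accuracy" here is equivalent to the fraction of known drivers recovered.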
The model correctly identified all three mutations from the Cancer Biomarkers Database, 39 out of 47 mutations from the DoCM database, 23 out of 31 mutations from the Martelotto et al. study [38], and 1209 out of 1547 mutations from the OncoKB database. By comparison, the ensemble model comprising NBDriver, Condel, and Mutation Taster correctly predicted 1519 out of 1628 mutations, achieving an overall accuracy of 93%.

Recurrent Driver Mutations

Of the top 33 hotspot mutations identified as recurrently mutated in the study by Rheinbay et al. [44], NBDriver correctly identified 27 as drivers. However, Mutation Taster displayed superior performance by identifying all 33 mutations correctly. Except for KRAS, NBDriver correctly identified all mutations from the other four genes (NRAS, TP53, PIK3CA, and IDH1) as cancer drivers. The hotspot mutations in these four genes that were reported by Rheinbay et al. [44] and correctly identified as drivers by NBDriver have been implicated in many different cancers [45]–[48] (Supplementary Table 7a).

Rare Driver Mutations Found in Glioblastoma and Ovarian Cancer

Using the list of rare drivers reported by the developers of the driver prediction tool CanDrA [27], we evaluated NBDriver's ability to identify less frequent alterations in the cancer genome. Overall, NBDriver alone identified 29 out of 34 (85%) glioblastoma mutations and 20 out of 38 (53%) ovarian cancer mutations.
All these mutations belonged to eight known ovarian cancer (OVC)-related genes (ARID1A, CDK12, ERBB2, MLH1, MSH2, MSH6, PIK3R1, PMS2) and seven known glioblastoma (GBM)-related genes (ATM, EGFR, MDM2, NF1, PDGFRA, PIK3CA, ROS1). All eight OVC-related genes correctly identified as drivers by NBDriver have been implicated in ovarian cancer in multiple studies [49]–[53] (Supplementary Table 7b). The ensemble model made up of NBDriver, Condel, and Mutation Taster performed better than the single predictor, identifying 32 out of 34 (94%) glioblastoma mutations and 24 out of 38 (63%) ovarian cancer mutations.

Stratification of the Predicted Driver Genes Based on Literature Evidence

We combined the genes with at least one true positive missense driver mutation prediction from NBDriver into a catalog of 138 putative driver genes. We then compared our gene set against those already published in six landmark pan-cancer studies for driver gene identification. Bailey et al. [54] identified 299 driver genes from 9423 tumor exomes by combining the predictions from 26 different computational tools. Martincorena et al. [55] used the normalized ratio of non-synonymous to synonymous mutations (dN/dS model) to identify driver genes from 7664 tumors and reported a total of 180 putatively positively selected driver genes and 369 known cancer genes from three main sources: (1) 174 cancer genes from version 73 of the COSMIC database [6]; (2) 214 significantly mutated genes across 4742 tumors identified by Lawrence et al. [56] using the MutSigCV tool; and (3) 204 genes identified through a literature search. Two marker papers from TCGA [57], [58] identified 132 significantly mutated genes using the MutSigCV tool. Tamborero et al. [35] identified a list of 291 high-confidence drivers from 3205 tumor samples using a rule-based approach. Dietlein et al.
[59] modelled the nucleotide context around driver mutations and identified 460 driver genes on that basis. Apart from the aforementioned studies, we also report the overlap between our list of genes and two well-established cancer gene repositories: the Cancer Gene Census [6], [60] and the IntOGen database [61]. We identified 124 (=89%) of our predicted driver genes as canonical cancer genes present in the Cancer Gene Census. Among the remaining genes, six were catalogued as drivers in at least two of the pan-cancer studies or mutation databases mentioned above (Supplementary Table 6). A total of eight genes (CTLA4, IGF1R, PIK3CD, TGFBR1, RAD54L, SHOC2, CDKN2B, and XRCC2) were not identified in any of the landmark studies or databases and required further validation.

Discussion

Our investigation aimed to compare the raw neighborhood sequences of driver and passenger mutations and exploit any observed distributional differences to build robust classification models.
We showed that except for one window size (n=1), a significant difference in the distributions between the neighborhoods of driver and passenger mutations was consistently present in our cohort. Using TF-IDF and Count Vectorizer scores derived from the overlapping k​-mers, we trained a KDE-based generative classifier and two other tree-based classifiers. One crucial distinction between NBDriver and other methods is the inclusion of overlapping ​k​-mers extracted from the neighborhood of mutations as features for further analysis. NBDriver was trained using a small set (=50) of highly discriminative features, 52% of which were neighborhood scores. Using this model, we could accurately predict 89% of all the literature-curated mutations outlined in the Martelotto ​et al. study ​[38]​, 81% of the high confidence list of mutations recently published by the Cancer Mutation Census, 78% of all the actionable alterations reported in the Cancer Genome Interpreter, 82% of all the hotspot mutations reported from a pan-cancer genome analysis, 85% and 53% of rare driver mutations found in glioblastoma and ovarian cancer respectively. Ensemble models obtained by combining the predictions from other state-of-the-art mutation effect predictors with NBDriver performed significantly better than the individual predictors in all five validation datasets. These results underscore the importance of including neighborhood features to build mutation effect prediction algorithms. Although our methodʼs focus was to identify missense driver mutations from sequenced cancer genomes, the majority of the genes (130 out of 138) containing at least one predicted mutation belonged to the Cancer Gene Census or other large-scale driver gene discovery studies. The protein products of the eight remaining genes not flagged as drivers by any of the databases/studies had known functional roles in maintaining the cancer genomeʼs stability and promoting tumor development. 
The CTLA4 gene modulates the immune response by serving as a checkpoint for T-cell activation, essentially decreasing the T cells' ability to attack cancer cells. Immune checkpoint inhibitors, which are designed to "block" these checkpoints, have drastically changed the treatment outcomes for several cancers [62]. Transcriptomic profiling of blood samples drawn from cervical cancer patients identified IGF1R as a biomarker for increased risk of treatment failure [63]. Overexpression of the PIK3CD gene has been associated with cell proliferation in colon cancer and with poor prognosis among patients [64]. Multiple studies have indicated an association between polymorphisms observed in TGFBR1 and cancer susceptibility [65], [66]. Similarly, polymorphisms detected in RAD54L are a genetic marker associated with the development of meningeal tumours [67]. SHOC2 has been reported to be a regulator of the Ras signalling pathway and is associated with poor prognosis among breast cancer patients [68]. Similarly, the inactivation of the CDKN2B gene is responsible for the progression of pancreatic cancer [69]. With the help of massively parallel sequencing studies, rare mutations in the XRCC2 gene have been linked to increased breast cancer susceptibility [70].

Our study does have some limitations.
First, we used a representative dataset of driver and passenger mutations whose labels were not ​in silico ​predictions from other mutation effect prediction algorithms but derived from experimentally validated functional and transforming impacts from various sources. This resulted in a relatively small sample size for supervised classification. However, this approach also minimized the chances of inadvertently introducing false-positive mutations into the training set used to derive the driver and passenger neighborhoodsʼ class-wise density estimates or the machine learning models. Evidence ​[71] suggests that a sizeable proportion of mutations present in large mutational databases are mostly false positives, reflecting sequencing errors due to DNA damage. Moreover, NBDriver derived using this high confidence list of mutations performed reasonably well across all five independent validation sets and produced 138 driver genes with sufficient literature evidence suggesting that our initial choice of the training dataset was overall beneficial. Second, since missense mutations are the most abundant form of somatic alterations ​[72]​, our machine learning models were all trained using missense mutations only. However, in principle, our approach could be extended to other types of mutations as well. Additionally, during the external validation analysis, although NBDriver performed very well in terms of PPV (=0.941), the NPV (=0.608) was relatively low (Table 4a). To identify biologically relevant mutations for further functional validation, NPV is often overlooked as a classification metric. A high NPV allows us to exclude passenger mutations with greater confidence and reduces the number of driver mutations incorrectly labeled as passengers (false negatives). However, we observed that adding different combinations of multiple single predictors into ensemble models resulted in a significant improvement in the NPV (Table 4b). 
Our observations on the ensemble models' performances were similar to those made by Martelotto et al. [38]. Last, we trained our machine learning models using a combined dataset containing mutational effects determined from experimental assays not specific to any cancer type. Hence, all our models were pan-cancer based. Consequently, a cancer-type-specific analysis in the future would require lists of known driver and passenger mutations from specific tumor types.

Conclusion

In this study, we showed that there is a significant difference in the nucleotide contexts surrounding driver and passenger mutations obtained from sequenced cancer genomes. Using efficient feature representation, we generated robust classification models that gave comparable performances across five independent validation sets. The predicted true positive mutations were part of genes with experimental support, from multiple sources, for being functionally relevant.
Future experiments using a much larger sample size need to be performed to derive neighborhood-sequence-based classification scores for all possible missense mutations in the genome across several cancer types. This would be possible if future large-scale sequencing studies such as MSK-IMPACT [73], PCAWG [44], ICGC [7], and GENIE [74] produce a complete catalog of missense driver mutations with functional evidence in a cancer-type specific manner. This relatively novel strategy of utilizing the sequence neighborhoods for driver mutation identification can dramatically improve the annotation process's efficiency for unknown mutations.

Acknowledgements

This work was supported by Department of Biotechnology, Government of India (DBT) (BT/PR16710/BID/7/680/2016), IIT Madras, Initiative for Biological Systems Engineering (IBSE) and Robert Bosch Center for Data Science and Artificial Intelligence (RBC-DSAI).

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] M. R. Stratton, P. J. Campbell, and P. A. Futreal, “The cancer genome,” Nature, vol. 458, no. 7239, pp. 719–724, 2009.
[2] J. M. Samet, “Radon and lung cancer,” JNCI: Journal of the National Cancer Institute, vol. 81, no. 10, pp. 745–758, 1989.
[3] J. W. Drake, “Mutagenic mechanisms,” Annual Review of Genetics, vol. 3, no. 1, pp. 247–268, 1969.
[4] W. Zhu, S. Wu, and Y. A. Hannun, “Contributions of the Intrinsic Mutation Process to Cancer Mutation and Risk Burdens,” EBioMedicine, vol. 24, pp. 5–6, 2017.
[5] B. J. Raphael, J. R. Dobson, L. Oesper, and F. Vandin, “Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine,” Genome medicine, vol. 6, no. 1, pp. 1–17, 2014.
[6] S. A. Forbes et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic acids research, vol. 43, no. D1, pp. D805–D811, 2015.
[7] J.
Zhang et al., “International Cancer Genome Consortium Data Portal—a one-stop shop for cancer genomics data,” Database (Oxford), vol. 2011, Sep. 2011, doi:
10.1093/database/bar026.
[8] J. N. Weinstein et al., “The cancer genome atlas pan-cancer analysis project,” Nature genetics, vol. 45, no. 10, pp. 1113–1120, 2013.
[9] E. Cerami et al., “The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data,” Cancer Discov, vol. 2, no. 5, pp. 401–404, May 2012, doi: 10.1158/2159-8290.CD-12-0095.
[10] B. J. Ainscough et al., “DoCM: a database of curated mutations in cancer,” Nature methods, vol. 13, no. 10, pp. 806–807, 2016.
[11] L. A. Garraway, “Genomics-driven oncology: framework for an emerging paradigm,” Journal of Clinical Oncology, vol. 31, no. 15, pp. 1806–1814, 2013.
[12] M. S. Lawrence et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[13] N. D. Dees et al., “MuSiC: identifying mutational significance in cancer genomes,” Genome research, vol. 22, no. 8, pp. 1589–1598, 2012.
[14] C. H. Mermel, S. E. Schumacher, B. Hill, M. L. Meyerson, R. Beroukhim, and G. Getz, “GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers,” Genome biology, vol. 12, no. 4, pp. 1–14, 2011.
[15] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature protocols, vol. 4, no. 7, p. 1073, 2009.
[16] Y. Choi, G. E. Sims, S. Murphy, J. R. Miller, and A. P.
Chan, “Predicting the Functional Effect of Amino Acid Substitutions and Indels,” PLOS ONE, vol. 7, no. 10, p. e46688, Oct. 2012, doi: 10.1371/journal.pone.0046688.
[17] I. Adzhubei, D. M. Jordan, and S. R. Sunyaev, “Predicting functional effect of human missense mutations using PolyPhen-2,” Curr Protoc Hum Genet, vol. Chapter 7, p. Unit7.20, Jan. 2013, doi: 10.1002/0471142905.hg0720s76.
[18] H. Carter et al., “Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations,” Cancer Res., vol. 69, no. 16, pp. 6660–6667, Aug. 2009, doi: 10.1158/0008-5472.CAN-09-1133.
[19] H. A. Shihab, J. Gough, D. N. Cooper, I. N. M. Day, and T. R. Gaunt, “Predicting the functional consequences of cancer-associated amino acid substitutions,” Bioinformatics, vol. 29, no. 12, pp. 1504–1510, Jun. 2013, doi: 10.1093/bioinformatics/btt182.
[20] D. Chakravarty et al., “OncoKB: A Precision Oncology Knowledge Base,” JCO Precis Oncol, vol. 2017, Jul. 2017, doi: 10.1200/PO.17.00011.
[21] E. Cerami, E. Demir, N. Schultz, B. S. Taylor, and C. Sander, “Automated Network Analysis Identifies Core Pathways in Glioblastoma,” PLoS One, vol. 5, no. 2, Feb. 2010, doi: 10.1371/journal.pone.0008918.
[22] F. Vandin, E. Upfal, and B. J. Raphael, “Algorithms for detecting significantly mutated pathways in cancer,” Journal of Computational Biology, vol. 18, no. 3, pp. 507–522, 2011.
[23] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the Variant Effect Scoring Tool,” BMC Genomics, vol. 14, no. 3, p. S3, May 2013, doi: 10.1186/1471-2164-14-S3-S3.
[24] C. Tokheim and R. Karchin, “CHASMplus Reveals the Scope of Somatic Missense Mutations Driving Human Cancers,” Cell Systems, vol. 9, no. 1, pp. 9-23.e8, Jul. 2019, doi: 10.1016/j.cels.2019.05.005.
[25] B. Reva, Y. Antipin, and C.
Sander, “Predicting the functional impact of protein mutations: application to cancer genomics,” Nucleic Acids Res., vol. 39, no. 17, p. e118, Sep. 2011,
doi: 10.1093/nar/gkr407.
[26] A. Gonzalez-Perez, J. Deu-Pons, and N. Lopez-Bigas, “Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation,” Genome Medicine, vol. 4, no. 11, p. 89, Nov. 2012, doi: 10.1186/gm390.
[27] Y. Mao, H. Chen, H. Liang, F. Meric-Bernstam, G. B. Mills, and K.
Chen, “CanDrA: Cancer-Specific Driver Missense Mutation Annotation with Optimized Features,” ​PLoS One​, vol. 8, no. 10, Oct. 2013, doi: 10.1371/journal.pone.0077945. [28] P. C. Ng and S. Henikoff, “Predicting deleterious amino acid substitutions,” ​Genome research​, vol. 11, no. 5, pp. 863–874, 2001. [29] A. Hodgkinson and A. Eyre-Walker, “Variation in the mutation rate across mammalian genomes,” ​Nat. Rev. Genet.​, vol. 12, no. 11, pp. 756–766, Oct. 2011, doi: 10.1038/nrg3098. [30] T. Sjöblom ​et al.​, “The consensus coding sequences of human breast and colorectal cancers,” ​Science​, vol. 314, no. 5797, pp. 268–274, Oct. 2006, doi: 10.1126/science.1133427. [31] A. F. Rubin and P. Green, “Mutation patterns in cancer genomes,” ​PNAS​, vol. 106, no. 51, pp. 21766–21770, Dec. 2009, doi: 10.1073/pnas.0912499106. [32] V. Aggarwala and B. F. Voight, “An expanded sequence context model broadly explains variability in polymorphism levels across the human genome,” ​Nat. Genet.​, vol. 48, no. 4, pp. 349–355, Apr. 2016, doi: 10.1038/ng.3511. [33] Z. Zhao and E. Boerwinkle, “Neighboring-Nucleotide Effects on Single Nucleotide Polymorphisms: A Study of 2.6 Million Polymorphisms Across the Human Genome,” Genome Res​, vol. 12, no. 11, pp. 1679–1686, Nov. 2002, doi: 10.1101/gr.287302. [34] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, P. J. Campbell, and M. R. Stratton, “Deciphering Signatures of Mutational Processes Operative in Human Cancer,” ​Cell Rep​, vol. 3, no. 1, pp. 246–259, Jan. 2013, doi: 10.1016/j.celrep.2012.12.008. [35] D. Tamborero ​et al.​, “Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations,” ​Genome medicine​, vol. 10, no. 1, p. 25, 2018. [36] L. B. Alexandrov and M. R. Stratton, “Mutational signatures: the patterns of somatic mutations hidden in cancer genomes,” ​Curr. Opin. Genet. Dev.​, vol. 24, pp. 52–60, Feb. 2014, doi: 10.1016/j.gde.2013.11.014. [37] A.-L. Brown, M. Li, A. Goncearenco, and A. R. 
Panchenko, “Finding driver mutations in cancer: Elucidating the role of background mutational processes,” ​PLoS computational biology​, vol. 15, no. 4, p. e1006981, 2019. [38] L. G. Martelotto ​et al. ​, “Benchmarking mutation effect prediction algorithms using functionally validated cancer-related missense mutations,” ​Genome Biology​, vol. 15, no. 10, p. 484, Oct. 2014, doi: 10.1186/s13059-014-0484-1. [39] M. Jeffers ​et al. ​, “Activating mutations for the met tyrosine kinase receptor in human cancer,” ​Proceedings of the National Academy of Sciences​, vol. 94, no. 21, pp. 11445–11450, 1997. [40] T. S. Akpınar, V. S. Hançer, M. Nalçacı, and R. Diz-Küçükkaya, “MPL W515L/K Mutations in Chronic Myeloproliferative Neoplasms,” ​Turk J Haematol​, vol. 30, no. 1, pp. 8–12, Mar. 2013, doi: 10.4274/tjh.65807. [41] D. Liang ​et al.​, “FLT3-TKD mutation in childhood acute myeloid leukemia,” ​Leukemia​, vol. 17, no. 5, pp. 883–886, 2003. [42] J. A. Fletcher, C. D. Fletcher, B. P. Rubin, L. K. Ashman, C. L. Corless, and M. C. Heinrich, “KIT gene mutations in gastrointestinal stromal tumors: more complex than previously recognized?,” ​The American journal of pathology​, vol. 161, no. 2, p. 737, 2002. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. 
[43] S. Yui et al., “D816 mutation of the KIT gene in core binding factor acute myeloid leukemia is associated with poorer prognosis than other KIT gene mutations,” Annals of Hematology, vol. 96, no. 10, pp. 1641–1652, 2017.
[44] E. Rheinbay et al., “Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes,” bioRxiv, p. 237313, 2017.
[45] G. A. Hobbs, C. J. Der, and K. L. Rossman, “RAS isoforms and mutations in cancer at a glance,” Journal of Cell Science, vol. 129, no. 7, pp. 1287–1292, 2016.
[46] E. H. Baugh, H. Ke, A. J. Levine, R. A. Bonneau, and C. S. Chan, “Why are there hotspot mutations in the TP53 gene in human cancers?,” Cell Death & Differentiation, vol. 25, no. 1, pp. 154–160, 2018.
[47] D. A. Fruman and C. Rommel, “PI3K and Cancer: Lessons, Challenges and Opportunities,” Nat Rev Drug Discov, vol. 13, no. 2, pp. 140–156, Feb. 2014, doi: 10.1038/nrd4204.
[48] F. E. Bleeker et al., “IDH1 mutations at residue p.R132 (IDH1R132) occur frequently in high-grade gliomas but not in other solid tumors,” Human Mutation, vol. 30, no. 1, pp. 7–11, 2009.
[49] K. C. Wiegand et al., “ARID1A mutations in endometriosis-associated ovarian carcinomas,” New England Journal of Medicine, vol. 363, no. 16, pp. 1532–1543, 2010.
[50] T. Popova et al., “Ovarian cancers harboring inactivating mutations in CDK12 display a distinct genomic instability pattern characterized by large tandem duplications,” Cancer Research, vol. 76, no. 7, pp. 1882–1891, 2016.
[51] H. Luo, X. Xu, M. Ye, B. Sheng, and X. Zhu, “The prognostic value of HER2 in ovarian cancer: a meta-analysis of observational studies,” PLoS One, vol. 13, no. 1, p. e0191972, 2018.
[52] C. Zhao, S. Li, M. Zhao, H. Zhu, and X. Zhu, “Prognostic values of DNA mismatch repair genes in ovarian cancer patients treated with platinum-based chemotherapy,” Archives of Gynecology and Obstetrics, vol. 297, no. 1, pp. 153–159, 2018.
[53] A. J. Philp et al., “The phosphatidylinositol 3′-kinase p85α gene is an oncogene in human ovarian and colon tumors,” Cancer Research, vol. 61, no. 20, pp. 7426–7429, 2001.
[54] M. H. Bailey et al., “Comprehensive characterization of cancer driver genes and mutations,” Cell, vol. 173, no. 2, pp. 371–385, 2018.
[55] I. Martincorena et al., “Universal patterns of selection in cancer and somatic tissues,” Cell, vol. 171, no. 5, pp. 1029–1041, 2017.
[56] M. S. Lawrence et al., “Discovery and saturation analysis of cancer genes across 21 tumour types,” Nature, vol. 505, no. 7484, pp. 495–501, 2014.
[57] K. A. Hoadley et al., “Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer,” Cell, vol. 173, no. 2, pp. 291–304, 2018.
[58] Cancer Genome Atlas Research Network, “Comprehensive molecular profiling of lung adenocarcinoma,” Nature, vol. 511, no. 7511, pp. 543–550, 2014.
[59] F. Dietlein et al., “Identification of cancer driver genes based on nucleotide context,” Nature Genetics, vol. 52, no. 2, pp. 208–218, 2020.
[60] P. A. Futreal et al., “A census of human cancer genes,” Nature Reviews Cancer, vol. 4, no. 3, pp. 177–183, 2004.
[61] F. Martínez-Jiménez et al., “A compendium of mutational cancer driver genes,” Nature Reviews Cancer, vol. 20, no. 10, pp. 555–572, 2020.
[62] A. Rotte, “Combination of CTLA-4 and PD-1 blockers for treatment of cancer,” Journal of Experimental & Clinical Cancer Research, vol. 38, no. 1, p. 255, 2019.
[63] P. Moreno-Acosta et al., “IGF1R Gene Expression as a Predictive Marker of Response to Ionizing Radiation for Patients with Locally Advanced HPV16-positive Cervical Cancer,” Anticancer Res, vol. 32, no. 10, pp. 4319–4325, Oct. 2012.
[64] J. Chen et al., “PIK3CD induces cell growth and invasion by activating AKT/GSK-3β/β-catenin signaling in colorectal cancer,” Cancer Science, vol. 110, no. 3, pp. 997–1011, 2019.
[65] B. Pasche, M. J. Pennison, H. Jimenez, and M. Wang, “TGFBR1 and cancer susceptibility,” Transactions of the American Clinical and Climatological Association, vol. 125, p. 300, 2014.
[66] Y. Wang, X. Qi, F. Wang, J. Jiang, and Q. Guo, “Association between TGFBR1 polymorphisms and cancer risk: a meta-analysis of 35 case-control studies,” PLoS One, vol. 7, no. 8, p. e42899, 2012.
[67] P. E. Leone, M. Mendiola, J. Alonso, C. Paz-y-Miño, and A. Pestaña, “Implications of a RAD54L polymorphism (2290C/T) in human meningiomas as a risk factor and/or a genetic marker,” BMC Cancer, vol. 3, no. 1, p. 6, 2003.
[68] W. Geng, K. Dong, Q. Pu, Y. Lv, and H. Gao, “SHOC2 is associated with the survival of breast cancer cells and has prognostic value for patients with breast cancer,” Molecular Medicine Reports, vol. 21, no. 2, pp. 867–875, 2020.
[69] Q. Tu et al., “CDKN2B deletion is essential for pancreatic cancer development instead of unmeaningful co-deletion due to juxtaposition to CDKN2A,” Oncogene, vol. 37, no. 1, pp. 128–138, 2018.
[70] D. Park et al., “Rare mutations in XRCC2 increase the risk of breast cancer,” The American Journal of Human Genetics, vol. 90, no. 4, pp. 734–739, 2012.
[71] L. Chen, P. Liu, T. C. Evans, and L. M. Ettwiller, “DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification,” Science, vol. 355, no. 6326, p. 752, Feb. 2017, doi: 10.1126/science.aai8690.
[72] B. Vogelstein, N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, and K. W. Kinzler, “Cancer Genome Landscapes,” Science, vol. 339, no. 6127, p. 1546, Mar. 2013, doi: 10.1126/science.1235122.
[73] D. T. Cheng et al., “Comprehensive detection of germline variants by MSK-IMPACT, a clinical diagnostic platform for solid tumor molecular oncology and concurrent cancer predisposition testing,” BMC Medical Genomics, vol. 10, no. 1, p. 33, 2017.
[74] AACR Project GENIE Consortium, “AACR Project GENIE: powering precision medicine through an international consortium,” Cancer Discovery, vol. 7, no. 8, pp. 818–831, 2017.
[75] M. Olivier, R. Eeles, M. Hollstein, M. A. Khan, C. C. Harris, and P. Hainaut, “The IARC TP53 database: New online mutation analysis and recommendations to users,” Human Mutation, vol. 19, no. 6, pp. 607–614, 2002, doi: 10.1002/humu.10081.
[76] B. B. Campbell et al., “Comprehensive analysis of hypermutation in human cancer,” Cell, vol. 171, no. 5, pp. 1042–1056, 2017.
[77] P. K.-S. Ng et al., “Systematic functional annotation of somatic mutations in cancer,” Cancer Cell, vol. 33, no. 3, pp. 450–462, 2018.
[78] L. M. Starita et al., “Massively parallel functional analysis of BRCA1 RING domain variants,” Genetics, vol. 200, no. 2, pp. 413–422, 2015.
[79] K. Mahmood et al., “Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics,” Hum. Genomics, vol. 11, no. 1, p. 10, 2017, doi: 10.1186/s40246-017-0104-8.
[80] W. Zhou et al., “TransVar: a multilevel variant annotator for precision genomics,” Nature Methods, vol. 12, no. 11, Nov. 2015, doi: 10.1038/nmeth.3622.
[81] M. J. Landrum et al., “ClinVar: public archive of interpretations of clinically relevant variants,” Nucleic Acids Research, vol. 44, no. D1, pp. D862–D868, 2016.
[82] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, pp. e164–e164, 2010.
[83] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[84] G. R. Warnes, B. Bolker, T. Lumley, and R. C. Johnson, “gmodels: Various R programming tools for model fitting,” R package version, vol. 2, no. 3, 2015.
[85] D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Transactions on Systems, Man, and Cybernetics, no. 3, pp. 408–421, 1972.
[86] J. M. Schwarz, C. Rödelsperger, M. Schuelke, and D. Seelow, “MutationTaster evaluates disease-causing potential of sequence alterations,” Nature Methods, vol. 7, no. 8, Aug. 2010, doi: 10.1038/nmeth0810-575.
[87] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng, “SIFT web server: predicting effects of amino acid substitutions on proteins,” Nucleic Acids Research, vol. 40, no. W1, pp. W452–W457, Jul. 2012, doi: 10.1093/nar/gks539.
[88] K. A. Pagel et al., “Integrated informatics analysis of cancer-related variants,” JCO Clinical Cancer Informatics, vol. 4, pp. 310–317, 2020.
[89] C. K. Y. Ng et al., “Predictive Performance of Microarray Gene Signatures: Impact of Tumor Heterogeneity and Multiple Mechanisms of Drug Resistance,” Cancer Res, vol. 74, no. 11, pp. 2946–2961, Jun. 2014, doi: 10.1158/0008-5472.CAN-13-3375.
Table 1: Summary of datasets used in this study

| Type | Study/Database Name | Description | Sample size |
|---|---|---|---|
| Training | Brown et al. | Missense mutations from 58 cancer genes generated from experimental assays | 5265 mutations (Driver: 1134; Passenger: 4131) |
| Validation | Martelotto et al. | A literature-curated list of mutations from 15 cancer genes used to benchmark 15 mutation-effect prediction algorithms | 989 mutations (Driver: 849; Passenger: 140) |
| Validation | Catalog of Validated Oncogenic Mutations | High confidence pathogenic missense variants compiled from several sources | 1628 driver mutations |
| Validation | Rheinbay et al. | Recurrent single point driver mutations in the coding region compiled from the Pan-Cancer Analysis of Whole Genomes Consortium | 33 driver mutations |
| Validation | Mao et al. | Rare driver mutations from GBM and OVC cancer types | GBM: 34 driver mutations; OVC: 38 driver mutations |
| Validation | Cancer Mutation Census (COSMIC v92) | COSMIC mutation data categorized into different functional classes both through manual curation and computational predictions | 277 driver mutations |

Table 2: Number of one-hot encoded features and possible k-mers for a given window size. The size of the vocabulary (or N) is given in brackets. The last three columns give the number of k-mers possible for a given k-mer size.

| Window size | Number of one-hot encoded features | k=2 (N=16) | k=3 (N=64) | k=4 (N=256) |
|---|---|---|---|---|
| w=1 | 8 | 2 | 1 | 0 |
| w=2 | 16 | 4 | 3 | 2 |
| w=3 | 24 | 6 | 5 | 4 |
| w=4 | 32 | 8 | 7 | 6 |
| w=5 | 40 | 10 | 9 | 8 |
| w=6 | 48 | 12 | 11 | 10 |
| w=7 | 56 | 14 | 13 | 12 |
| w=8 | 64 | 16 | 15 | 14 |
| w=9 | 72 | 18 | 17 | 16 |
| w=10 | 80 | 20 | 19 | 18 |
; https://doi.org/10.1101/2021.02.09.430460doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430460 http://creativecommons.org/licenses/by-nc/4.0/ 27 Table 3: Median JS distances for both the original and randomized experiments for different window sizes Window Size Feature Type Median JS distance (original) Median JS distance (randomized) p​-value 1 TF (​k​=2) 0.345 0.34 Not significant 2 OHE 0.275 0.221 <0.05 3 CV (​k​=2) 0.219 0.170 <0.05 4 TF (​k​=3) 0.214 0.167 <0.05 5 CV (​k​=3) 0.211 0.166 <0.05 6 TF (​k​=4) 0.210 0.166 <0.05 7 CV (​k​=2) 0.211 0.165 <0.05 8 TF (​k​=3) 0.211 0.164 <0.05 9 TF ​(k​=3) 0.211 0.166 <0.05 10 TF (​k​=4) 0.211 0.165 <0.05 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430460doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430460 http://creativecommons.org/licenses/by-nc/4.0/ 28 Table 4a: Comparison of the generated binary classifiers with other mutation effect prediction algorithms using the benchmarking dataset by Martelotto ​et al. 
Algorithm | Accuracy | Sensitivity | Specificity | PPV | NPV | CS | MCC
Mutation Taster | 0.8857 | 0.9081 | 0.75 | 0.9566 | 0.5738 | 3.1885 | 0.590
FATHMM (Cancer) | 0.91 | 0.9788 | 0.4929 | 0.9213 | 0.7931 | 3.1861 | 0.580
CHASMplus (Pancancer) | 0.85 | 0.852 | 0.85 | 0.972 | 0.486 | 3.16 | 0.570
NBDriver | 0.891 | 0.931 | 0.643 | 0.941 | 0.608 | 3.123 | 0.561
Neighborhood-only model | 0.85 | 0.629 | 0.907 | 0.9744 | 0.285 | 2.7954 | 0.370
Condel | 0.8584 | 0.9258 | 0.45 | 0.9108 | 0.5 | 2.7866 | 0.392
FATHMM (missense) | 0.8251 | 0.8775 | 0.5071 | 0.9152 | 0.4057 | 2.7055 | 0.351
PROVEAN | 0.7371 | 0.7444 | 0.6929 | 0.9363 | 0.3089 | 2.6825 | 0.327
SIFT | 0.8099 | 0.861 | 0.5 | 0.9126 | 0.3723 | 2.6459 | 0.32
Polyphen-2 | 0.7978 | 0.8422 | 0.5286 | 0.9155 | 0.3558 | 2.6421 | 0.317
Mutation Assessor | 0.747 | 0.7665 | 0.6286 | 0.9259 | 0.3077 | 2.6287 | 0.3
VEST | 0.7503 | 0.8269 | 0.2857 | 0.8753 | 0.2139 | 2.2018 | 0.1
CanDrAplus (Cancer-in-general) | 0.592 | 0.857 | 0 | 0.99 | 0 | 1.847 | -0.03

Table 4b: Evaluating the contribution of NBDriver to the top performing ensemble

Algorithm | Accuracy | Sensitivity | Specificity | PPV | NPV | CS | MCC
NBDriver + CHASMplus + FATHMM (cancer) + Mutation Taster + Condel | 0.945 | 0.985 | 0.689 | 0.95 | 0.88 | 3.504 | 0.746
CHASMplus + FATHMM (cancer) + Mutation Taster + Condel | 0.921 | 0.942 | 0.771 | 0.96 | 0.71 | 3.384 | 0.691
A smaller ensemble that gave no significant change in composite score (CS) and MCC compared to the previous ensemble (first row of Table 4b):
NBDriver + Mutation Taster + Condel | 0.942 | 0.99 | 0.65 | 0.945 | 0.919 | 3.504 | 0.745

Figure Legends

Figure 1: A diagram of the features derived from the neighborhood nucleotide sequences of the point mutations, shown for an arbitrary window size of 4. The mutated position is represented as a triplet (Chromosome: Position: Substitution Type). (I) The original sequence, with the mutated nucleotide (ch17:109889:G>T) in bold. (II) One-hot encoding was used to derive the 4-bit binary one-hot encoded vector for each nucleotide. (III) Overlapping k-mers of sizes 2, 3 and 4. In this case, the neighborhood features also include the wildtype nucleotide at the mutated position. The overlapping k-mers were encoded into a numerical format using the count vectorizer and the TF-IDF vectorizer, and the resulting word matrix was derived. The samples (or individual neighborhoods) are represented as rows and the k-mers are represented as columns. For both types of feature representation, the chromosome number and the substitution type (A>T, G>C, etc.) were included as additional features.

Figure 2: The workflow of one run of the kernel density estimation experiment. All 5265 mutations from the Brown et al. study were used to derive the estimates. (A) First, an equal number of driver and passenger mutations were sampled with replacement. (B) The "bandwidth" hyperparameter was tuned using a 5-fold cross-validation approach, and the resulting tuned hyperparameter was used to estimate the densities. (C) The kernel density estimates for the driver and passenger neighborhoods were obtained separately, and the distance between them was calculated using the Jensen-Shannon (JS) distance. The JS distance is used to quantify how "distinguishable" two probability distributions are from each other.
It is bounded between 0 and 1, where 0 means that the two probability distributions are identical and 1 means that they are maximally different. (D) The bootstrapping experiment used to compute the significance of the density estimates calculated in (C). First, twice the number of driver or passenger mutations sampled in (A) were drawn at random irrespective of their labels, and the sampled data were then randomly split into driver and passenger labels. (E) Hyperparameter tuning and density estimation were performed as in (B). (F) The bootstrapped JS distance between the driver and passenger neighborhoods was derived. All six steps (A-F) of the density estimation experiments were repeated 30 times for all possible window sizes between 1 and 10 and seven different feature representations. The significance of the difference between the medians of the original and the bootstrapped JS distances was then reported.

Figure 3: The workflow of one run of the 10-fold cross-validation experiments. (A) In the first step, the entire dataset was split into ten equal parts. Nine of the ten subsets were combined into one training set, and one part was left as the test set. (B) Seven different feature representations [OHE, Count Vectorizer (k=2,3,4) and TF-IDF Vectorizer (k=2,3,4)] were considered for further analysis. After feature selection using a tree-based classifier, hyperparameter tuning was performed for three classifiers, and the corresponding models were derived.
Finally, each classifier was validated on the test set, and the corresponding performance metrics were reported.

Figure 4: Variation in JS distances between the estimated densities for every window size between 1 and 10. All 5265 mutations from the original study were used here. Two types of boxplots, one for the original and another for the randomized experiments, are shown along with the p-values, which approximate the probability that the original median distance can be obtained by chance. Except for window size 1, all window sizes had a significant (** p < 0.05) difference between the original and the randomized JS distances.

Figure 5: The variation in classification performance with window size, obtained during the repeated cross-validation experiments using the initial training set of 5265 mutations. For each window size, the feature representation among CV (CountVectorizer), TF (TF-IDF Vectorizer) and OHE (One-hot encoding) that gave the best performance in terms of (A) Sensitivity, (B) Specificity, (C) AUC and (D) MCC is displayed.

Figure 6: Variation in AUROC with the classification threshold, obtained while deriving NBDriver. NBDriver was trained on a reduced training set of 4549 mutations after removing all mutations that overlap between the original study and Martelotto et al. For an imbalanced classification problem, using the default threshold of 0.5 is often not advisable. In our case, the best AUROC was obtained using a threshold of 0.119. Consequently, all mutations with prediction scores greater than this threshold were classified as drivers and all others as passengers.

Figure 7: Differences in the distribution of features between driver and passenger mutations observed from the training data used to derive NBDriver. (A) PREDRSAE (Predicted Residue Solvent Accessibility - Exposed) gives the probability of the wild type residue being exposed. The plot shows that the probability of driver mutations occurring in exposed residues is significantly lower (Wilcoxon test; P=5.4E-10) than that of passengers. (B) PredBFactorS (High Predicted Bfactor) gives the probability that the wild type residue backbone is stiff. The plot shows that the probability of driver mutations occurring in residues with stiff backbones is significantly higher (Wilcoxon test; P=2.1E-09) than that of passengers. (C) GERP conservation scores give the evolutionary conservation scores for the specific sites where mutations have occurred. The plot shows that driver mutations occur in sites with GERP scores that are significantly higher (Wilcoxon test; P<2.2E-16) than those of passenger mutations. (D) HMMPHC (Positional Hidden Markov Model (HMM) conservation score) is a measure calculated on the basis of the degree of conservation of the residue, the mutation and the most probable amino acid. The plot shows that driver mutations tend to occur in residues with HMMPHC scores significantly higher (Wilcoxon test; P=3.3E-16) than those of passenger mutations. (E) UniprotDOM_PostModEnz is a feature based on protein domain knowledge that indicates whether a site in an enzymatic domain is responsible for any kind of post-translational modification (PTM). 'Presence' indicates that the mutation occurs in a site responsible for PTM, and 'Absence' that it does not. The plot shows that more driver mutations occur in PTM-associated sites as compared to passengers. (F) UniprotREGIONS is a binary variable that indicates whether a mutation occurs in a region of interest in the protein sequence. 'Presence' indicates that the mutation occurs in a region of interest, and 'Absence' that it does not. The plot shows that more driver mutations cluster in regions of interest in the protein sequence as compared to passengers, thereby making them mechanistically influential for the progression of the disease.

Figure 8: Class-wise variation in the mean TF-IDF scores for the 26 neighborhood-sequence features used to train NBDriver. The x-axis represents the 4-mers used in the analysis, and the y-axis represents the mean TF-IDF scores. The plot shows that the mean TF-IDF scores are consistently higher for drivers than for passengers. Since a higher TF-IDF score indicates the relevance or importance of a particular k-mer, we can conclude that the 4-mers used to derive NBDriver are more specific to the driver neighborhoods than to the passenger neighborhoods.

10_1101-2021_02_09_430550 ---- scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

Dongyuan Song 1,†, Kexin Aileen Li 2,†, Zachary Hemminger 3,4, Roy Wollman 3,4,5, and Jingyi Jessica Li 2,6,7,∗

Abstract

Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells.
While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. A natural question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity, and extra (e.g., spatial) information; however, they typically can only measure up to a few hundred genes. Another challenging question is therefore how to select genes for targeted gene profiling based on existing scRNA-seq data. Here we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and cell-type annotation on targeted gene profiling data.

1 Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
2 Department of Statistics, University of California, Los Angeles, CA 90095-1554
3 Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095
4 Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239
5 Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095-1569
6 Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
7 Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
† These authors contributed equally to this work.
∗ To whom correspondence should be addressed. Contact: jli@stat.ucla.edu

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430550; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1 Introduction

The recent development of single-cell RNA sequencing (scRNA-seq) technologies provides unprecedented opportunities to decipher transcriptome heterogeneity among individual cells [1–3]. A typical scRNA-seq dataset contains thousands to tens of thousands of genes; however, a subset of genes, which we call informative genes, is usually sufficient for representing the underlying biological variations of cells in the dataset for two reasons. First, variations of many genes are not related to the biological variations of interest. For instance, fluctuations in the expression levels of housekeeping genes are irrelevant to cell types [4, 5].
Second, many genes have strongly correlated expression levels, suggesting that one gene may represent a group of genes without much loss of information [6]. Therefore, for scRNA-seq data analysis, informative gene selection has three advantages: (1) enhancing biological signals by removing unwanted technical variations, (2) improving the interpretability of analysis results by focusing on informative genes, and (3) reducing the number of genes to save computational resources.

Besides scRNA-seq data analysis, informative gene selection is also crucial for designing single-cell targeted gene profiling experiments, which we define to include all technologies that measure the expression levels of only a specific set of genes in individual cells. Unlike scRNA-seq, targeted gene profiling requires a limited number (often no more than hundreds) of genes to be specified before sequencing. Examples of targeted gene profiling include spatial technologies (e.g., smFISH [7] and MERFISH [8]) and non-spatial technologies (e.g., BART-Seq [9], HyPR-seq [10] and 10x-Genomics Targeted Gene Expression). Compared with scRNA-seq, targeted gene profiling technologies have advantages such as capturing spatial information (by smFISH and MERFISH), having a lower cost per cell (by BART-Seq), and exhibiting a higher sensitivity for detecting lowly expressed genes (by HyPR-seq). However, it remains an open and challenging question how to optimize the gene selection for targeted gene profiling under a gene number limitation.

Given the importance of informative gene selection, researchers have developed many gene selection methods for scRNA-seq data. Most existing methods select genes based on the relationship between per-gene expression means and per-gene expression variances (with the mean and variance of each gene calculated across cells).
Popular examples include variance stabilization transformation (vst) [11] and mean-variance plot (mvp) in the R package Seurat [12], as well as modelGeneVar in the R package scran [13]. These methods select highly variable genes that have large expression variances in relation to their expression means. Other methods use various metrics of gene importance instead of the per-gene expression variance. For example, M3Drop selects the genes that have zero expression levels in many cells [14]; GiniClust selects the genes with large Gini indices of expression levels [15]; SCMarker selects the genes whose expression levels are bi/multi-modally distributed and that are co-expressed or mutually exclusively expressed with some other genes [16]. A common limitation of these existing methods is that they are all designed to select a relatively large number of genes; thus, their performance in selecting a small number of genes remains unclear. For instance, in Seurat, the default gene number is 2000; SCMarker selects 700-900 genes in its exemplar applications [16]. All these gene numbers are much greater than 200, the maximum gene number allowed by multiple targeted gene profiling technologies. Therefore, existing gene selection methods may not be suitable for selecting genes for targeted gene profiling. Another drawback of these methods is that their selected genes lack functional interpretability; that is, their selected genes are not categorized as functional gene groups.
In addition to these gene selection methods, linear dimensionality reduction methods, such as principal component analysis (PCA) and non-negative matrix factorization (NMF), can also be used for gene selection. Specifically, genes can be selected based on their contributions to the projected low dimensions found by PCA or NMF [17–19]. Although many variants of PCA and NMF algorithms have been developed for scRNA-seq data analysis, they are not designed for gene selection [20–26].

Here we propose an unsupervised method, scPNMF, to simultaneously select informative genes and project scRNA-seq data onto an interpretable low-dimensional space. Leveraging the Projective Non-negative Matrix Factorization (PNMF) algorithm [27], scPNMF combines the advantages of PCA and NMF by outputting a non-negative sparse weight matrix that can project cells in a high-dimensional scRNA-seq dataset onto a low-dimensional space. Unlike the weight matrix (a.k.a. loading matrix) found by PCA, the non-negative sparse weight matrix output by scPNMF consists of bases that each correspond to a group of co-expressed genes. Compared with the original PNMF, a unique feature of scPNMF is basis selection: scPNMF uses correlation screening and multimodality testing to remove the bases that cannot reveal potential cell clusters in the input scRNA-seq dataset. There are two functionalities of scPNMF: (1) given a pre-specified gene number and a scRNA-seq dataset, scPNMF selects informative genes based on its weight matrix; (2) given a targeted gene profiling dataset containing the informative genes, scPNMF projects this dataset onto the same low-dimensional space of a reference scRNA-seq dataset containing cell type labels, thus enabling cell type annotation on the targeted gene profiling dataset. Comprehensive benchmarking shows that scPNMF outperforms existing gene selection methods in two aspects. First, the informative genes selected by scPNMF lead to the most accurate cell clustering.
Second, the informative genes and weight matrix of scPNMF lead to the best cell type prediction accuracy for targeted gene profiling data. Therefore, scPNMF is a powerful gene selection method that can guide the experimental design and data analysis of single-cell targeted gene profiling.

2 Methods

The core of scPNMF is to learn a low-dimensional embedding of cells so that the bases of the low-dimensional space correspond to sparse and mutually exclusive gene groups, and that genes in each group are co-expressed and thus functionally related. Fig. 1 illustrates the workflow of scPNMF. The input of scPNMF is a log-transformed gene-by-cell count matrix measured by scRNA-seq. There are two main steps in scPNMF: (I) it learns a low-dimensional sparse weight matrix by PNMF; (II) it selects bases in the weight matrix based on functional annotations (optional), correlation screening, and multimodality testing to remove uninformative bases that cannot distinguish cell types.

[Figure 1 image. Panels: Step I: PNMF (weight matrix W and score matrix S = W^T X); Step II: Basis Selection (1. functional annotations (optional); 2. Pearson correlation with cell library size; 3. multimodality test); M-truncation (order genes by maximum weight and keep the first M genes, where M is the user-defined gene number); Applications (informative gene selection for clustering and visualization; projection of new data onto the reference data space for cell type prediction).]

Figure 1: An overview of scPNMF. Taking a log-transformed gene-by-cell count matrix as the input, scPNMF first learns a low-dimensional sparse weight matrix W and a low-dimensional cell embedding matrix S. Second, it removes the bases irrelevant to cell type variations by examining the bases' functional annotations (optional), Pearson correlations with cell library sizes, and multimodality. Given a user-defined gene number M, scPNMF performs M-truncation to facilitate two main applications: (1) selecting the desired number of informative genes; (2) projecting new targeted gene profiling data onto the low-dimensional space defined by reference scRNA-seq data. The details are in the "Methods" section.
The output of scPNMF includes (1) the selected weight matrix, a sparse and mutually exclusive encoding of genes as new, low dimensions, and (2) the score matrix containing embeddings of input cells in the low dimensions. The selected weight matrix has two main applications: extracting informative genes for downstream analyses, such as cell clustering and new marker gene identification, and projecting new targeted gene profiling data for data integration and cell type annotation.

2.1 scPNMF step I: PNMF

In this section, we review the PNMF algorithm [27, 28] as the foundation of scPNMF. We first compare the formulation of PNMF with that of principal component analysis (PCA) and non-negative matrix factorization (NMF), and we show that PNMF has the advantages of both PCA and NMF so that it can be a useful tool for scRNA-seq data analysis. Next, we introduce our PNMF implementation.

Given a log-transformed count matrix $\mathbf{X} \in \mathbb{R}^{p \times n}_{\geq 0}$, whose $p$ rows correspond to genes and whose $n$ columns represent cells, and a positive integer $K \leq p$, PNMF aims to find a $K$-dimensional space, whose dimensions correspond to non-negative, sparse and mutually exclusive linear combinations of the $p$ genes, so that projecting the $n$ cells onto the $K$-dimensional space does not cause much information loss (i.e., projecting the $K$-dimensional embeddings of the $n$ cells back to the original $p$-dimensional space can largely restore the original $n$ cells). PNMF tackles this task by solving the optimization problem:

$$\min_{\mathbf{W} \in \mathbb{R}^{p \times K}_{\geq 0}} \left\| \mathbf{X} - \mathbf{W}\mathbf{W}^{T}\mathbf{X} \right\| , \tag{2.1}$$

where $\| \cdot \|$ denotes the Frobenius matrix norm. The solution $\mathbf{W}$ is referred to as a weight matrix. Each column of $\mathbf{W}$ is a basis, whose $p$ entries are the weights of the $p$ genes. PNMF requires all weights to be non-negative, leading to a sparse $\mathbf{W}$ with most weights as zeros. PCA is similar to PNMF but does not require all weights to be non-negative.
We can write the optimization problem of PCA as

$$\min_{W \in \mathbb{R}^{p \times K},\ W^\top W = I} \lVert X - W W^\top X \rVert , \tag{2.2}$$

whose solution W is also a weight matrix but not sparse, and W is often referred to as the loading matrix. A common property of PNMF and PCA is that the transpose of their weight matrix, $W^\top \in \mathbb{R}^{K \times p}$, can be used to project a new cell with p gene measurements, $x \in \mathbb{R}^{p}$, onto the K-dimensional space as $W^\top x$.

In contrast to PNMF and PCA, NMF finds two non-negative matrices W and H so that their product approximates the original matrix X. NMF solves the optimization problem:

$$\min_{W \in \mathbb{R}^{p \times K}_{\ge 0},\ H \in \mathbb{R}^{K \times n}_{\ge 0}} \lVert X - W H \rVert , \tag{2.3}$$

whose solution W still has K columns representing bases, and H has n columns as K-dimensional embeddings of the n cells. Due to the non-negative constraint on W and H, W is a sparse matrix [29]. However, the transpose $W^\top$ cannot be used as a projection matrix from the original p-dimensional space to a K-dimensional space. The reason is that, if $W^\top$ were a projection matrix, then by the definition of H we would have $W^\top X = H$, which would convert the objective function (2.3) of NMF to the objective function (2.1) of PNMF. In other words, PNMF is a constrained version of NMF that requires $W^\top$ to be a projection matrix. Hence, PNMF inherits the property of NMF by having non-negative, sparse bases that are mostly mutually exclusive (i.e., different bases correspond to different gene groups).
Moreover, based on the similarities of the objective functions of PNMF (2.1) and PCA (2.2), we can see that PNMF also resembles PCA by finding a weight matrix whose transpose can serve as a projection matrix and whose bases are largely orthogonal to each other. Table 1 summarizes the properties of PNMF, PCA, and NMF.

Table 1: Comparison of the properties of PNMF, PCA and NMF

| Method | Optimization problem | Non-negativity | Sparsity | Mutual exclusiveness | New data projection |
|---|---|---|---|---|---|
| PNMF | $\min_{W} \lVert X - W W^\top X \rVert$ s.t. $W \ge 0$ | Yes | Very high | Very high | Yes |
| PCA | $\min_{W} \lVert X - W W^\top X \rVert$ s.t. $W^\top W = I$ | No | Low | Low | Yes |
| NMF | $\min_{W, H} \lVert X - W H \rVert$ s.t. $W, H \ge 0$ | Yes | High | High | No |

In the context of scRNA-seq data analysis, the above advantages of PNMF lead to an interpretable and useful weight matrix W. First, the high sparsity of W makes each basis (column) depend on only a small set of genes, which has been defined as a meta-gene for NMF [30]. Second, the mutual exclusiveness of W makes different bases correspond to different gene sets, easing the interpretation of bases as meta-genes or functional units. Third, the projection matrix $W^\top$ allows the alignment of new data to reference data, thus facilitating cell type annotation on the new data.

Algorithm 1 summarizes the key steps of PNMF implementation in scPNMF. Our implementation mainly follows the two papers that proposed the PNMF algorithm [27, 28], but we change the initialization of W to the weight matrix found by PCA, $W_{\mathrm{PCA}}$, with the absolute value taken on every entry. Our initialization is motivated by the desired orthogonality of bases (i.e., columns of W). With the weight matrix $W \in \mathbb{R}^{p \times K}_{\ge 0}$ learned by PNMF, we obtain the score matrix $S = W^\top X \in \mathbb{R}^{K \times n}_{\ge 0}$, whose K rows correspond to the bases and whose n columns represent the cells.
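To make the iteration summarized in Algorithm 1 concrete, the following is a minimal pure-Python sketch of the multiplicative update and of the score matrix $S = W^\top X$; it is not the code of the scPNMF package. Several things here are assumptions: the helper names (`matmul`, `transpose`, `spectral_norm`, `pnmf`) and the toy matrices are hypothetical, a fixed number of iterations stands in for a convergence check, $\lVert W \rVert_2$ is read as the spectral norm and estimated by power iteration, and a hand-picked positive matrix `W0` stands in for the abs($W_{\mathrm{PCA}}$) initialization.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def spectral_norm(W, n_iter=100):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = transpose(W)
    v = [[1.0] for _ in W[0]]  # K x 1 start vector
    for _ in range(n_iter):
        v = matmul(Wt, matmul(W, v))
        scale = math.sqrt(sum(x[0] ** 2 for x in v))
        v = [[x[0] / scale] for x in v]
    return math.sqrt(sum(x[0] ** 2 for x in matmul(W, v)))

def pnmf(X, W0, n_iter=100):
    """Run a fixed number of multiplicative updates for objective (2.1)."""
    W = [row[:] for row in W0]
    XXt = matmul(X, transpose(X))  # p x p
    for _ in range(n_iter):
        XXtW = matmul(XXt, W)                          # X X^T W
        term1 = matmul(W, matmul(transpose(W), XXtW))  # W W^T X X^T W
        term2 = matmul(XXtW, matmul(transpose(W), W))  # X X^T W W^T W
        W = [[W[i][k] * 2.0 * XXtW[i][k] / (term1[i][k] + term2[i][k])
              for k in range(len(W[0]))] for i in range(len(W))]
        norm = spectral_norm(W)
        W = [[w / norm for w in row] for row in W]     # W <- W / ||W||_2
    return W

# Toy data (hypothetical): 4 genes x 4 cells with two gene groups, K = 2.
X = [[3.0, 3.0, 0.1, 0.1],
     [2.0, 2.0, 0.1, 0.1],
     [0.1, 0.1, 4.0, 4.0],
     [0.1, 0.1, 1.0, 1.0]]
W0 = [[0.5, 0.4], [0.6, 0.3], [0.3, 0.6], [0.4, 0.5]]  # stand-in for abs(W_PCA)

W = pnmf(X, W0)
S = matmul(transpose(W), X)  # K x n score matrix
```

Because every update multiplies a non-negative entry by a non-negative ratio, W stays non-negative without any explicit projection step, which is the reason multiplicative rules are convenient for this kind of constraint.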
Specifically, the j-th column of S is the K-dimensional embedding of the j-th cell; the k-th row of S, denoted by $s_k^\top$, contains the scores (i.e., coordinates) of all n cells in the k-th basis:

$$s_k^\top = w_k^\top X , \tag{2.4}$$

where $w_k$ is the k-th column of W, $k = 1, \ldots, K$.

Algorithm 1: Pseudocode of PNMF implementation in scPNMF
Initialize: $W = \mathrm{abs}(W_{\mathrm{PCA}}) \in \mathbb{R}^{p \times K}_{\ge 0}$
while not converged do
    for $i = 1, \ldots, p$; $k = 1, \ldots, K$ do
        $W_{ik} \leftarrow W_{ik} \dfrac{2 (X X^\top W)_{ik}}{(W W^\top X X^\top W)_{ik} + (X X^\top W W^\top W)_{ik}}$
    end for
    $W \leftarrow W / \lVert W \rVert_2$
end while
Output: $W \in \mathbb{R}^{p \times K}_{\ge 0}$, $S = W^\top X \in \mathbb{R}^{K \times n}_{\ge 0}$

The low rank K needs to be pre-specified in PNMF, as in PCA and NMF. A larger K preserves more information in X but also removes less noise (technical variation of cells that is not of biological interest), impedes the interpretation of W (more bases are more difficult to interpret), and increases the computational burden. To choose K in a data-driven way, we propose an orthogonality measure, which shows that K = 20 is a reasonable choice for multiple scRNA-seq datasets (Section S1.1).

2.2 scPNMF step II: basis selection

The second key step of scPNMF is to select informative bases among the K bases found by PNMF (i.e., columns of W and rows of S) to remove unwanted variations of cells (e.g., variations irrelevant to cell types). The columns of W enjoy high sparsity and mutual exclusiveness; that is, each column contains positive weights corresponding to a unique small set of genes, so it is expected to reflect a certain biological function.
However, some biological functions may not be relevant to the cell heterogeneity of interest, e.g., cell type composition. Motivated by this, we propose three strategies for selecting informative bases (columns of W and rows of S): functional annotations (optional), correlations with cell library sizes, and tests of multimodality.

2.2.1 Strategy 1: examine bases by functional annotations (optional)

The first, optional strategy is to annotate the biological function(s) of each basis in the weight matrix. For example, scPNMF may apply gene ontology (GO) analysis to the top 10% of genes with the highest weights in each basis (column of W) and record the enriched GO terms as the basis' functional annotation. Then, users with prior knowledge can interpret the functional annotation on each basis and decide whether or not to remove the basis. For example, if the goal is to delineate cell types in scRNA-seq data, a basis corresponding to cell-cycle genes should be removed because it would obscure the distinction of cell types. However, it is worth noting that filtering bases by biological annotations is optional in scPNMF. Conservative users can keep all K bases output by PNMF and directly use data-driven basis selection (Section 2.2.2). For our results in this paper, scPNMF removes the bases corresponding to well-known housekeeping genes (Section S2).
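As a small illustration of the gene-ranking part of strategy 1, the sketch below picks the top-weighted genes of one basis; their names would then be passed to a GO enrichment tool, which is not shown. The function name, gene labels, and weights are hypothetical, and the fraction is assumed to be taken over all p genes, which the text does not specify.

```python
def top_weight_genes(weights, gene_names, fraction=0.10):
    """Return the names of the highest-weight genes in one basis (column of W)."""
    # Assumption: the fraction refers to all genes, not only those with non-zero weight.
    n_top = max(1, round(fraction * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [gene_names[i] for i in order[:n_top]]

# Hypothetical basis over 20 genes with three non-zero weights.
gene_names = ["gene%d" % i for i in range(1, 21)]
basis = [0.0] * 20
basis[2], basis[6], basis[11] = 0.9, 0.7, 0.2

top = top_weight_genes(basis, gene_names)  # the top 10% of 20 genes, i.e. 2 genes
```

Because the bases are sparse, a fixed fraction can include genes with zero weight when a basis has few non-zero entries; restricting the ranking to non-zero weights would be an equally plausible reading.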
2.2.2 Data-driven strategies

2.2.2.1 Strategy 2: examine bases by correlations with cell library sizes

Note that the input of scPNMF is a log-transformed unnormalized count matrix for users' convenience. Hence, scPNMF does not adjust for cell library sizes in the computation of W and S in step I. Given that the variance of cell library sizes contributes to unwanted variations of cells [11], it is necessary to remove the bases whose corresponding rows in S are strongly correlated with cell library sizes. We use the total log-transformed counts to approximate the library size of each cell, and we calculate the Pearson correlation between each $s_k$ and the library sizes of the n cells. The strategy is to retain the bases whose Pearson correlations are under a pre-defined threshold, which we set to 0.7 based on empirical observations (Section S1.2).

2.2.2.2 Strategy 3: examine bases by multimodality tests

Another data-driven strategy is to retain the bases whose corresponding scores are multi-modally distributed. If a basis' score vector (row in S) contains n scores with a multimodality pattern, then it is likely to distinguish cell types and should be retained. To implement this strategy, we use the ACR test [31] to check the multimodality of each basis' score vector. The null hypothesis is that the score vector contains n scores sampled from a unimodal distribution, and the alternative hypothesis is that the distribution has more than one mode. After performing multiple multimodality tests, one per basis, we use the Benjamini-Hochberg procedure to set a p-value threshold by controlling the false discovery rate under 1%. The bases whose p-values are under this threshold will be retained.
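The two data-driven checks can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the package's implementation: the function names and toy numbers are hypothetical, the ACR test itself is not implemented (its per-basis p-values are taken as given input), and the two results are returned separately because how they are combined is left to the caller.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bh_reject(pvalues, fdr=0.01):
    """Benjamini-Hochberg: indices of the hypotheses rejected at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:
            k_max = rank  # largest rank whose p-value is under its threshold
    return set(order[:k_max])

def screen_bases(S, X, pvalues, cor_threshold=0.7, fdr=0.01):
    """Return (bases passing the correlation screen, bases deemed multimodal)."""
    library_sizes = [sum(col) for col in zip(*X)]  # total log-counts per cell
    low_cor = {k for k, s in enumerate(S) if pearson(s, library_sizes) < cor_threshold}
    multimodal = bh_reject(pvalues, fdr)
    return low_cor, multimodal

# Toy inputs (hypothetical): 2 genes x 6 cells, 3 bases x 6 cells.
X = [[5.0, 6.0, 5.5, 5.0, 6.0, 5.5],
     [5.0, 6.0, 5.5, 5.0, 6.0, 5.5]]
S = [[1.0, 1.2, 1.1, 1.0, 1.2, 1.1],   # tracks library size
     [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],   # two clear groups of cells
     [2.0, 2.1, 1.9, 2.2, 2.0, 1.8]]   # neither
acr_pvalues = [0.30, 0.0005, 0.20]     # stand-ins for ACR test output

low_cor, multimodal = screen_bases(S, X, acr_pvalues)
```

In this toy case the first basis fails the correlation screen and only the second basis is called multimodal at a 1% false discovery rate.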
In summary, scPNMF step II allows users to use strategy 1 to filter out uninformative bases based on functional annotations if available; then it implements data-driven strategies 2 and 3 to further remove bases that have strong correlations with cell library sizes and exhibit unimodality patterns. The retained bases will have their corresponding columns in W selected and stacked into the selected weight matrix $W_S \in \mathbb{R}^{p \times K_0}_{\ge 0}$, where $K_0$ is the number of selected bases.

2.3 Applications of scPNMF output: informative gene selection and new data projection

The selected weight matrix $W_S$ output by scPNMF has two main applications: selection of a desired number of informative genes and projection of new targeted gene profiling data onto the low-dimensional space defined by $W_S$. Given a gene number M (e.g., 200), scPNMF uses M-truncation, a step to select M rows in $W_S$, resulting in M informative genes and a truncated, selected weight matrix $W_{S,(M)} \in \mathbb{R}^{M \times K_0}_{\ge 0}$ for new data projection.

2.3.1 M-truncation and informative gene selection

We denote the desired number of informative genes by $M \in \mathbb{N}$, with M no larger than the number of non-zero rows in $W_S$. M-truncation has three steps.

1. For each gene i, calculate its largest weight $w_i$ across bases in $W_S$:

$$w_i = \max_{k = 1, \ldots, K_0} (W_S)_{ik} , \quad i = 1, 2, \ldots, p . \tag{2.5}$$

2. Order genes by their maximum weights $w_{(1)} \ge w_{(2)} \ge \cdots \ge w_{(p)}$ and set the truncation threshold as $w_{(M)}$. Identify the first M genes as informative genes.

3. Construct the truncated, selected weight matrix $W_{S,(M)}$: (1) truncate the selected weight matrix $W_S$ by setting all $(W_S)_{ik} < w_{(M)}$ to 0; (2) keep the M rows with non-zero entries and stack them by row into $W_{S,(M)}$ in the order of the informative genes.

In short, scPNMF selects informative genes based on their maximum weights in the selected bases. The rationale is that a gene's maximum weight reflects the gene's contribution to the establishment of the $K_0$-dimensional space, which preserves the n cells' biological variations of interest. Hence, genes with larger maximum weights are more informative in the sense of encoding cells' biological variations. An important application of informative gene selection is to guide the design of targeted gene profiling experiments.

2.3.2 New data projection

Given the selected M informative genes, once new cells are measured by targeted gene profiling on these genes, $W_{S,(M)}$ can be used to project the new cells onto the $K_0$-dimensional space in which the cells of the input scRNA-seq data are embedded. If the input data has cell type annotations, we refer to the input data as reference data, and we can predict the new cells' types from the types of the cells in the reference data. In detail, new data projection has the following steps:

1. Apply scPNMF with M-truncation to the input reference data $X \in \mathbb{R}^{p \times n}_{\ge 0}$ with n cells to obtain the truncated, selected weight matrix $W_{S,(M)}$. Construct $X_{(M)} \in \mathbb{R}^{M \times n}_{\ge 0}$ as a submatrix of X, with rows corresponding to the rows of $W_{S,(M)}$, i.e., the M informative genes.
Hence, the $K_0$-dimensional embeddings of the n cells in the reference data are the columns of

$$S^{\mathrm{Ref}}_{(M)} = W_{S,(M)}^\top X_{(M)} \in \mathbb{R}^{K_0 \times n} . \tag{2.6}$$

2. Denote the targeted gene profiling data of $n'$ new cells with M informative genes measured by $X^{\mathrm{New}}_{(M)} \in \mathbb{R}^{M \times n'}_{\ge 0}$. Note that $X^{\mathrm{New}}_{(M)}$ contains log-transformed counts and has rows (genes) corresponding to the rows of $X_{(M)}$. Project the $n'$ cells to the $K_0$-dimensional space by

$$S^{\mathrm{New}}_{(M)} = W_{S,(M)}^\top X^{\mathrm{New}}_{(M)} \in \mathbb{R}^{K_0 \times n'} . \tag{2.7}$$

3. (Optional) Normalize $S^{\mathrm{New}}_{(M)}$ and $S^{\mathrm{Ref}}_{(M)}$ to remove batch effects, if they exist, by using a single-cell integration method such as Harmony [32].

Now the n reference cells and the $n'$ new cells are in the same $K_0$-dimensional space with biological variations preserved. Then a classifier can be trained on the n reference cells' types and $S^{\mathrm{Ref}}_{(M)}$ for cell type prediction, and it can be used to predict the $n'$ cells' types from $S^{\mathrm{New}}_{(M)}$.

3 Results

3.1 scPNMF outputs a sparse and functionally interpretable representation of scRNA-seq data

We first demonstrate that scPNMF step I, PNMF, outputs a sparse and functionally interpretable gene encoding of cells. We use the FregGold dataset [33], which consists of three cell types (three human lung adenocarcinoma cell lines), and set the basis number K = 5 for demonstration purposes. Both PCA and PNMF learn a weight matrix that can project the original scRNA-seq data onto a 5-dimensional space. Unlike the weight matrix of PCA, which has no zero entries, the weight matrix of PNMF is non-negative and highly sparse, with 42.6% of its entries being zero, and has bases that are largely mutually exclusive (i.e., non-zero entries in different columns correspond to different rows/genes) (Fig. 2a). GO enrichment analysis shows that high weight genes in each PNMF basis are enriched with conceptually similar GO terms, and high weight genes in different PNMF bases are enriched with conceptually different GO terms (Fig. 2b).
This result indicates that PNMF bases correspond to gene groups with distinct functions. On the contrary, the PCA bases do not have good functional interpretations: the high weight genes in each PCA basis are not enriched with conceptually similar GO terms, and different PCA bases share many high weight genes (Fig. S3).

Figure 2: Illustration of the sparse and interpretable projection found by scPNMF. We use the FregGold dataset as an example. (a) Comparison of the weight matrices of PCA and PNMF. Heatmaps visualize the learned weight matrices of PCA (top) and PNMF (bottom), where rows are genes and columns are bases. Red represents positive weights while blue represents negative weights. The rows are ordered by gene-wise hierarchical clustering. Compared to PCA, the weight matrix of PNMF is strictly non-negative, much more sparse and mutually exclusive between bases. (b) GO analysis result of each basis in the weight matrix of PNMF. Texts in black boxes summarize the functions of genes in each basis. The enriched GO terms are almost mutually exclusive, implying that each basis represents a unique gene functional cluster. (c) Statistical tests on each basis in the score matrix of PNMF. Top row: scatter plots of scores and total log-counts (cell library sizes). Each dot represents a cell. Cell scores in bases 1 and 4 are highly correlated with cell library sizes. Bottom row: histograms of cell scores in each basis. Scores in bases 2 and 3 show strong multimodality patterns (adjusted p-value ≤ 0.05). (d) UMAP visualizations of cells based on high weight genes in the unselected bases 1 and 4 and those in the selected bases 2, 3, and 5. Genes in the unselected bases completely fail to distinguish the three cell types, while genes in the selected bases lead to a clear separation of the three cell types.

To further analyze the PNMF bases, we list the top 10 high weight genes in each basis (Table S1), from which we identify many well-known genes with important functions. For instance, basis 1 contains classic housekeeping genes, such as GAPDH [34] and ribosomal protein genes (RPS-) [35]; basis 3 contains well-known tumor-related genes, including EGFR [36] and CDK4 [37]. In particular, the cells of the HCC827 cell line (one of the three cell types) have overall high scores in basis 3 (Fig. S4), a reasonable result because the HCC827 cell line contains an EGFR activating mutation [38]. In summary, scPNMF step I outputs bases representing sparse and functionally interpretable gene sets.

3.2 Basis selection is an essential step in scPNMF

Here we explain why basis selection is an essential step in scPNMF. In the last section, we showed that each PNMF basis of the FregGold dataset approximately represents one functional gene group. It is well known that housekeeping genes (basis 1) and cell-cycle genes (basis 4) are usually irrelevant to cell type distinctions. However, such biological knowledge is not always available or certain.
Therefore, scPNMF mainly relies on the two data-driven strategies, correlations with cell library sizes and multimodality tests (Section 2.2.2), for selecting informative bases. Fig. 2c visualizes the two strategies: cell scores in bases 1 and 4 are highly correlated with cell library sizes (Pearson correlations > 0.9); cell scores in bases 2 and 3 show strong evidence of being multi-modally distributed (adjusted p-value < 0.05). Hence, strategy 2 will not retain bases 1 and 4, and strategy 3 will not retain bases 1, 4, and 5; together, bases 1 and 4, which are flagged by both strategies, will be removed, and bases 2, 3, and 5 will be selected. To verify the effectiveness of basis selection, we use UMAP to visualize cells based on the top 50 high weight genes in the unselected bases 1 and 4 vs. those in the selected bases 2, 3, and 5 (Fig. 2d). We observe that the top genes in the unselected bases completely fail to separate the three cell types, while the top genes in the selected bases perfectly distinguish the three cell types. This result strongly supports that basis selection is a necessary step of scPNMF.

3.3 scPNMF outperforms state-of-the-art gene-selection methods on diverse scRNA-seq datasets

In this section, we demonstrate scPNMF's capacity for informative gene selection. We comprehensively benchmark scPNMF against 11 other informative gene selection methods (Table S2) on seven scRNA-seq datasets (Table S3) using three clustering methods (Louvain clustering, K-means clustering, and hierarchical clustering). For fair benchmarking, the seven scRNA-seq datasets cover both unique molecular identifier (UMI) and non-UMI protocols and include various biological samples.
Using the adjusted Rand index (ARI) as the metric of clustering accuracy, we calculate the ARI values of the three clustering methods on each dataset using 100 informative genes selected by each gene selection method, as 100 genes are commonly used in targeted gene profiling. Fig. 3a shows that scPNMF has overall the highest ARI values across datasets and clustering methods. In particular, scPNMF has the highest average ARI value with each clustering method (Louvain: 0.83; K-means: 0.74; hierarchical clustering: 0.69) and the highest overall average ARI (0.75) across datasets and clustering methods. Note that the mean of the overall average ARI values of all methods except scPNMF is only 0.66.

Figure 3: Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq datasets. (a) Clustering accuracies (ARI values) of three clustering methods based on the informative genes selected. Gene selection methods are ordered from left to right by their average ARI across the three clustering methods and the seven datasets. (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by each method. Genes selected by scPNMF lead to a clear separation between naive cytotoxic T cells and regulatory T cells, while the genes selected by other methods do not.

We further show the UMAP visualization of cells in the Zheng4 dataset based on the informative genes selected by each of the 12 gene selection methods (Fig. 3b).
Only scPNMF leads to a clear separation of naive cytotoxic T cells and regulatory T cells, while the informative genes selected by the other methods, except corFS and irlbaPcaFS, cannot distinguish the two cell types at all.

We also compare the 12 methods under a varying number of informative genes: 20, 50, 200, and 500, the commonly used gene numbers in targeted gene profiling. We observe that the overall average ARI values of scPNMF are consistently higher than those of other methods, across all informative gene numbers (Fig. S6). Moreover, compared with other methods, scPNMF leads to more stable overall average ARI values under varying numbers of informative genes, indicating its stronger robustness to the gene number constraint of targeted gene profiling. These results strongly support the superior performance of scPNMF as an informative gene selection method.

3.4 scPNMF guides targeted gene profiling experimental design and cell-type prediction

In this section, we demonstrate how scPNMF can guide the selection of genes to be measured in a targeted gene profiling experiment, and how scPNMF enables subsequent cell type annotation on the targeted gene profiling data. We design two case studies with paired scRNA-seq reference data and "pseudo" targeted gene profiling data, whose per-cell sequencing depth is higher than that of the corresponding scRNA-seq data.

In the first case study, we use the Zheng8 dataset (measured by the 10x protocol) as the reference dataset.
To generate the pseudo targeted gene profiling data, we use a new single-cell gene expression simulator that captures gene correlations, scDesign2 [39], to generate data with a 100-fold higher per-cell sequencing depth. In the second case study, we use the PBMC10x dataset (measured by the 10x protocol) as the reference dataset, and we use PBMCSmartseq (measured by Smart-Seq2) as the pseudo targeted gene profiling data because Smart-Seq2 has a higher per-gene sequencing depth than 10x does. In both case studies, for each gene selection method, the corresponding pseudo targeted gene profiling datasets only contain the M informative genes selected by the method.

We benchmark scPNMF against the 11 gene selection methods in terms of cell type prediction on the pseudo targeted gene profiling data. To avoid bias toward a specific classification algorithm, we apply three popular algorithms for cell type prediction: random forest (RF) [40], k-nearest neighbors (KNN) [41], and support vector machine (SVM) [41]. In each case study, we first train each classification algorithm on the low-dimensional embeddings of the reference cells $S^{\mathrm{Ref}}_{(M)}$ given the M = 100 informative genes selected by each gene selection method. Then we apply the trained classifier to the low-dimensional embeddings of the cells in the pseudo targeted gene profiling data $S^{\mathrm{New}}_{(M)}$. Table 2 shows that scPNMF leads to the highest average prediction accuracy (0.81) across six combinations (two case studies × three classification algorithms). Moreover, scPNMF achieves the highest accuracy in each combination except Zheng8 + random forest, where it is the second best. These results confirm that scPNMF effectively guides the selection of genes to measure in targeted gene profiling experiments, and it enables accurate cell type annotation on newly generated targeted gene profiling datasets.

4 Discussion

We propose scPNMF, an unsupervised gene selection and data projection method for scRNA-seq data.
The major goal of scPNMF is to select a fixed number of informative genes to distinguish cell types and guide gene selection for targeted gene profiling experiments. Moreover, scPNMF can project a new targeted gene profiling dataset with the selected genes to the low-dimensional space that embeds a reference scRNA-seq dataset. We perform a comprehensive benchmark to evaluate scPNMF in terms of informative gene selection against the state-of-the-art gene selection methods. Our results show that scPNMF consistently outperforms existing methods for a wide range of informative gene numbers (from 20 to 500) on diverse scRNA-seq datasets. We also demonstrate that the informative genes selected by scPNMF can effectively guide gene selection for targeted gene profiling and lead to accurate cell type annotation on targeted gene profiling data based on reference scRNA-seq data.

Table 2: Prediction accuracy of cell types based on 100 informative genes selected by 12 gene selection methods in the two case studies with paired reference scRNA-seq data and targeted gene profiling data

| Method | Zheng8 RF | Zheng8 KNN | Zheng8 SVM | PBMC RF | PBMC KNN | PBMC SVM | Average accuracy |
|---|---|---|---|---|---|---|---|
| scPNMF | 0.85 (0.83,0.87) | 0.80 (0.78,0.83) | 0.87 (0.85,0.89) | 0.84 (0.79,0.88) | 0.84 (0.79,0.88) | 0.67 (0.61,0.73) | 0.81 |
| M3Drop | 0.85 (0.83,0.87) | 0.80 (0.77,0.83) | 0.87 (0.84,0.89) | 0.84 (0.79,0.88) | 0.77 (0.71,0.82) | 0.63 (0.57,0.69) | 0.79 |
| SeuratDISP | 0.84 (0.81,0.86) | 0.78 (0.75,0.81) | 0.86 (0.84,0.88) | 0.80 (0.75,0.84) | 0.75 (0.70,0.80) | 0.64 (0.58,0.70) | 0.78 |
| corFS | 0.80 (0.77,0.82) | 0.75 (0.73,0.78) | 0.82 (0.80,0.85) | 0.82 (0.77,0.86) | 0.81 (0.76,0.86) | 0.62 (0.56,0.68) | 0.77 |
| GiniClust | 0.86 (0.83,0.88) | 0.79 (0.76,0.81) | 0.86 (0.83,0.88) | 0.80 (0.75,0.84) | 0.76 (0.71,0.81) | 0.53 (0.47,0.60) | 0.75 |
| scran | 0.79 (0.76,0.81) | 0.72 (0.69,0.75) | 0.82 (0.80,0.85) | 0.78 (0.72,0.82) | 0.73 (0.67,0.78) | 0.67 (0.61,0.72) | 0.75 |
| SeuratMVP | 0.83 (0.81,0.85) | 0.77 (0.74,0.80) | 0.85 (0.82,0.87) | 0.82 (0.77,0.86) | 0.74 (0.69,0.79) | 0.47 (0.40,0.53) | 0.74 |
| Scanpy | 0.79 (0.77,0.82) | 0.71 (0.68,0.74) | 0.80 (0.78,0.83) | 0.80 (0.75,0.84) | 0.76 (0.71,0.81) | 0.52 (0.46,0.58) | 0.73 |
| SCMarker | 0.77 (0.74,0.79) | 0.68 (0.65,0.71) | 0.74 (0.71,0.77) | 0.77 (0.71,0.81) | 0.71 (0.65,0.76) | 0.45 (0.39,0.52) | 0.69 |
| SeuratVST | 0.73 (0.70,0.76) | 0.68 (0.65,0.71) | 0.75 (0.73,0.78) | 0.74 (0.68,0.79) | 0.68 (0.63,0.74) | 0.40 (0.34,0.46) | 0.67 |
| DANB | 0.71 (0.68,0.73) | 0.69 (0.66,0.71) | 0.75 (0.73,0.78) | 0.73 (0.67,0.78) | 0.74 (0.68,0.79) | 0.28 (0.23,0.34) | 0.65 |
| irlbaPcaFS | 0.68 (0.65,0.71) | 0.61 (0.58,0.64) | 0.71 (0.68,0.74) | 0.71 (0.65,0.76) | 0.77 (0.71,0.82) | 0.16 (0.12,0.21) | 0.61 |

Parentheses are 95% confidence intervals. In the original table, the highest number within each column is underlined.

Besides gene selection and data projection, scPNMF also works as a dimensionality reduction method with good interpretability. Each dimension in the low-dimensional space found by scPNMF can be considered as a new functional "feature" (as a linear combination of correlated and thus functionally related genes). Moreover, the mutual exclusiveness makes the PNMF bases used in scPNMF advantageous over the PCA bases in terms of removing confounding effects. For example, cell-cycle genes obscure the identification of cell types and should be removed from low-dimensional embeddings of cells. For PCA, cell-cycle genes affect many PCA bases, so the popular scRNA-seq pipeline Seurat implements a complicated approach that first calculates "cell-cycle scores" and then regresses each basis (principal component) on these scores to remove the effects of cell-cycle genes [12].
In contrast, cell-cycle genes are concentrated in only one PNMF basis, so it is easy to remove that basis to clear the effects of cell-cycle genes. Therefore, scPNMF has great potential in deciphering cell heterogeneity in single-cell data by working as an interpretable dimensionality reduction method.

The current implementation of scPNMF focuses on single-cell gene expression data. Considering the rapid development of single-cell multi-omics technologies, we plan to extend scPNMF to accommodate other technologies that measure other genomics features, such as chromatin accessibility landscapes measured by single-cell ATAC-seq [42], or even to integrate data across multi-omics datasets. Another note is that the multimodality test for basis selection in scPNMF only accounts for discrete cell types but not continuous cell trajectories. Therefore, other tests or strategies are needed to select informative bases to capture biological variations along continuous cell trajectories.

An important question for gene selection is: how many genes should be selected as informative genes to fully capture the biological variations of interest? In our studies, we observe that, after the informative gene number reaches 200, the clustering accuracies based on the selected informative genes plateau for most gene selection methods including scPNMF. Therefore, 200 genes may be sufficient for capturing biological variations in scRNA-seq data.
However, it remains challenging to decide the minimum number of informative genes, given that the underlying cell sub-population structure is data-specific and might be complex. We plan to explore this problem in the future, possibly using information theory.

Software and code
The R package scPNMF is available at https://github.com/JSB-UCLA/scPNMF.

Acknowledgements
We acknowledge the comments and feedback from the members of the Junction of Statistics and Biology at UCLA (http://jsb.ucla.edu).

Funding
This work was supported by the following grants: NSF DMS-1613338 and DBI-1846216, NIH/NIGMS R01GM120507, PhRMA Foundation Research Starter Grant in Informatics, Johnson and Johnson WiSTEM2D Award, and Sloan Research Fellowship (to J.J.L.); NIH/NINDS R01NS117148 (to R.W.).

Competing interests
None.

References
[1] S Steven Potter. Single-cell RNA sequencing for the study of development, physiology and disease. Nature Reviews Nephrology, 14(8):479–492, 2018.
[2] Kenneth D Birnbaum. Power in numbers: single-cell RNA-seq strategies to dissect complex tissues. Annual review of genetics, 52:203–221, 2018.
[3] Chenxu Zhu, Sebastian Preissl, and Bing Ren. Single-cell multimodal omics: the power of many. Nature methods, 17(1):11–14, 2020.
[4] Olivier Thellin, Willy Zorzi, Bernard Lakaye, B De Borman, Bernard Coumans, Georges Hennen, Thierry Grisar, Ahmed Igout, and Ernst Heinen. Housekeeping genes as internal standards: use and limits. Journal of biotechnology, 75(2-3):291–295, 1999.
[5] Eli Eisenberg and Erez Y Levanon. Human housekeeping genes, revisited. Trends in Genetics, 29(10):569–574, 2013.
[6] Aravind Subramanian, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E Natoli, Xiaodong Lu, Joshua Gould, John F Davis, Andrew A Tubelli, Jacob K Asiedu, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6):1437–1452, 2017.
[7] Arjun Raj, Patrick Van Den Bogaard, Scott A Rifkin, Alexander Van Oudenaarden, and Sanjay Tyagi. Imaging individual mRNA molecules using multiple singly labeled probes. Nature methods, 5(10):877–879, 2008.
[8] Jeffrey R Moffitt, Junjie Hao, Guiping Wang, Kok Hao Chen, Hazen P Babcock, and Xiaowei Zhuang. High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. Proceedings of the National Academy of Sciences, 113(39):11046–11051, 2016.
[9] Fatma Uzbas, Florian Opperer, Can Sönmezer, Dmitry Shaposhnikov, Steffen Sass, Christian Krendl, Philipp Angerer, Fabian J Theis, Nikola S Mueller, and Micha Drukker. Bart-seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis. Genome biology, 20(1):1–16, 2019.
[10] Jamie L Marshall, Benjamin R Doughty, Vidya Subramanian, Philine Guckelberger, Qingbo Wang, Linlin M Chen, Samuel G Rodriques, Kaite Zhang, Charles P Fulco, Joseph Nasser, et al. Hypr-seq: Single-cell quantification of chosen RNAs via hybridization and sequencing of DNA probes. Proceedings of the National Academy of Sciences, 117(52):33404–33413, 2020.
[11] Christoph Hafemeister and Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome biology, 20(1):1–15, 2019.
[12] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177:1888–1902, 2019. doi: 10.1016/j.cell.2019.05.031. URL https://doi.org/10.1016/j.cell.2019.05.031.
[13] Aaron TL Lun, Karsten Bach, and John C Marioni. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome biology, 17(1):75, 2016.
[14] Tallulah S Andrews and Martin Hemberg. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics, 35(16):2865–2867, 2019.
[15] Lan Jiang, Huidong Chen, Luca Pinello, and Guo-Cheng Yuan. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome biology, 17(1):144, 2016.
[16] Fang Wang, Shaoheng Liang, Tapsi Kumar, Nicholas Navin, and Ken Chen. SCMarker: ab initio marker selection for single cell transcriptome profiling. PLoS computational biology, 15(10):e1007445, 2019.
[17] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
[18] Maayan Baron, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo Vetere, Jennifer Hyoje Ryu, Bridget K Wagner, Shai S Shen-Orr, Allon M Klein, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell systems, 3(4):346–360, 2016.
[19] Xun Zhu, Travers Ching, Xinghua Pan, Sherman M Weissman, and Lana Garmire. Detecting heterogeneity in single-cell RNA-seq data by non-negative matrix factorization. PeerJ, 5:e2888, 2017.
[20] Philippe Boileau, Nima S Hejazi, and Sandrine Dudoit. Exploring high-dimensional biological data with sparse contrastive principal component analysis. Bioinformatics, 36(11):3422–3430, 2020.
[21] Zhana Duren, Xi Chen, Mahdi Zamanighomi, Wanwen Zeng, Ansuman T Satpathy, Howard Y Chang, Yong Wang, and Wing Hung Wong. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proceedings of the National Academy of Sciences, 115(30):7723–7728, 2018.
[22] Ghislain Durif, Laurent Modolo, Jeff E Mold, Sophie Lambert-Lacroix, and Franck Picard. Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics, 35(20):4011–4019, 2019.
[23] Shuqin Zhang, Liu Yang, Jinwen Yang, Zhixiang Lin, and Michael K Ng. Dimensionality reduction for single cell RNA sequencing data using constrained robust non-negative matrix factorization. NAR Genomics and Bioinformatics, 2(3):lqaa064, 2020.
[24] Chao Gao and Joshua D Welch. Iterative refinement of cellular identity from single-cell data using online learning. In International Conference on Research in Computational Molecular Biology, pages 248–250. Springer, 2020.
[25] Zi Yang and George Michailidis. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics, 32(1):1–8, 2016.
[26] Joshua D Welch, Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, and Evan Z Macosko. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell, 177(7):1873–1887, 2019.
[27] Zhijian Yuan, Zhirong Yang, and Erkki Oja. Projective nonnegative matrix factorization: Sparseness, orthogonality, and clustering. Neural Process. Lett, pages 11–13, 2009.
[28] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.
[29] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[30] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.
[31] Jose Ameijeiras-Alonso, Rosa M Crujeiras, and Alberto Rodríguez-Casal. Mode testing, critical bandwidth and excess mass. Test, 28(3):900–919, 2019.
[32] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12):1289–1296, 2019.
[33] Saskia Freytag, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research, 7, 2018.
[34] Robert D Barber, Dan W Harmer, Robert A Coleman, and Brian J Clark. GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues. Physiological genomics, 21(3):389–395, 2005.
[35] Nicholas Silver, Steve Best, Jie Jiang, and Swee Lay Thein. Selection of housekeeping genes for gene expression studies in human reticulocytes using real-time PCR. BMC molecular biology, 7(1):33, 2006.
[36] Collin M Blakely, Thomas BK Watkins, Wei Wu, Beatrice Gini, Jacob J Chabon, Caroline E McCoach, Nicholas McGranahan, Gareth A Wilson, Nicolai J Birkbak, Victor R Olivas, et al. Evolution and clinical impact of co-occurring genetic alterations in advanced-stage EGFR-mutant lung cancers. Nature genetics, 49(12):1693–1704, 2017.
[37] Ben O'Leary, Richard S Finn, and Nicholas C Turner. Treating cancer with selective CDK4/6 inhibitors. Nature reviews Clinical oncology, 13(7):417–430, 2016.
[38] Carminia Maria Della Corte, Umberto Malapelle, Elena Vigliar, Francesco Pepe, Giancarlo Troncone, Vincenza Ciaramella, Teresa Troiani, Erika Martinelli, Valentina Belli, Fortunato Ciardiello, et al. Efficacy of continuous EGFR-inhibition and role of Hedgehog in EGFR acquired resistance in human lung cancer cells with activating mutation of EGFR. Oncotarget, 8(14):23020, 2017.
[39] Tianyi Sun, Dongyuan Song, Wei Vivian Li, and Jingyi Jessica Li. scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. bioRxiv, 2020.
[40] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[41] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
[42] Sebastian Pott and Jason D Lieb. Single-cell ATAC-seq: strength in numbers. Genome Biology, 16(1):1–4, 2015.
[43] Angelo Duò, Mark D Robinson, and Charlotte Soneson. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research, 7, 2018.
[44] Jiarui Ding, Xian Adiconis, Sean K Simmons, Monika S Kowalczyk, Cynthia C Hession, Nemanja D Marjanovic, Travis K Hughes, Marc H Wadsworth, Tyler Burks, Lan T Nguyen, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nature biotechnology, pages 1–10, 2020.
[45] Jose Alquicira-Hernandez, Anuja Sathe, Hanlee P Ji, Quan Nguyen, and Joseph E Powell. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome biology, 20(1):1–17, 2019.
[46] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23):7285–7290, 2015.
[47] Itay Tirosh, Benjamin Izar, Sanjay M Prakadan, Marc H Wadsworth, Daniel Treacy, John J Trombetta, Asaf Rotem, Christopher Rodman, Christine Lian, George Murphy, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282):189–196, 2016.
Supplementary Materials

S1 Choice of parameters and robustness analysis

S1.1 Low rank K
scPNMF builds on the objective function of the PNMF method,

$$\min_{W \in \mathbb{R}^{p \times K}_{\geq 0}} \lVert X - W W^{T} X \rVert , \tag{S1}$$

PNMF aims to inherit advantages of PCA such as basis orthogonality and the ability to project new data. However, a key constraint in PCA, $W^{T} W = I$, is sacrificed in order to satisfy the condition $W \geq 0$ in PNMF. To get closer to PCA and thus attain its nice properties, we propose to use the normalized difference between $W^{T} W$ and $I$ to measure the orthogonality of $W$:

$$\text{dev.ortho} = \lVert I - W^{T} W \rVert / K^{2}, \tag{S2}$$

which is also indicative of the performance in the downstream analysis. A method for determining the number of bases, K, follows naturally: we perform PNMF for a sequence of K values, calculate the dev.ortho measure for the $W \in \mathbb{R}^{p \times K}_{\geq 0}$ optimized by PNMF for each K, and then inspect the plot of dev.ortho against K. Users can choose the cutoff where the curve stabilizes or where there is a clear elbow in the graph. In Fig. S1, with the Zheng4 [43] dataset, we demonstrate that (1) the dev.ortho measure is highly correlated with the performance of W in the downstream analysis; (2) in real data applications, the dev.ortho measure shows a clear elbow pattern, which helps users determine K. Empirically, we see that dev.ortho reaches stability at K = 20 for most scRNA-seq data. To provide a suggestion for users and to save computation, we set the default number of bases in scPNMF to K = 20.
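The dev.ortho measure of Eq. (S2) can be illustrated with a minimal sketch. It assumes the unspecified norm is the Frobenius norm, and the helper name `dev_ortho` and the toy matrices are hypothetical; this is not the scPNMF package code.

```python
import math

def dev_ortho(W):
    """Return ||I - W^T W||_F / K^2 for a p x K matrix W given as a list of rows.

    Assumption: the norm in Eq. (S2) is taken to be the Frobenius norm.
    """
    p = len(W)
    K = len(W[0])
    total = 0.0
    for a in range(K):
        for b in range(K):
            # Entry (a, b) of the Gram matrix W^T W.
            gram = sum(W[i][a] * W[i][b] for i in range(p))
            target = 1.0 if a == b else 0.0
            total += (target - gram) ** 2
    return math.sqrt(total) / K ** 2

# Non-negative columns that are orthonormal: the deviation is zero.
W_orth = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Columns that share a gene (first row): the deviation is positive.
W_overlap = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]

print(dev_ortho(W_orth), dev_ortho(W_overlap))
```

In practice one would evaluate such a function on the W returned by PNMF for each candidate K and look for the elbow in the resulting curve.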
S1.2 R0: threshold for correlations between score vectors and cell library sizes in scPNMF step II: basis selection
In real data applications, the threshold R0 for correlations between score vectors and cell library sizes in scPNMF step II (basis selection) needs to be pre-defined. In the field, researchers often specify thresholds to one decimal digit, such as 0.5. Based on K-means clustering on the seven datasets (see Table S3) with different thresholds {0.5, 0.6, 0.7, 0.8, 0.9}, as shown in Fig. S2, we suggest setting R0 = 0.7 for K ≥ 10 and, more conservatively, R0 = 0.8 when the basis number K is small (K < 10).

S2 Functional annotation
We use the R package clusterProfiler [Y] to perform GO analysis. We set the gene ontology to "BP" and the adjusted p-value cutoff to 0.1. The output GO terms are simplified by clusterProfiler. In this paper, we only perform a very conservative filtering based on functionality. We define the common housekeeping gene list as ACTB, ACTG1, B2M, GAPDH, MALAT1. If the top 10 high-weight genes of a basis contain any of these genes, the basis is filtered out.

S3 Data preprocessing
scPNMF performs only minimal data preprocessing to avoid information loss. Denote the scRNA-seq count matrix analyzed by scPNMF as $X^{C} \in \mathbb{N}^{p \times n}$, with rows representing p genes and columns representing n cells. Users obtain the log count matrix $X \in \mathbb{R}^{p \times n}_{\geq 0}$ by taking the log transformation with a pseudo count of 1:

$$X_{ij} = \log\left(X^{C}_{ij} + 1\right), \quad i = 1, \cdots, p, \ j = 1, \cdots, n. \tag{S1}$$

scPNMF takes the log count matrix $X \in \mathbb{R}^{p \times n}_{\geq 0}$ as the input. With the log transformation, the effect of a few extremely large counts is alleviated, and the transformed continuous values are more flexible to model. We introduce the pseudo count of 1 to avoid negative and infinite values in the later PNMF optimization step.

For the scRNA-seq data used in this paper (Table S3), we filtered out genes that are expressed in fewer than 5% of the cells, and then filtered out cells that express fewer than 5% of the remaining genes. Additionally, MALAT1, mitochondrial and ribosomal genes are filtered out for the datasets PBMC10x and PBMCSmartSeq, following the reference paper [44]. Users can adjust the filtering process before they input the log count matrix into scPNMF.

S4 Details in informative gene selection and clustering
In this paper, we compare scPNMF with 11 other informative gene selection methods (Table S2). Some gene selection methods do not let users pre-define an arbitrary gene number; for those methods (e.g., SCMarker [16]), we adjust the tuning parameters until their output gene numbers equal the desired gene number. Therefore, their outputs might not achieve their optimal results.

We apply three clustering algorithms: Louvain clustering (by Seurat), K-means clustering (by the R function kmeans), and hierarchical clustering (by the R function hclust). We perform PCA on informative genes and use the top 20 PCs for clustering. The Adjusted Rand Index (ARI) is used as the metric of clustering accuracy.
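As a concrete illustration of this metric, the following minimal sketch computes the ARI from the contingency counts in its standard pair-counting form. The function name `adjusted_rand_index` and the toy label vectors are hypothetical, and the value returned for degenerate inputs (for example, a single cluster on both sides) is a convention chosen here, not something stated in the text.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pred, true):
    """Adjusted Rand Index between two label vectors of equal length."""
    if len(pred) != len(true):
        raise ValueError("label vectors must have the same length")
    n = len(pred)
    if n < 2:
        raise ValueError("at least two cells are needed")
    n_ls = Counter(zip(pred, true))  # contingency counts n_ls
    a = Counter(pred)                # sizes of the inferred clusters, a_l
    b = Counter(true)                # sizes of the true clusters, b_s
    sum_nls = sum(comb(v, 2) for v in n_ls.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate partitions: return 1.0 by convention (assumption).
        return 1.0
    return (sum_nls - expected) / (max_index - expected)

# Same partition with permuted label names: perfect agreement.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# One cell assigned to the wrong cluster: partial agreement.
partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
print(perfect, partial)
```

Because the index depends only on which cells are grouped together, relabeling the clusters does not change its value.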
ARI is defined as:

$$\mathrm{ARI}(P, T) = \frac{\sum_{l,s} \binom{n_{ls}}{2} - \left[\sum_{l} \binom{a_{l}}{2} \sum_{s} \binom{b_{s}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{l} \binom{a_{l}}{2} + \sum_{s} \binom{b_{s}}{2}\right] - \left[\sum_{l} \binom{a_{l}}{2} \sum_{s} \binom{b_{s}}{2}\right] / \binom{n}{2}}, \tag{S1}$$

where $P = (p_{1}, \cdots, p_{n})$ denotes the inferred cluster labels of the n cells and $T = (t_{1}, \cdots, t_{n})$ denotes the true cluster labels. The indices l and s run over the inferred and true clusters, whose numbers are not necessarily equal. $n_{ls} = \sum_{i} I(p_{i} = l) I(t_{i} = s)$, $a_{l} = \sum_{s} n_{ls}$, $b_{s} = \sum_{l} n_{ls}$. ARI is at most 1, with values near 0 expected for random labelings; an ARI value close to 1 means more accurate inferred clusters.

To minimize the effects of parameter choices (resolution r in Louvain and number of clusters k in K-means and hierarchical clustering), we try a sequence of parameters:

$$r \in \{0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}, \quad k \in \{2, 3, 4, \cdots, 15\}, \tag{S2}$$

and use the average of the three highest ARI values across parameters as the final output.

S5 Details in new data projection and cell type prediction
We use two datasets, Zheng8 and PBMC10x, as the reference scRNA-seq datasets. For the Zheng8 dataset, we first use scDesign2 [39] to learn the underlying parameters, and then simulate a new dataset with the same genes and cell types but 100 times higher sequencing depth than the Zheng8 dataset. For the PBMC10x dataset, we use the PBMCSmartSeq dataset, which measures the exact same sample and contains all genes measured in PBMC10x. Given M selected genes, the simulated Zheng8 and PBMC10x are restricted to those genes and serve as the "pseudo" targeted gene profiling data measuring only M genes.

For cell type prediction, we project every targeted gene profiling dataset and its scRNA-seq reference onto the same low-dimensional space, which mainly follows the idea from scPred [45]. When applying scPNMF, we use the weight matrix $W_{S,(M)}$ to project both the reference dataset and the targeted gene profiling dataset.
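The shared-projection idea can be sketched as follows. This is only an illustration under stated assumptions: the restricted weight matrix is taken to be the rows of a full weight matrix that correspond to the M selected genes, both datasets are log-transformed with a pseudo count of 1 as in Section S3, and all names and numbers (`log_counts`, `project`, the toy genes and counts) are hypothetical.

```python
import math

def log_counts(counts):
    """Log-transform a genes x cells count matrix with a pseudo count of 1."""
    return [[math.log1p(c) for c in row] for row in counts]

def project(W, X):
    """Return the K x n score matrix W^T X for W (p x K) and X (p x n)."""
    p = len(W)
    if len(X) != p:
        raise ValueError("W and X must have the same number of genes (rows)")
    K = len(W[0])
    n = len(X[0])
    return [[sum(W[g][k] * X[g][j] for g in range(p)) for j in range(n)]
            for k in range(K)]

# Toy weight matrix over three genes and two bases (hypothetical values).
genes = ["geneA", "geneB", "geneC"]
W = [[0.9, 0.0], [0.1, 0.2], [0.0, 0.8]]

# Assumption: restricting to the selected genes means keeping their rows of W.
selected = ["geneA", "geneC"]
W_sel = [W[genes.index(g)] for g in selected]

# Rows are the selected genes, columns are cells.
reference_counts = [[5, 0, 2], [0, 7, 1]]
targeted_counts = [[50, 3], [1, 40]]

# The same weights are applied to both datasets, so the scores share one space.
ref_scores = project(W_sel, log_counts(reference_counts))
tgt_scores = project(W_sel, log_counts(targeted_counts))
print(ref_scores, tgt_scores)
```

Using one weight matrix for both datasets is what places the reference and targeted cells in a common low-dimensional space before any batch correction or classification.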
For other gene selection methods, we first subset the reference dataset to the M selected genes, run PCA to get a weight matrix $W_{PCA}$, and use it to project both the reference dataset (with only M genes) and the targeted gene profiling dataset.

After obtaining the two low-dimensional embeddings of the reference and targeted gene profiling data, we run the Harmony algorithm [32] to remove the technical variations between the two embeddings. Then we apply three classification algorithms, random forest (rf), k-nearest neighbors (knn) and support vector machine with radial kernel (svmRadial), from the R package caret [K]. When fitting the training model, we use 5-fold cross-validation with three repeats.

Table S1: Top 10 high weight genes in each PNMF basis of the FreytagGold dataset
Basis | Gene symbol | Description
1 | RPS2, TMSB4X, GAPDH, RPL41, RPL13, FTH1, MALAT1, COX2, RPL10, RPS18 | Highly expressed housekeeping genes
2 | CD74, PTGR1, HLA-B, ALDH3A1, C15orf48, LCN2, IGFBP3, SAA1, CXCL1, HLA-DRA | Immune-related genes
3 | SEC61G, CDK4, CCN1, G0S2, ELOC, VOPP1, EGFR, F3, CDKN2A, EPCAM | Tumor-related genes (oncogenes, tumor suppressor genes)
4 | H4C3, CKS1B, HMGB2, SMC4, PTTG1, KPNA2, CCNB1, CDKN3, CKS2, CDC20 | Genes related to mitotic cell cycle
5 | HSPB1, UBE2S, CALD1, TMEM256, FIS1, ISOC2, ZNHIT1, C20orf27, NDUFA3, PPP2R1A | Genes related to mitochondrion

Table S2: Overview of informative gene selection methods used in this study
Method | User-defined gene # | Language | Package | Reference
corFS | Yes | R | M3Drop (version 1.14.0) | [14]
DANB | Yes | R | M3Drop (version 1.14.0) | [14]
GiniClust | Yes | R | M3Drop (version 1.14.0) | [14]
irlbaPcaFS | Yes | R | M3Drop (version 1.14.0) | [14]
M3Drop | Yes | R | M3Drop (version 1.14.0) | [14, 15]
Scanpy | Yes | Python | Scanpy (version 1.6.0) | [W]
SCMarker | No | R | SCMarker (note 1) | [16]
scran | Yes | R | scran (version 1.18.3) | [13]
SeuratDISP | Yes | R | Seurat (version 3.2.2) | [11, 12]
SeuratMVP | No | R | Seurat (version 3.2.2) | [12]
SeuratVST | Yes | R | Seurat (version 3.2.2) | [12]
Note 1: Due to failure in SCMarker R package installation, we ran the R script downloaded from https://github.com/KChen-lab/SCMarker on September 17, 2020.

Table S3: Overview of datasets used in this study
Dataset | Sequencing protocol | Gene # | Cell # | Cell type # | True label | Description | Ref
Darmanis | Smart-Seq2 | 13256 | 420 | 8 | No | Human adult cortical samples | [46]
FreytagGold | 10xGenomics Chromium | 15410 | 925 | 3 | Yes | Mixture of human lung adenocarcinoma cell lines | [33]
Tirosh | Smart-Seq2 | 11934 | 2887 | 6 | No | Human melanoma tumors | [47]
PBMC10x | 10xGenomics Chromium | 11714 | 3308 | 9 | No | Human peripheral blood mononuclear cells. 10x-v2 for sample 1 in the original paper. | [44]
PBMCSmartSeq | Smart-Seq2 | 17479 | 273 | 6 | No | Human peripheral blood mononuclear cells. Smart-Seq2 for sample 1 in the original paper. | [44]
Zheng4 | 10xGenomics GemCode | 2192 | 3994 | 4 | Yes | Mixture of human peripheral blood mononuclear cells | [43, Z]
Zheng8 | 10xGenomics GemCode | 2390 | 3994 | 8 | Yes | Mixture of human peripheral blood mononuclear cells | [43, Z]

[Figure S1 plot, panel title "Choice of K": dev.ortho and ARI (y-axes) against K (x-axis, 0 to 100).]
Figure S1: Comparison of dev.ortho and K-means ARI against low rank K on the Zheng4 [43] dataset.

[Figure S2 plot, panel title "Choice of R0": ARI (y-axis) against R0 (x-axis, 0.5 to 0.9).]
Figure S2: Comparison of K-means ARI against R0, the threshold for correlations between score vectors and cell library sizes in scPNMF step II: basis selection. The mean ARI and the error bars are calculated across seven datasets (see Table S3).

Figure S3: GO annotation on the weight matrix of PCA. The enriched GO terms largely overlap between bases.
Figure S4: scPNMF scores versus total log-counts of the FreytagGold dataset, colored by cell types. Basis 2 distinguishes H2228 from the other two cell types and basis 3 distinguishes HCC827 from the other two cell types.

Figure S5: Benchmarking scPNMF and other informative gene selection methods using 20, 50, 200, 500 genes.

Figure S6: Comparison of overall average ARI of different methods versus gene numbers. The y-axis indicates the average ARI values across seven datasets and three clustering methods for each gene selection method.
Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164–4169, 2004. [31] Jose Ameijeiras-Alonso, Rosa M Crujeiras, and Alberto Rodrı́guez-Casal. Mode testing, critical bandwidth and excess mass. Test, 28(3):900–919, 2019. [32] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with harmony. Nature methods, 16(12):1289–1296, 2019. [33] Saskia Freytag, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. Comparison of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data. F1000Research, 7, 2018. 35 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430550 http://creativecommons.org/licenses/by-nc-nd/4.0/ [34] Robert D Barber, Dan W Harmer, Robert A Coleman, and Brian J Clark. Gapdh as a housekeeping gene: analysis of gapdh mrna expression in a panel of 72 human tissues. Physiological genomics, 21(3):389–395, 2005. [35] Nicholas Silver, Steve Best, Jie Jiang, and Swee Lay Thein. Selection of housekeeping genes for gene expression studies in human reticulocytes using real-time pcr. BMC molecular biology, 7(1):33, 2006. [36] Collin M Blakely, Thomas BK Watkins, Wei Wu, Beatrice Gini, Jacob J Chabon, Caroline E McCoach, Nicholas McGranahan, Gareth A Wilson, Nicolai J Birkbak, Victor R Olivas, et al. Evolution and clinical impact of co-occurring genetic alterations in advanced-stage egfr- mutant lung cancers. Nature genetics, 49(12):1693–1704, 2017. 
[37] Ben O’leary, Richard S Finn, and Nicholas C Turner. Treating cancer with selective cdk4/6 inhibitors. Nature reviews Clinical oncology, 13(7):417–430, 2016. [38] Carminia Maria Della Corte, Umberto Malapelle, Elena Vigliar, Francesco Pepe, Giancarlo Troncone, Vincenza Ciaramella, Teresa Troiani, Erika Martinelli, Valentina Belli, Fortunato Ciardiello, et al. Efficacy of continuous egfr-inhibition and role of hedgehog in egfr acquired resistance in human lung cancer cells with activating mutation of egfr. Oncotarget, 8(14): 23020, 2017. [39] Tianyi Sun, Dongyuan Song, Wei Vivian Li, and Jingyi Jessica Li. scdesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. bioRxiv, 2020. [40] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [41] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992. [42] Sebastian Pott and Jason D Lieb. Single-cell atac-seq: strength in numbers. Genome Biology, 16(1):1–4, 2015. [43] Angelo Duò, Mark D Robinson, and Charlotte Soneson. A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research, 7, 2018. [44] Jiarui Ding, Xian Adiconis, Sean K Simmons, Monika S Kowalczyk, Cynthia C Hession, Nemanja D Marjanovic, Travis K Hughes, Marc H Wadsworth, Tyler Burks, Lan T Nguyen, et al. Systematic comparison of single-cell and single-nucleus rna-sequencing methods. Nature biotechnology, pages 1–10, 2020. [45] Jose Alquicira-Hernandez, Anuja Sathe, Hanlee P Ji, Quan Nguyen, and Joseph E Powell. scpred: accurate supervised method for cell-type classification from single-cell rna-seq data. Genome biology, 20(1):1–17, 2019. 
36 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430550 http://creativecommons.org/licenses/by-nc-nd/4.0/ [46] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23):7285–7290, 2015. [47] Itay Tirosh, Benjamin Izar, Sanjay M Prakadan, Marc H Wadsworth, Daniel Treacy, John J Trombetta, Asaf Rotem, Christopher Rodman, Christine Lian, George Murphy, et al. Dissect- ing the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science, 352 (6282):189–196, 2016. 37 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. 
Introduction
Methods
scPNMF step I: PNMF
scPNMF step II: basis selection
Strategy 1: examine bases by functional annotations (optional)
Data-driven strategies
Applications of scPNMF output: informative gene selection and new data projection
M-truncation and informative gene selection
New data projection
Results
scPNMF outputs a sparse and functionally interpretable representation of scRNA-seq data
Basis selection is an essential step in scPNMF
scPNMF outperforms state-of-the-art gene-selection methods on diverse scRNA-seq datasets
scPNMF guides targeted gene profiling experimental design and cell-type prediction
Discussion
Choice of parameters and robustness analysis
Low rank K
R0: threshold for correlations between score vectors and cell library sizes in scPNMF step II: basis selection
Functional annotation
Data preprocessing
Details in informative gene selection and clustering
Details in new data projection and cell type prediction

10_1101-2021_02_09_430536 ---- Genome-wide prediction and integrative functional characterization of Alzheimer’s disease-associated genes

Cui-Xiang Lin1,#, Hong-Dong Li1,#, Chao Deng1, Weisheng Liu1, Shannon Erhardt2, Fang-Xiang Wu3, Xing-Ming Zhao4,5, Jun Wang2, Daifeng Wang6,7, Bin Hu8,*, Jianxin Wang1,*

1 Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
2 Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
3 Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
4 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
5 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, China
6 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53705, USA
7 Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA
8 Institute of Engineering Medicine, Beijing Institute of Technology, Beijing 100081, China
# Authors contributing equally
* Correspondence: bh@bit.edu.cn; jxwang@mail.csu.edu.cn

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430536; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

The mechanism of Alzheimer’s disease (AD) remains elusive, partly due to the incomplete identification of risk genes. We developed an approach to predict AD-associated genes by learning the functional pattern of curated AD-associated genes from brain gene networks. We created a pipeline to evaluate disease-gene association by interrogating heterogeneous biological networks at different molecular levels. Our analysis showed that top-ranked genes were functionally related to AD. We identified gene modules associated with AD pathways, and found that top-ranked genes were correlated with both neuropathological and clinical phenotypes of AD on independent datasets. We also identified potential causal variants for genes such as FYN and PRKAR1A by integrating brain eQTL and ATAC-seq data.
Lastly, we created the ALZLINK web interface, enabling users to exploit the functional relevance of predicted genes to AD. The predictions and pipeline could become a valuable resource to advance the identification of therapeutic targets for AD.

Keywords: Alzheimer’s disease; disease gene prediction; functional gene networks

Introduction

Alzheimer’s disease (AD) is a complex and progressive neurodegenerative disorder that accounts for the majority of all dementia cases [1]. Its clinical symptoms include progressive memory loss, personality change, and impairments in thinking, judgment, language, problem-solving, and movement [2]. The two neuropathological hallmarks of AD are extracellular amyloid-β (Aβ) plaques and intracellular neurofibrillary tangles (NFTs), which are known to contribute to the degradation and death of neurons in the brain [3]. The number of patients with AD worldwide is rising. Specifically, approximately 50 million people are estimated to be living with AD or other forms of dementia, and this number is expected to exceed 152 million by 2050 [1]. AD not only causes suffering for patients and their families but also places a severe burden on society. However, drug development for AD is progressing slowly [4], partly due to the incomplete understanding of the neuropathological mechanisms.

AD is partly caused by genetic mutations [4].
Its two subtypes, i.e., early-onset AD (EOAD, onset age before 65 years) and late-onset AD (LOAD, onset age after 65 years), have different genetic risk factors. In EOAD, rare mutations in APP, PSEN1 and PSEN2 have been identified [4]. LOAD is markedly more complex, with APOE being a well-known risk gene for this subtype. Most known or putative AD-associated genes were discovered through genome-wide association studies (GWAS). Previously, GWAS identified CLU, CR1, and PICALM, along with approximately 20 more genes [4]. In addition, network approaches are used to identify AD-associated molecular networks or pathways. For example, a module-trait network approach was proposed and applied to identify gene coexpression modules that were associated with cognitive decline [5], while a large-scale proteomic analysis identified an energy metabolism-linked protein module strongly associated with AD pathology [6]. However, a large proportion of the phenotypic variance in AD cannot be explained by known risk genes [7, 8, 9], which suggests that additional AD-associated genes remain to be discovered. Since experimental approaches are often time consuming and expensive, computational approaches provide a promising alternative for discovering AD-associated genes.

Previous studies have shown that functional gene networks (FGNs) are promising for predicting disease-associated genes [10, 11].
In an FGN, a node represents a gene and the edge between two genes represents the co-functional probability (CFP) that the two genes participate in the same biological process or pathway [12]. For example, Guan et al. constructed a global (i.e., non-tissue-specific) FGN for mice and identified Timp2 and Abcg8 as two novel genes associated with bone-mineral density [13, 14]. Using the same network, Recla et al. discovered Hydin as a new thermal pain gene [13, 14]. Because gene interactions might be rewired in different tissues, global networks cannot reveal the differences of gene networks among tissues. To address this limitation, tissue-specific networks have been proposed to more accurately capture gene interactions in tissues. Greene et al. established 144 human tissue-specific networks and investigated these networks for the interpretation of gene functions and diseases [15]. Using the brain-specific network [15], Krishnan et al. predicted disease genes for autism spectrum disorder [11]. By leveraging the functional genomic data of model species with similar genetic backgrounds, including mice and rats, a human brain-specific network was constructed, and its application to the identification of brain disorder-associated genes was illustrated in our previous work [16].

Because AD is a brain disorder with genetic contributions, we hypothesized that brain-specific FGNs are informative for predicting AD-associated genes. It should be pointed out that our predictions of AD-associated genes do not indicate any causality; that is, the predicted genes may be either directly or indirectly associated with AD. To build models for AD-associated gene prediction, we first compiled AD-associated genes from multiple resources. These genes were used as positives for training models.
We proposed a functional enrichment-based approach to identify negative genes that are unlikely to be associated with AD. Next, we obtained ten brain-specific FGNs from the GIANT [15] and BaiHui [16] databases. After assessing the predictivity of each network by cross-validation of state-of-the-art machine learning models, we built a final model for predicting AD-associated genes through an optimal selection of networks and machine learning methods. We scored all the other human genes that were not used in model training for their association with AD. We created a pipeline to evaluate top-ranked novel candidate genes by interrogating multiple biological networks. We then identified gene modules from an AD-related network. We assessed the association of these modules and top-ranked genes with AD-related phenotypes, including the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) score, Braak stage, and clinical dementia rating (CDR), on an independent dataset. We next identified a set of genes by combining our predictions and seven types of genomic evidence. We further identified potential variants that may affect the expression of prioritized genes. Lastly, we developed the ALZLINK web interface to enable the exploitation of predicted AD-associated genes. The resulting predictions and pipeline could be valuable to advance the identification of risk genes for AD.

Results

Prediction of AD-associated genes

Our approach leverages machine learning and a brain FGN to predict AD-associated genes. The approach consists of three main components: compilation of AD-associated (positive) and non-AD (negative) genes, construction of a feature matrix based on a brain FGN, and prediction of AD-associated genes using machine learning models (Fig. 1). We first compiled a set of AD-associated genes and non-AD genes to train models (see the Methods section; Supplementary Note 1). We showed that the negative genes were superior to those selected by the random sampling approach (Supplementary Fig. 1) and that the negative genes were poorly associated with AD (Supplementary Fig. 2). In addition, we tested their enrichment in three AD-related gene sets from two recent studies [5, 17]: those associated with cognitive decline (the m109 module with 390 genes) [5], amyloid-beta (15 genes), and Tau pathology (28 genes) [17]. The results showed that the negative genes were not enriched in any of the three modules or pathways (p-values = 0.99, 1, and 1, respectively). Next, we extracted a feature matrix for the positive and negative genes based on FGNs. For each gene (positives, negatives, or the other genes), its CFPs with the positive genes in the network were collected into a 147-dimensional feature vector.
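The feature construction described above can be illustrated with a minimal sketch. It assumes the network is available as a mapping from unordered gene pairs to CFP values; the toy network, the placeholder genes GENE_X and GENE_Y, the default of 0.0 for missing edges, and the handling of a gene paired with itself are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List, Sequence

# Assumed representation: an undirected edge (a pair of genes) maps to its CFP.
Network = Dict[FrozenSet[str], float]


def cfp(network: Network, a: str, b: str, default: float = 0.0) -> float:
    """Return the CFP of the pair (a, b), or `default` if no edge is stored."""
    if a == b:
        # Assumption: the network has no self-edges, so a gene paired
        # with itself falls back to the default value.
        return default
    return network.get(frozenset((a, b)), default)


def feature_vector(network: Network, gene: str, positives: Sequence[str]) -> List[float]:
    """One feature per positive gene: the CFP between `gene` and that positive."""
    return [cfp(network, gene, p) for p in positives]


def feature_matrix(network: Network, genes: Sequence[str],
                   positives: Sequence[str]) -> Dict[str, List[float]]:
    """Build a gene -> feature vector table for any set of genes."""
    return {g: feature_vector(network, g, positives) for g in genes}


# Toy example with three positives (the paper uses 147, giving 147 features).
toy_network: Network = {
    frozenset(("APP", "PSEN1")): 0.9,
    frozenset(("APP", "GENE_X")): 0.4,
    frozenset(("PSEN1", "GENE_X")): 0.7,
    frozenset(("APOE", "GENE_Y")): 0.2,
}
positives = ["APP", "PSEN1", "APOE"]
matrix = feature_matrix(toy_network, ["GENE_X", "GENE_Y", "APP"], positives)
```

With 147 positive genes, every row of the table has 147 entries, which matches the dimensionality stated in the text. The same function serves positives, negatives, and unlabeled genes alike.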
We considered the 10 collected brain FGNs (nine from GIANT and one from BaiHui) and evaluated their ability to predict AD-associated genes using state-of-the-art machine learning methods, including logistic regression (LR), support vector machine (SVM), random forest (RF), and ExtraTrees, which were shown to be promising in a previous study [18]. We found that the network in the BaiHui database achieved the best performance based on the four methods tested and that ExtraTrees performed better than the other methods in terms of both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) (Fig. 2A; Supplementary Figs. 3-5). Finally, we selected this network in combination with ExtraTrees to construct the model for predicting AD-associated genes. We performed five-fold cross-validation with ExtraTrees. Each of the five models established during cross-validation was used to score all other human genes that were not included in the training dataset. To achieve robust predictions, we repeated the cross-validation 100 times and calculated an average score for each gene. The average AUROC and AUPRC based on cross-validation were 0.91 and 0.76, respectively, suggesting that the model is accurate. A higher score indicates that a gene is more likely to be associated with AD. The scores for predicted genes are provided in our web interface (www.alzlink.com).
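The score-averaging protocol (five folds, repeated, with every fold model scoring the unlabeled genes) can be sketched as follows. To stay dependency-free, this sketch swaps ExtraTrees for a simple centroid-difference scorer, so it illustrates only the averaging scheme and not the authors' classifier. The function names, the unstratified fold assignment, and the toy data are assumptions.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def fit_centroid_model(X: Sequence[Sequence[float]], y: Sequence[int]) -> Model:
    """Stand-in learner (NOT ExtraTrees): a linear score based on class centroids.

    Assumes the training data contains at least one gene of each class.
    """
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [mean(x[j] for x in pos) for j in range(dim)]
    mu_neg = [mean(x[j] for x in neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary at the midpoint between the centroids.
    b = -sum(wj * (p + n) / 2.0 for wj, p, n in zip(w, mu_pos, mu_neg))
    return w, b


def score(model: Model, x: Sequence[float]) -> float:
    """Higher values mean the gene looks more like the positive class."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def repeated_cv_scores(X, y, X_new, n_splits: int = 5,
                       n_repeats: int = 100, seed: int = 0) -> List[float]:
    """Average the scores of X_new over all n_splits * n_repeats fold models."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    totals = [0.0] * len(X_new)
    n_models = 0
    for _ in range(n_repeats):
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]  # unstratified folds
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            model = fit_centroid_model([X[i] for i in train], [y[i] for i in train])
            for k, x in enumerate(X_new):
                totals[k] += score(model, x)
            n_models += 1
    return [t / n_models for t in totals]


# Toy data: 10 positives and 10 negatives with two features each.
X = ([[0.8 + 0.01 * i, 0.7 + 0.01 * i] for i in range(10)]
     + [[0.1 + 0.01 * i, 0.2 + 0.01 * i] for i in range(10)])
y = [1] * 10 + [0] * 10
scores = repeated_cv_scores(X, y, [[0.9, 0.9], [0.0, 0.1]], n_repeats=10)
```

Averaging over many fold models reduces the dependence of a gene's score on any single random partition of the training genes, which is the stated motivation for repeating the cross-validation. The stand-in scorer returns an unbounded margin rather than a probability, so only the ranking of the averaged scores is meaningful here.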
Our literature search showed that 12 of the 20 top-ranked genes had some evidence of association with AD (Supplementary Table 1), suggesting that our model has captured a molecular signature of AD and makes confident predictions. Note that our prediction of AD-associated genes was based only on the machine learning model; the subsequent analyses, such as enrichment, coexpression, and protein-protein interaction (PPI) relatedness, were used separately to evaluate the association of predicted genes with AD.

The top-ranked genes are functionally related to AD based on multiple lines of genomic evidence

The top-ranked genes are enriched in AD-associated functions and phenotypes

We hypothesize that genes with higher scores are more likely to be enriched in AD phenotype-related gene sets. To test this hypothesis, we excluded all genes in the training dataset, ranked the remaining ones based on their scores, and tested their enrichment in AD-related gene sets. We collected four gene sets associated with AD pathology. The first gene set was collected from AlzGene and contained 277 genes. The other three gene sets, namely, the learning or memory pathway (214 genes), the cognition pathway (247 genes), and the amyloid-beta related pathway (51 genes), were collected from the Gene Ontology (GO) database. Using the decile enrichment test (see the Methods
section), we observed that the top-ranked genes were significantly enriched in the four gene sets: AlzGene (p-value = 7.3×10^-13), learning or memory pathway (p-value = 6.6×10^-12), cognition pathway (p-value = 1.4×10^-11), and amyloid-beta pathway (p-value = 1.1×10^-9) (Fig. 2B).

We next tested whether the top-ranked genes were functionally similar to AD-associated genes. From the ranked genes, we selected the same number of top-ranked genes as the curated positive genes (n = 147). We then performed GO enrichment analysis of both the curated positive genes and the top-ranked genes using PANTHER [19]. The known positive genes and our predicted AD-associated genes were enriched in 771 and 2573 terms, respectively, with 518 of these terms being shared, which was significant compared with the baseline, in which no more than one pathway was shared (p-value < 0.01). The 10 most significant shared terms are listed in Supplementary Table 2. We found that many known AD-related functions, including learning or memory, cognition, regulation of endocytosis, regulation of immune system process, regulation of cell death, and regulation of amyloid-beta formation, were shared pathways, implying that our predicted genes might be involved in AD pathology. Specifically, we tested whether the top-scored genes (score > 0.7) were involved in neuron development. Based on GO enrichment analysis, we found that they were enriched in both neuron development (GO:0048666) (FDR = 3.86×10^-76) and central nervous system neuron development (GO:0021954) (FDR = 5.63×10^-14).

We further tested whether the top-ranked genes overlap with gene modules that were associated with AD in published studies. A recent study identified gene coexpression modules that were related to AD [5].
Module 109 (m109), containing 390 genes, was most strongly associated with cognitive decline. Of these, 350 genes overlapped with the brain FGN used in our work and therefore had predicted scores. We found that 101 genes in m109 were among the top-scored genes (score > 0.7), which was significant compared to the random baseline (p < 0.0001). We also obtained two gene sets from another recently published network association study on AD [17]. For protein phosphorylation events in AD, the study derived 28 kinases that were possibly implicated in AD, 22 of which had scores > 0.7. Among the 14 genes in the amyloid-beta correlated cascade reported by the authors (after removing CLU because it is in the training set), nine had scores > 0.7. These results provide additional evidence that our predicted genes are associated with AD.

The top-ranked genes show higher sequence similarity with AD-associated genes

We evaluated whether the sequences of the top-ranked genes were similar to those of AD-associated genes using the sequence similarity method (see the Methods section). Let k ∈ {100, 200, 500} denote the number of top-ranked genes for testing. We found that the top-ranked genes had significantly higher sequence similarity with AD-associated genes than randomly selected genes (p-value < 0.0001, Supplementary Fig. 6). Taking the top-ranked 200 genes as an example (Fig.
2C), the standardized SEQSIM-score was 6.09, which was significantly higher than that of the randomly selected genes (SEQSIM-score = -0.0006). The sequence similarity implies functional similarity between predicted and known AD-associated genes.

The top-ranked genes are coexpressed with AD-associated genes

For the top-ranked k ∈ {100, 200, 500} genes, we showed that they were coexpressed with more AD-associated genes than the random baseline on the independent Mayo RNA-seq dataset [20] (p-value < 0.0001) (Supplementary Fig. 7; see Methods). For example, the number of coexpressed gene pairs between the top-ranked 200 genes and the AD-associated genes was significantly higher than that of randomly selected genes (p-value < 0.0001, Fig. 2C), suggesting an association of our top predicted genes with AD.

The top-ranked genes interact strongly with AD-associated genes in PPI networks

We hypothesized that the top-ranked k genes would be more likely to interact with AD-associated genes if the prediction were accurate. We obtained PPI networks from two databases: HuRI and STRING (see Methods). To avoid circularity, we removed from the two databases those interactions that were used to construct the brain FGN. We found that the top-ranked k ∈ {100, 200, 500} genes showed significantly more interactions with AD-associated genes than randomly selected genes (p-value < 0.0001, Supplementary Fig. 8).
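A comparison against randomly selected genes of this kind is commonly implemented as a permutation (random sampling) test. The sketch below is one such implementation under stated assumptions: the number of draws, the add-one p-value estimate, the choice of background pool, and all gene names (G0, AD1, and so on) are hypothetical and are not taken from the paper's Methods.

```python
import random
from typing import Iterable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def count_interactions(genes: Set[str], ad_genes: Set[str], edges: Iterable[Edge]) -> int:
    """Count undirected edges that link a gene in `genes` to a gene in `ad_genes`."""
    n = 0
    for a, b in edges:
        if (a in genes and b in ad_genes) or (b in genes and a in ad_genes):
            n += 1
    return n


def empirical_p_value(top_genes: Sequence[str], background: Iterable[str],
                      ad_genes: Set[str], edges: List[Edge],
                      n_draws: int = 1000, seed: int = 0) -> Tuple[int, float]:
    """Compare the observed count with counts for random gene sets of equal size."""
    rng = random.Random(seed)
    observed = count_interactions(set(top_genes), ad_genes, edges)
    pool = sorted(background)  # sorted so that sampling is reproducible
    hits = 0
    for _ in range(n_draws):
        sample = set(rng.sample(pool, len(top_genes)))
        if count_interactions(sample, ad_genes, edges) >= observed:
            hits += 1
    # Add-one estimate so the reported p-value is never exactly zero.
    return observed, (hits + 1) / (n_draws + 1)


# Toy data: five "top" genes, each linked to two AD genes, plus two other edges.
ad_genes = {"AD1", "AD2", "AD3"}
background = {f"G{i}" for i in range(50)}
top_genes = ["G0", "G1", "G2", "G3", "G4"]
edges = ([(g, ad) for g in top_genes for ad in ("AD1", "AD2")]
         + [("G10", "AD3"), ("AD1", "G20")])
observed, p_value = empirical_p_value(top_genes, background, ad_genes, edges)
```

With this estimator the smallest reportable p-value is 1 / (n_draws + 1), so a threshold such as 0.0001 requires on the order of 10,000 random draws. The same scheme applies to the coexpression and miRNA-sharing comparisons by replacing the edge list.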
Taking the top-ranked 200 genes as an example, the total number of interactions with AD-associated genes was 48 in HuRI, whereas only 11 interactions were found for the randomly selected genes (p-value < 0.0001, Fig. 2C).

The top-ranked genes are associated with AD based on miRNA-target networks

miRNAs are important post-transcriptional regulators and have been implicated in AD [21]. We investigated whether top-ranked genes were functionally related to AD-associated genes or miRNAs. First, we observed that they shared more miRNAs with AD-associated genes than randomly selected genes did (Supplementary Fig. 9; Methods). For instance, the top-ranked 200 genes shared a significant number of miRNAs with AD-associated genes (Fig. 2C, p-value < 0.0001). Second, we found that the top-ranked genes interacted with a significant number of AD-associated miRNAs (Fig. 2C; Supplementary Fig. 9). These results imply that top-ranked genes are likely to be involved in post-transcriptional regulatory pathways associated with AD.

AD-related regulatory networks reveal hub genes and hub miRNAs associated with AD

We constructed two regulatory networks. One is a transcriptional regulatory network (TRN) extracted from the TRRUST database [22] (version 2.0) that included only known and top-ranked AD-associated genes (Fig. 3A and the Methods section). From this network, we identified hub genes based on outdegrees and indegrees.
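Degree-based hub identification in a directed regulatory network can be sketched as below. The edge list is a small toy subset built from regulator-target pairs named in this section, not the full TRRUST-derived network, and the cutoff `top_n` is an assumption, since the text does not state how many genes were called hubs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def degree_tables(edges: Iterable[Edge]) -> Tuple[Counter, Counter]:
    """Return (outdegree, indegree) counters for a directed edge list."""
    out_deg: Counter = Counter()
    in_deg: Counter = Counter()
    for regulator, target in set(edges):  # set() drops duplicate edges
        out_deg[regulator] += 1
        in_deg[target] += 1
    return out_deg, in_deg


def hubs(degree: Counter, top_n: int = 2) -> List[Tuple[str, int]]:
    """Rank genes by degree (descending), breaking ties alphabetically."""
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy subset of regulator -> target edges mentioned in the text.
toy_edges = [
    ("RELA", "APOE"), ("RELA", "BACE1"),
    ("JUN", "APP"), ("JUN", "BCL3"), ("JUN", "RELB"), ("JUN", "PLAU"),
]
out_deg, in_deg = degree_tables(toy_edges)
tf_hubs = hubs(out_deg)
```

In this toy subset JUN has the larger outdegree only because more of its targets are listed by name; in the full network the text reports 13 regulated AD-associated genes for RELA and 11 for JUN.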
Genes with outgoing edges (outdegree) represent transcription factors (TFs), and genes with incoming edges (indegree) represent target genes. The other regulatory network is a miRNA-target interaction network (Fig. 3B) extracted from miRTarBase [23] (version 7.0) by considering only AD-associated genes and miRNAs (Methods).

We found that the hub genes in the AD-related TRN were supported by the literature and interaction evidence (Table 1). For example, RELA regulates 13 AD-associated genes including APOE and BACE1, interacts with 8 AD-associated genes in PPI networks, and is coexpressed with 16 AD-associated genes. Furthermore, RELA was shown to be associated with neuroprotection, learning, and memory [24, 25]. Another hub gene is JUN. It regulates 11 known AD-associated genes such as APP, BCL3, RELB, and PLAU, and interacts with the proteins encoded by 10 AD-associated genes such as MS4A2 and GSK3B. In addition, JUN is responsible for Aβ-induced neuroinflammation through a signaling pathway [26].

We identified genes such as CCND1 and CDKN1A as hubs in the miRNA-based regulatory network (Fig. 3B). Although some studies have reported their associations with AD [27, 28], the mechanisms underlying these associations are not well understood. These genes might contribute to AD by perturbing the post-transcriptional regulatory network mediated by miRNAs (Table 1 and Fig. 3B).
For example, CCND1 was associated with 16 miRNAs that also bind to known AD-associated genes, including six miRNAs (miR-16-5p, miR-106b-5p, miR-106a-5p, miR-20a-5p, miR-17-5p and miR-101-3p) that bind to APP and four miRNAs (miR-29b-3p, miR-186-5p, miR-29c-3p and miR-124-3p) that bind to BACE1. In addition, knockout experiments of CCND1 showed its protective role in neurodegeneration in the hippocampus [29]. Comparing the two networks focusing on only predicted (Fig. 3B) and known (Fig. 3C) AD-associated genes, we observed hub miRNAs such as miR-17b-5p, miR-26b-5p, miR-155-5p, miR-124-3p, and miR-106b-5p that were shared between them, indicating that the shared miRNAs might play roles in the pathology of AD.

Gene modules in the integrated gene interaction network are associated with AD-related functions, neuropathological and clinical phenotypes in independent data

We constructed an integrated gene interaction network by aggregating multiple lines of genomic evidence and identified four gene modules with a community cluster algorithm (Methods). The modules (denoted by M1, M2, M3, and M4) are shown in Fig. 4 (the genes in each module are provided in Supplementary Table 3). For each module, we performed enrichment analysis using PANTHER [19] and identified the significantly enriched biological process terms (FDR < 0.05). As many of the enriched terms were redundant, we selected representative GO terms with REVIGO [30].
All four modules were enriched in AD-associated biological processes (Fig. 4). For example, M1 was enriched in regulation of cell death and regulation of neurogenesis; M2 was enriched in functions including response to amyloid-beta; M3 was enriched in learning or memory and regulation of synaptic plasticity; M4 was enriched in functions such as regulation of lipid transport and cholesterol efflux. These enrichments imply that the gene modules are not only biologically meaningful but also related to AD.

Next, we tested whether the modules were correlated with AD-related traits using a well-established method [31]. For each module, we extracted the gene expression matrix containing only the genes in that module. We then computed the eigengene (i.e. the first principal component) of the expression matrix and correlated the eigengene with the AD-related traits of interest. We performed this analysis on the independent MSBB RNA-seq dataset, which provides data for three traits: the CERAD, Braak and CDR scores. We conducted a total of twelve correlation tests resulting from all combinations of the four modules and the three traits. We found that the results of all correlation tests were significant (FDR < 0.05), suggesting that our identified modules were associated with AD traits. Taking the eigengene of M1 as an example, it was significantly correlated with the CERAD (r = -0.37, FDR = 2.2×10^-7), Braak (r = -0.41, FDR = 1.5×10^-8), and CDR scores (r = -0.42, FDR = 6.1×10^-9) (Figure 4B). Another example was M2, whose eigengene was significantly correlated with the three traits (Figure 4B). The correlations of M3 and M4 with the AD-related traits are provided in Supplementary Fig. 10.
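The eigengene analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pearson` and `module_eigengene`, the use of power iteration to obtain the first principal component, the per-gene centering, and the sign convention are all assumptions made for illustration, and the FDR correction over the twelve tests is omitted.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def module_eigengene(expr, n_iter=200):
    """First principal component (one score per sample) of a module.

    expr is a genes-by-samples list of lists. Power iteration is used here
    only to keep the sketch dependency-free (an illustrative choice).
    """
    n_samples = len(expr[0])
    # Center each gene across samples so the component captures co-variation.
    centered = []
    for row in expr:
        mean = sum(row) / n_samples
        centered.append([value - mean for value in row])

    rng = random.Random(0)  # fixed seed: deterministic starting vector
    v = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for _ in range(n_iter):
        # u = A v (one value per gene), then v = A^T u (one value per sample).
        u = [sum(a * b for a, b in zip(row, v)) for row in centered]
        v = [sum(centered[g][s] * u[g] for g in range(len(centered)))
             for s in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]

    # Sign convention (assumption): orient along the module's mean profile.
    mean_profile = [sum(row[s] for row in centered) / len(centered)
                    for s in range(n_samples)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v
```

Given a module's expression matrix and a per-sample trait vector (for example CERAD scores), `pearson(module_eigengene(expr), trait)` gives the module-trait correlation; repeating this for four modules and three traits yields the twelve tests mentioned in the text.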
Individual top-ranked genes are associated with neuropathological and clinical phenotypes on independent datasets

We hypothesized that the top-ranked genes were more likely to be associated with AD-related phenotypes if our prediction was accurate. We tested this hypothesis using the independent MSBB RNA-seq dataset described above. For each gene, we calculated its PCC with the CERAD, Braak and CDR scores (see the Methods section). To better investigate the trends between our prediction and the gene’s absolute correlation with AD-related phenotypes, we ranked all the predicted genes, divided them into 50 bins, and calculated the mean PCC for each bin. We found that higher ranks (higher predicted scores) were associated with higher mean PCC values for all three phenotypes. The predicted ranks were well correlated with the CERAD (r = 0.68), Braak (r = 0.70) and CDR (r = 0.73) scores. The eigengenes for the top-ranked 100, 200 and 500 genes were all significantly correlated with CERAD, Braak and CDR scores (Supplementary Fig. 11).

We then examined the correlations of individual top-ranked genes (those not included in the training set) with AD-related phenotypes [5]. Among the top-ranked 200 genes, we identified 95, 98 and 108 genes that were significantly correlated with CERAD, Braak and CDR scores, respectively (FDR < 0.05). Of them, 84 were correlated with all three phenotypes (Supplementary Table 4). For FYN, the correlations with CERAD, Braak and CDR scores were 0.37, 0.35 and 0.37, respectively, while PRKAR1A had Pearson correlation coefficients of -0.25, -0.31 and -0.29 for the three traits.
These results indicate that our top-ranked genes were likely candidate genes for AD.

Multiple evidence-supported AD-associated genes and their regulatory variants

In the above sections, we have shown that the top-ranked genes are associated with AD based on multiple lines of functional genomic evidence. Here we performed further screening for AD-associated genes by aggregating this evidence, which is divided into two categories: (1) molecular interaction evidence reflecting the interaction of predicted genes with compiled AD-associated genes, and (2) phenotypic correlation evidence supported by correlation of predicted genes with AD traits. The former includes three types of evidence: protein interaction, mRNA coexpression, and miRNA sharing with AD-associated genes. The latter includes four types of evidence: the correlations with CERAD, Braak and CDR scores based on the MSBB dataset, and differential expression based on the ROSMAP dataset [32].

To narrow down the predicted candidates, we focused on the top-ranked 200 genes (after excluding the compiled AD-associated genes). The seven types of genomic evidence for these genes are visualized as a Circos plot (Figure 5), from which the evidence for each gene can be easily identified. We also obtained their enriched GO biological process terms and showed the functional annotation of these genes (Figure 5).
We then applied strict criteria on functional evidence to screen for potentially confident AD-associated genes. That is, each gene was allowed to lack at most one type of molecular interaction evidence and at most one type of phenotypic correlation evidence. From this, 36 out of the top-ranked 200 genes were retained (Supplementary Table 5), providing a set of multiple evidence-based candidate genes to the community for further functional experiments. As the function of a gene is directly related to the cell type it is expressed in, we further investigated the cell type specificity of their expression. Zhang et al. provide a set of genes that show cell type-specific expression in five major brain cell types: astrocytes, microglia, endothelial cells, oligodendrocytes and neurons [33]. Using this dataset, we found that 14 of the 36 genes showed specific expression in cell types such as astrocytes and microglia (Supplementary Table 6), while the others are expressed in two or more cell types.

Taking FYN as an example, it encodes a membrane-associated tyrosine kinase that is implicated in the control of cell growth and shows specific expression in astrocytes (Supplementary Table 6). It interacts with proteins encoded by 13 AD-associated genes such as APP and MAPT in PPI networks, shows significant coexpression with 10 AD-associated genes such as CLU, and interacts with 5 AD-associated miRNAs such as hsa-mir-106b.
Its expression was up-regulated based on the ROSMAP dataset (posterior error probability (PEP) = 0.04) [32]. Its up-regulation in AD patients was further supported by the positive correlation with CERAD (PCC = 0.37), Braak (PCC = 0.35) and CDR (PCC = 0.37) scores (FDR < 0.001) on the MSBB dataset. The expression of FYN for the sample groups partitioned based on CERAD, Braak and CDR scores is shown in Figure 5A. PRKAR1A encodes a regulatory subunit of the cAMP-dependent protein kinases involved in the cAMP signaling pathway. It is functionally related to AD-associated genes through PPI, coexpression and miRNA-target networks, and its expression is negatively correlated with the above three neuropathological traits (Figure 5A). Altered expression of PRKAR1A in AD patients has also been reported [34], providing independent evidence supporting our prediction.

Having shown that the expression level of the above genes was correlated with AD traits, we next explored which genetic variants (SNPs) might causally regulate the expression of these genes by integrating genetic and regulatory data. A SNP is likely causal if it is not only an eQTL but also resides in a transcription factor binding site (TFBS) within the promoter of the target gene [34]. By integrating eQTL and ATAC-seq data, we identified seven genes (FYN, PRKAR1A, PPP3R1, BMPR1A, LMNA, EGFR and KRAS) whose eQTLs are also located in a TFBS (Supplementary Table 7).
For instance, the SNP rs61202914 is an eQTL for the expression of a FYN isoform. Further, we found that this SNP also resided in the TFBS of multiple transcription factors within the promoter region of FYN, thus likely affecting the binding affinity of the transcription factor and therefore the expression level. As an illustration, RFX1_HUMAN.H11MO.0.B, which is a motif representing the TFBS of the transcription factor RFX1, harbors the SNP rs61202914 (Figure 6B). This evidence suggests that rs61202914 is likely a variant causally affecting the expression of FYN. For PRKAR1A, one TFBS in its promoter region harbors its eQTL (rs8080306) (Figure 6B), indicating that rs8080306 is likely a causal variant that regulates the expression of PRKAR1A. To summarize, our integrated analysis of eQTLs and TFBSs in active promoters suggests potential genetic variants that may be associated with AD through regulating the expression of their corresponding target genes. These results may be valuable for prioritizing genes for further experimental studies.

ALZLINK: a web resource for interrogating AD-associated genes

To facilitate the interrogation of AD-associated genes and the use of the statistical evaluation pipeline developed in this work, we created the interactive web resource ALZLINK (available at: www.alzlink.com). This site provides the predicted genes along with their predicted scores and functional genomic evidence, helping experts in the field of AD select candidates for further experimental testing. Also, the statistical methods to evaluate the association of an individual gene or a gene set with AD are implemented and available as an online pipeline. For an individual gene, users can query its interactions with known AD-associated genes in heterogeneous interaction networks and its correlation with AD-related traits including CERAD, CDR and Braak scores. For a gene set, users can statistically test its association with AD using the sequence or network-based methods, which output the distribution of the test metric along with a p-value measuring the significance. For each interaction network such as PPI, the local network involving the queried gene or gene set and the known AD-associated genes is visualized on the web. The data and pipelines on ALZLINK could serve as a valuable resource for experts to prioritize AD-associated genes for further testing.

Discussion

AD is a neurodegenerative disease with heterogeneous pathologies [8, 35, 36, 37]. Predicting AD-associated genes is challenging because AD, as a complex disease, is caused mainly by common variants of multiple genes and the disruption of related pathways. FGNs are an important model for characterizing complex functional relationships between genes and have been successfully applied to predict candidate genes for complex diseases, including autism [11] and Parkinson’s disease [38]. Since AD is caused by gene dysregulation in the brain, we considered brain FGNs as the basis for predicting AD-associated genes. The key idea of our approach was to discover the pattern of AD-associated genes from a brain FGN using machine learning methods. Using our model, we were able to predict novel candidate genes for AD.
We evaluated the association of top-ranked genes with AD by investigating their enrichment in AD-related functions and phenotypes along with examining their association with AD through multiple heterogeneous biological networks. We found that the top-ranked genes were associated with AD. Based on the analyses of the independent MSBB data, we observed that the top-ranked genes were correlated with AD-related neuropathological (CERAD and Braak scores) and clinical (CDR) phenotypes, suggesting that they were likely associated with AD. We also explored gene modules from the AD-related network. We found that these modules were enriched in many AD-related pathways and phenotypes and were also correlated with three AD-related phenotypes, indicating their biological relevance. Combining the genomic data and our predictions, we identified a set of 36 genes whose association with AD was supported by multiple lines of evidence, marking these genes as promising candidates. We further identified potential causal variants for 7 of the 36 genes by integrating brain eQTL and ATAC-seq data.

Our contributions are mainly three-fold. First, we compiled a set of genes that were likely related to AD by performing an intensive, stringent hand curation of multiple resources, providing a potential resource for the community. For negative gene selection, we proposed a pathway-based approach that works by removing any gene that was likely to be associated with AD.
Thus, the remaining genes can be expected to be reliable negatives. We illustrated that this approach helped improve the accuracies of models in terms of both AUROC and AUPRC. Our model for predicting AD-associated genes depends on the non-AD (negative) genes. Different ways of negative gene selection could lead to bias in the model and thus the prediction. As our method selects negative genes by removing any gene that has a potential association with AD, a possible bias is that the predicted genes are more likely to be functionally related to and share GO terms with the compiled AD-associated genes. Second, we predicted novel candidate genes and showed that the top-ranked genes exhibit significant associations with AD through functional enrichment analysis and the investigation of multiple biological networks. Moreover, the genes were found to be correlated with AD-related phenotypes on independent datasets. Taking advantage of the functional genomic data, we identified a set of 36 AD-associated genes supported by multiple lines of evidence, indicating promising candidates. Third, we developed ALZLINK, a web interface to facilitate the use of the data and pipeline developed in this study. It should be pointed out that the pipeline to evaluate the relevance of the predicted genes to AD is generic and can be applied to other diseases.
Although our predictions are promising, as supported by our systematic analysis, our model for predicting AD-associated genes could be improved in several ways. First, our predictions were made at the gene level without differentiating the splice isoforms generated from the same gene through alternative splicing [39, 40]. This distinction is essential because isoforms of the same gene might have different or even opposite functions. Isoforms have been implicated in diseases such as ovarian cancers [41]. The prediction of AD-associated genes at the isoform level could advance our understanding of AD. Second, the human brain consists of multiple heterogeneous structures, each of which contains many different cell types. The association of the predicted genes with AD in different cell types remains to be resolved. Integrating single-cell genomic data [42, 43, 44] with our predicted genes could be helpful for addressing this question. Lastly, our predictions do not imply causality. The genes predicted using our method are statistically associated with AD, which does not by itself establish a causal role.

In summary, we predicted novel AD-associated genes and provided evidence for their association with AD. However, further studies are needed to test the validity of our predictions. This pipeline of prediction and validation is generic and can be readily used for other diseases, such as Parkinson’s disease, cancers and heart diseases.
We expect that the predicted genes might become a useful resource for experimental testing by the community and that our proposed pipeline could be used in other diseases.

Methods

Compilation of AD-associated and non-AD genes

AD-associated (positive) and non-AD (negative) genes are needed to build a machine learning model. First, we performed intensive hand-curation to identify confident AD-associated genes from various disease gene resources, including AlzGene [45], AlzBase [46], OMIM [47], DisGenet [48], DistiLD [49], UniProt [50], Open Targets [51], GWAS Catalog [52], differentially expressed genes (DEGs) in ROSMAP [32] and published literature. The curated genes from each resource as well as the corresponding criteria are provided in Supplementary Note 1. As the AD-associated genes and their reliability vary across these resources, we applied a voting strategy and selected only those that were present in at least two resources to ensure higher reliability (see details in Supplementary Note 1). In this way, we obtained 147 AD-associated genes. Second, we selected a set of non-AD genes, which had no or minimal association with AD. The main idea of our method for non-AD gene selection was to remove any genes that exhibit potential associations with AD. We removed genes that (i) were annotated to the same Gene Ontology (GO) term enriched for the AD-associated genes or (ii) showed any association with AD based on the above-described resources (see details in Supplementary Note 1).
In this way, we identified 1651 non-AD genes.

Model development for predicting AD-associated genes

We first constructed the feature matrix for all human genes based on the brain-specific FGN. This FGN was built by integrating heterogeneous functional genomic data, including gene expression, protein-protein interaction (PPI), protein docking and gene-to-phenotype annotation using the well-established Bayesian framework [16]. The Bayesian network model predicts a co-functional probability (CFP) for every pair of genes by using the following formula:

$$P(y = 1 \mid F_1, F_2, \ldots, F_n) = \frac{1}{C}\, P(y = 1) \prod_{i=1}^{n} P(F_i \mid y = 1) \qquad [1]$$

where P(y = 1) is the prior probability for a sample (i.e. a gene pair in this study) to be positive, P(F_i | y = 1), i = 1, 2, …, n, is the probability of observing the value of the i-th feature under the condition that the gene pair is functionally related, and C is a constant normalization factor. In the resulting network, a node is a gene, and an edge carries the CFP, i.e. the probability that the two linked genes participate in the same biological process or pathway.

For each gene, we extracted its CFP with the compiled AD-associated genes (147 genes) from the network as features based on a previously proposed method [18]. As a result, each gene is characterized by a 147-dimensional vector. The feature data for the training set (147 positives and 1651 negatives, resulting in a total of 1798 genes) are represented by a 1798x147 matrix X. The label (1 for positives and 0 for negatives) of each gene is stored in a vector y. The feature matrix of all other genes not in the training set was also extracted.

To develop a model for predicting AD-associated genes, we compared different combinations of FGNs and machine learning models. To identify optimal FGNs for feature matrix construction, we obtained ten networks for the whole brain or brain regions, including the brain, forebrain, frontal lobe, temporal lobe, hippocampus, thalamus, amygdala, glia and astrocytes from the GIANT database [15] and the BaiHui database [16]. We considered these ten regions because they have been implicated in AD [53, 54]. As AD-associated genes are likely to operate in immune cells [55, 56], we investigated how well immune cells were represented in these networks. As microglia are the dominant immune cells in the brain and cell type-specific genes are indicators of the cell type of interest, we analyzed how microglia-specific genes were represented in these networks. We obtained a set of microglia-specific genes from a previous study [33]. We found that more than 95% of them existed in each of these networks, suggesting that immune cells are well represented in these networks. For the machine learning models, we considered logistic regression (LR), support vector machine (SVM), random forest (RF) and extremely randomized trees (ExtraTrees) for their promising accuracy shown in our previous work [18].

Statistical assessment of the relevance of top-ranked genes to AD

We evaluated the relevance of the top-ranked genes to AD using the following methods (the genes in the training set were excluded). These methods are based on the sequence, pathway and various biological networks, as described below.
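As a brief aside on the model development section above, the feature construction (one CFP per compiled AD-associated gene, giving a 147-dimensional vector per gene) can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based network representation, and the choice to treat a missing edge as a CFP of 0.0 are assumptions, not details taken from the paper.

```python
def build_feature_matrix(cfp, genes, ad_genes):
    """Return one row per gene holding its CFP with each AD-associated gene.

    cfp maps frozenset({gene_a, gene_b}) to a co-functional probability.
    Pairs absent from the network are given 0.0 (an assumption of this sketch).
    """
    return [[cfp.get(frozenset((gene, ad_gene)), 0.0) for ad_gene in ad_genes]
            for gene in genes]


def build_labels(genes, positives):
    """Label vector y: 1 for compiled AD-associated genes, 0 otherwise."""
    positive_set = set(positives)
    return [1 if gene in positive_set else 0 for gene in genes]


# Toy network with two AD-associated genes instead of 147.
toy_cfp = {
    frozenset(("G1", "APP")): 0.9,
    frozenset(("G1", "MAPT")): 0.2,
    frozenset(("G2", "APP")): 0.4,
}
toy_X = build_feature_matrix(toy_cfp, ["G1", "G2", "APP"], ["APP", "MAPT"])
toy_y = build_labels(["G1", "G2", "APP"], ["APP", "MAPT"])
```

With the full training set this would produce the 1798x147 matrix X and the label vector y described in the text, which could then be passed to any of the classifiers the authors compared.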
Decile enrichment test for AD pathways and phenotypes

If the prediction is accurate, it is expected that AD-associated genes are more likely to be enriched in the top-ranked genes. Using the decile enrichment test proposed in a previous study [11], we statistically assessed whether a larger proportion of a given AD-related gene set falls into the first decile of the ranked genes. To do so, we excluded the genes in the training set, ranked the remaining genes, and split genes into 10 evenly binned deciles. Let P_net and P_random denote the proportion of a given gene set that falls into the first decile based on our prediction and random chance, respectively. We tested whether P_net was significantly larger than P_random by using the binomial test (see details in the previous work [11]).

Evaluation based on sequence similarity

Genes with similar sequences are likely to carry out similar functions. For a set of k predicted genes denoted by G_k, we evaluate its functional relationship with AD-associated genes using a sequence similarity-based score (SEQSIM-score), which measures the average similarity between predicted and known AD-associated genes. It is calculated as:

$$\text{SEQSIM-score}(G_k) = \frac{1}{k} \sum_{i=1}^{k} \max_{g_j \in G_P} \mathrm{score}(g_i, g_j) \qquad [2]$$

where G_P denotes the set of compiled positive genes, and score(g_i, g_j) is the sequence identity between a predicted gene g_i and the AD-associated gene g_j calculated using BLAST [57].
The higher the SEQSIM-score is, the more similar the predicted genes are to AD-associated genes. The SEQSIM-score was standardized to have zero mean and unit variance using the z-transform. For the top-ranked k∊[100, 200, 500] genes, the score is denoted by SEQSIM-score_observed. In the same way, we also calculated SEQSIM-score_random for a set of k randomly selected genes. We calculated 10,000 such scores from 10,000 randomly sampled gene sets. Let N_sig denote the number of random scores that are higher than SEQSIM-score_observed. We computed the p-value as N_sig/10000.

Evaluation based on coexpression with AD-associated genes

Compared to randomly selected genes, reliably predicted genes are more likely to be coregulated with AD-associated genes. Based on this hypothesis, we calculated the number of coexpressed gene pairs between top-ranked k genes and known AD-associated genes using independent gene expression data. That is, in each pair, one is a predicted gene and the other is a known AD-associated gene. The coexpression was measured with the Pearson correlation coefficient (PCC). A gene pair was considered to be coexpressed if the PCC ≥ 0.7. To test whether the coexpression is significant, we generated 10,000 gene lists, each containing k randomly sampled genes. We calculated the number of coexpressed gene pairs for the top-ranked genes and for the randomly selected genes, denoted by E_observed and E_random.
We calculated a p-value to measure whether E_observed was significantly higher than E_random.

We used the Mayo RNA-seq dataset generated by the Accelerating Medicines Partnership-Alzheimer's Disease (AMP-AD) project (publicly available at https://www.synapse.org/#!Synapse:syn2580853) for the coexpression evaluation. Note that this dataset was not used to construct the brain FGN on which the model for predicting AD-associated genes was built, so circularity was avoided. The dataset contains gene expression data of the temporal cortex from 82 cases and 80 controls. The log2-transformed Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values were used for this analysis.

Evaluation based on PPI networks

We tested whether the top-ranked k genes were more likely to interact with AD-associated genes in PPI networks. We used PPI data from the Human Reference Interactome (HuRI) [58] and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [59]. Because some PPI data were integrated to build the brain FGN, such PPIs were first removed from the two databases to avoid circularity. The interaction data in HuRI were experimentally identified. In STRING, a score measures the interaction strength between two proteins; a score > 700 indicates a high-confidence interaction. Only high-confidence interactions were considered. We tested k values in {100, 200, 500}.
For a given k, we computed the number of genes among the top-ranked k genes that interacted with at least one AD-associated gene, denoted by N_observed. Similarly, we calculated N_random, the number of genes among k randomly sampled genes that interacted with at least one AD-associated gene. Using the same method described in the previous section, a p-value was calculated to measure the significance.

Evaluation based on miRNA-target interaction networks

This analysis was motivated by the assumption that top-ranked genes are more likely to be related to AD-associated genes or miRNAs in miRNA-target interaction networks. First, we tested whether top-ranked genes and AD-associated genes share more miRNAs than randomly selected genes do. We downloaded miRNA-target interaction data from miRTarBase [23], a high-quality database of validated interactions. We computed the number of miRNAs shared between the top-ranked k ∈ {100, 200, 500} genes and AD-associated genes. Based on randomly sampled genes, we calculated a p-value to test whether the number of shared miRNAs was significant. Second, we tested top-ranked genes for their binding to AD-associated miRNAs. We retrieved AD-associated miRNAs from the Human microRNA Disease Database (HMDD) (v3.2). Similarly, for the top-ranked k genes, we calculated a p-value to measure the significance of their binding to AD-associated miRNAs.
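
The random-sampling tests used in the evaluations above share one recipe: compute a statistic for the top-ranked k genes, recompute it for many random gene sets of the same size, and report the fraction of random values that exceed the observed one. The following Python sketch illustrates that recipe under stated assumptions. The function names, the toy gene labels and the toy statistic are hypothetical and are not taken from the authors' code.

```python
import random


def empirical_p_value(observed, random_scores):
    """Return N_sig / N, where N_sig counts random scores strictly above `observed`."""
    n_sig = sum(1 for score in random_scores if score > observed)
    return n_sig / len(random_scores)


def sample_null_scores(statistic, background_genes, k, n_samples=10_000, seed=0):
    """Evaluate `statistic` on `n_samples` random gene sets of size k."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [statistic(rng.sample(background_genes, k)) for _ in range(n_samples)]


# Toy data (hypothetical): 1,000 background genes, 50 of which "interact"
# with at least one known disease gene.
background = [f"gene{i}" for i in range(1000)]
interacting = set(background[:50])


def count_interacting(genes):
    """Toy statistic: how many genes in the set are flagged as interacting."""
    return sum(1 for g in genes if g in interacting)


# A toy "top-ranked" list of k = 100 genes, 20 of which are interacting.
top_ranked = background[:20] + background[500:580]
observed = count_interacting(top_ranked)
null_scores = sample_null_scores(count_interacting, background, k=100, n_samples=1000)
p_value = empirical_p_value(observed, null_scores)
```

With 10,000 random gene sets, the smallest non-zero p-value this estimator can report is 1/10,000, so a reported value of 0 should be read as p < 1/10,000.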

Construction of AD-related regulatory networks

To analyze the regulatory relationships between the predicted candidates and AD-associated genes and to obtain hub genes [60, 61], we constructed two AD-related regulatory networks: a transcriptional regulation network and a miRNA-target interaction network.

The human transcriptional regulatory network was downloaded from the Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining (TRRUST) database [22]. The full network contains 795 transcription factors (TFs) and 2492 target genes. We extracted an AD-related transcriptional regulatory network by retaining only the TF-target gene pairs in which one node is a known or predicted AD-associated gene (among the top-ranked 200). We identified hub genes according to their outdegree or indegree.

To construct the AD-related miRNA-target interaction network, we first collected 44 AD-associated miRNAs from an up-to-date review [23]. Then, from the above-described miRTarBase [23] (version 7.0), we extracted two networks. One contains only the interactions between AD-associated miRNAs and AD-associated genes, and the other contains only the interactions between AD-associated miRNAs and predicted AD-associated genes.
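
As an illustration of the hub-gene step described above, the sketch below filters TF-target pairs to those touching an AD gene set and counts the outdegree (number of targets per TF) and indegree (number of regulators per target). It assumes that a pair is kept when at least one of its two nodes is in the AD set, which is only one possible reading of the text. The gene labels and function names are hypothetical placeholders, not TRRUST records or the authors' code.

```python
from collections import Counter


def ad_related_edges(tf_target_pairs, ad_genes):
    """Keep TF-target pairs in which the TF or the target is in `ad_genes`."""
    return [(tf, target) for tf, target in tf_target_pairs
            if tf in ad_genes or target in ad_genes]


def degree_counts(edges):
    """Return (outdegree per TF, indegree per target) for directed edges."""
    unique_edges = set(edges)  # ignore duplicated pairs
    outdegree = Counter(tf for tf, _ in unique_edges)
    indegree = Counter(target for _, target in unique_edges)
    return outdegree, indegree


# Hypothetical toy network: (TF, target) pairs.
pairs = [
    ("TF_A", "GENE_1"), ("TF_A", "GENE_2"), ("TF_A", "GENE_3"),
    ("TF_B", "GENE_1"), ("TF_C", "GENE_4"),
]
ad_genes = {"TF_A", "GENE_1"}  # known or predicted AD-associated genes (toy)

edges = ad_related_edges(pairs, ad_genes)
outdegree, indegree = degree_counts(edges)
top_tf_hubs = outdegree.most_common(1)  # TF with the most retained targets
```

Ranking by outdegree highlights TFs that regulate many AD-related genes, while ranking by indegree highlights targets controlled by many regulators.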

Identification of gene modules in the integrated network

To better understand the functions of the predicted genes, we constructed an integrated network by aggregating evidence from the brain FGN, PPI, coexpression network, miRNA-target network and transcriptional regulatory network. This network included the top-ranked 200 genes and the compiled 147 AD-associated genes. Two genes were connected by an edge if they were direct neighbors in any of the networks above. In detail, all TF-target interactions satisfying this condition were extracted from the transcriptional regulatory network in the TRRUST database [22]. We also included the genes with a CFP ≥ 0.7, and then expanded the resulting network by including other genes that have a CFP ≥ 0.95 with at least one known AD-associated gene. From the gene coexpression network, we retained only edges with PCCs higher than 0.7. From the PPI network, we included gene pairs whose encoded proteins interact in HuRI or STRING. For the miRNA-target interaction data, we computed a network in which the weight of the edge between two genes was calculated as w = N_share/N_max, where N_share is the number of miRNAs shared by the two genes and N_max = max(N_1, N_2), with N_1 and N_2 denoting the numbers of miRNAs binding to the two genes, respectively. The weight w ranges from 0 to 1. Only interactions with w ≥ 0.3 were considered. By applying the GLay algorithm implemented in Cytoscape [44] to the integrated network, we identified gene modules within which genes were closely connected.
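
The miRNA-sharing weight defined above, w = N_share/N_max, can be computed directly from each gene's set of targeting miRNAs. The sketch below is a minimal illustration with made-up gene and miRNA labels. The function names are hypothetical, and returning 0 when neither gene has any miRNA is an assumption added here to avoid division by zero.

```python
from itertools import combinations


def mirna_share_weight(mirnas_a, mirnas_b):
    """w = N_share / max(N_1, N_2) for the miRNA sets of two genes."""
    a, b = set(mirnas_a), set(mirnas_b)
    n_max = max(len(a), len(b))
    if n_max == 0:
        return 0.0  # assumption: genes without miRNAs get weight 0
    return len(a & b) / n_max


def mirna_edges(gene_to_mirnas, threshold=0.3):
    """Return {(gene1, gene2): w} for all gene pairs with w >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(gene_to_mirnas), 2):
        w = mirna_share_weight(gene_to_mirnas[g1], gene_to_mirnas[g2])
        if w >= threshold:
            edges[(g1, g2)] = w
    return edges


# Hypothetical toy input: gene -> miRNAs that bind it.
gene_to_mirnas = {
    "GENE_1": {"miR-a", "miR-b", "miR-c"},
    "GENE_2": {"miR-b", "miR-c", "miR-d", "miR-e"},
    "GENE_3": {"miR-f"},
}
edges = mirna_edges(gene_to_mirnas)
```

Dividing by the larger of the two miRNA counts means that a gene bound by many miRNAs does not receive a high weight with every other gene merely because it overlaps a little with each of them.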

The independent Mount Sinai Brain Bank (MSBB) dataset with AD-related neuropathological and clinical traits

We obtained an independent dataset with AD-related neuropathological and clinical traits from the MSBB study [62]. We used the data from Brodmann area 36 (parahippocampal gyrus), which is one of the regions most vulnerable to AD [63]. This dataset contains gene expression data from 215 donors for whom AD-related phenotypes are also available. These phenotypes include the neuritic plaque density assessed by the CERAD score, neurofibrillary tangle severity assessed by the Braak score, and severity of dementia assessed by the CDR score. The dataset contains 23021 genes measured for the 215 individuals and is available at the AMP-AD portal (https://www.synapse.org/#!Synapse:syn3159438). For each gene, its PCC with the CERAD, Braak and CDR scores was calculated.

Based on the CERAD score, we extracted control and AD samples using the criteria provided at https://www.synapse.org/#!Synapse:syn6101474. Based on the Braak score, we followed the practice in ref. [63] and divided samples into three groups with scores in the ranges [0, 2], [3, 4] and [5, 6], representing different levels of tau pathology. Based on the CDR score, the samples were partitioned into three groups with scores in the ranges [0], [0.5, 2] and [3, 5], in the same way as in ref. [63], representing different degrees of severity of clinical dementia.

Brain eQTL and ATAC-seq data

We identified potentially causal regulatory variants by integrating eQTL and ATAC-seq data to test whether an eQTL for a target gene also resides in a transcription factor binding site (TFBS) in the gene's promoter. Both gene- and isoform-expression eQTLs were considered.
We obtained brain gene eQTLs from GTEx (version v8), PsychENCODE (http://resource.psychencode.org/) and the CommonMind Consortium (https://www.synapse.org/#!Synapse:syn4622659). The latter two resources contain isoform eQTLs, which were also used. We used active promoters from the human brain ATAC-seq peak data in the BOCA database [64]. We identified TFBSs in these promoters using the FIMO tool [65], with the transcription factor binding motifs in the HOCOMOCO database (version 11) as reference.

Data Availability

All accession codes, unique identifiers, or web links for publicly available datasets are described in the paper. All data supporting the findings of the current study are listed in Supplementary Tables 1-7, Supplementary Figures 1-11, and our web interface (www.alzlink.com).

Code Availability

The code for model development is publicly available at https://github.com/genemine/alzlink.

References

1. Calsolaro V, Antognoli R, Okoye C, Monzani F. The use of antipsychotic drugs for treating behavioral symptoms in Alzheimer's Disease. Front Pharmacol 10, 1465 (2019).
2. Fredericks CA, et al. Early affective changes and increased connectivity in preclinical Alzheimer's disease. Alzheimers Dement (Amst) 10, 471–479 (2018).
3. Giri M, Shah A, Upreti B, Rai JC. Unraveling the genes implicated in Alzheimer's disease. Biomed Rep 7, 105–114 (2017).
4. Sims R, Hill M, Williams J. The multiplex model of the genetics of Alzheimer's disease.
Nat Neurosci 23, 311–322 (2020).
5. Mostafavi S, et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer's disease. Nat Neurosci 21, 811–819 (2018).
6. Johnson ECB, et al. Large-scale proteomic analysis of Alzheimer's disease brain and cerebrospinal fluid reveals early changes in energy metabolism associated with microglia and astrocyte activation. Nat Med 26, 769–780 (2020).
7. Ridge PG, Mukherjee S, Crane PK, Kauwe JS, Alzheimer's Disease Genetics Consortium. Alzheimer's disease: analyzing the missing heritability. PLoS ONE 8, e79771 (2013).
8. Cuyvers E, Sleegers K. Genetic variations underlying Alzheimer's disease: evidence from genome-wide association studies and beyond. Lancet Neurol 15, 857–868 (2016).
9. Ridge PG, et al. Assessment of the genetic variance of late-onset Alzheimer's disease. Neurobiol Aging 41, 200.e213–200.e220 (2016).
10. Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLoS Comput Biol 4, e1000165 (2008).
11. Krishnan A, et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat Neurosci 19, 1454–1462 (2016).
12. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 100, 8348–8353 (2003).
13. Guan Y, Ackert-Bicknell CL, Kell B, Troyanskaya OG, Hibbs MA. Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Comput Biol 6, e1000991 (2010).
14. Recla JM, Robledo RF, Gatti DM, Bult CJ, Churchill GA, Chesler EJ. Precise genetic mapping and integrative bioinformatics in Diversity Outbred mice reveals Hydin as a novel pain gene. Mamm Genome 25, 211–222 (2014).
15. Greene CS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet 47, 569–576 (2015).
16. Li H-D, Bai T, Sandford E, Burmeister M, Guan Y. BaiHui: cross-species brain-specific network built with hundreds of hand-curated datasets. Bioinformatics 35, 2486–2488 (2019).
17. Bai B, et al. Deep Multilayer Brain Proteomics Identifies Molecular Networks in Alzheimer's Disease Progression. Neuron 105, 975–991.e977 (2020).
18. Duda M, Zhang H, Li HD, Wall DP, Burmeister M, Guan Y. Brain-specific functional relationship networks inform autism spectrum disorder gene prediction. Transl Psychiatry 8, 56 (2018).
19. Mi H, Muruganujan A, Ebert D, Huang X, Thomas PD. PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic Acids Res 47, D419–D426 (2018).
20. Allen M, et al. Human whole genome genotype and transcriptome data for Alzheimer's and other neurodegenerative diseases. Sci Data 3, 160089 (2016).
21. Wang M, Qin L, Tang B. MicroRNAs in Alzheimer's Disease. Front Genet 10, 153 (2019).
22. Han H, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res 46, D380–D386 (2017).
23. Chou CH, et al. miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions. Nucleic Acids Res 46, D296–D302 (2018).
24. Kaltschmidt B, Kaltschmidt C. NF-kappaB in the nervous system. Cold Spring Harb Perspect Biol 1, a001271 (2009).
25. Pizzi M, et al. NF-kappaB factor c-Rel mediates neuroprotection elicited by mGlu5 receptor agonists against amyloid beta-peptide toxicity. Cell Death Differ 12, 761–772 (2005).
26. Vukic V, et al. Expression of inflammatory genes induced by beta-amyloid peptides in human brain endothelial cells and in Alzheimer's brain is mediated by the JNK-AP1 signaling pathway. Neurobiol Dis 34, 95–106 (2009).
27. Kim H, et al. Overexpression of cell cycle proteins of peripheral lymphocytes in patients with Alzheimer's disease. Psychiatry Investig 13, 127–134 (2016).
28. Scacchi R, Gambina G, Moretto G, Corbo RM. P21 gene variation and late-onset Alzheimer's disease in the Italian population. Dement Geriatr Cogn Disord 35, 51–57 (2013).
29. Marathe S, Liu S, Brai E, Kaczarowski M, Alberi L. Notch signaling in response to excitotoxicity induces neurodegeneration via erroneous cell cycle reentry. Cell Death Differ 22, 1775–1784 (2015).
30. Supek F, Bosnjak M, Skunca N, Smuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE 6, e21800 (2011).
31. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
32. Canchi S, et al. Integrating Gene and Protein Expression Reveals Perturbed Functional Networks in Alzheimer's Disease. Cell Rep 28, 1103–1116.e1104 (2019).
33. McKenzie AT, et al. Brain Cell Type Specific Gene Expression and Co-expression Network Architectures. Sci Rep 8, 8868 (2018).
34. Liang WS, et al. Altered neuronal gene expression in brain regions differentially affected by Alzheimer's disease: a reference data set. Physiol Genomics 33, 240–256 (2008).
35. Liu J, Li M, Lan W, Wu F, Pan Y, Wang J. Classification of Alzheimer's disease using whole brain hierarchical network. IEEE/ACM Trans Comput Biol Bioinform 15, 624–632 (2018).
36. Cummings J, Feldman HH, Scheltens P. The "rights" of precision drug development for Alzheimer's disease. Alzheimer's Res Ther 11, 76 (2019).
37. Lambert J-C, et al. Genome-wide association study identifies variants at CLU and CR1 associated with Alzheimer's disease. Nat Genet 41, 1094 (2009).
38. Yao V, et al. An integrative tissue-network approach to identify and test human disease genes. Nat Biotechnol 36, 1091–1099 (2018).
39. Li H-D, Menon R, Omenn GS, Guan Y. The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet 30, 340–347 (2014).
40. Baralle FE, Giudice J. Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol 18, 437–451 (2017).
41. Barrett CL, DeBoever C, Jepsen K, Saenz CC, Carson DA, Frazer KA. Systematic transcriptome analysis reveals tumor-specific isoforms for ovarian cancer diagnosis and therapy. Proc Natl Acad Sci USA 112, E3050–E3057 (2015).
42. Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat Mach Intell 1, 191–198 (2019).
43. Zheng R, Li M, Liang Z, Wu F-X, Pan Y, Wang J. SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics 35, 3642–3650 (2019).
44. Cao J, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
45. Bertram L, McQueen MB, Mullin K, Blacker D, Tanzi RE. Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet 39, 17–23 (2007).
46. Bai Z, et al. AlzBase: an integrative database for gene dysregulation in Alzheimer's disease. Mol Neurobiol 53, 310–319 (2016).
47. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33, D514–D517 (2005).
48. Pinero J, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford) 2015, bav028 (2015).
49. Palleja A, Horn H, Eliasson S, Jensen LJ. DistiLD Database: diseases and traits in linkage disequilibrium blocks. Nucleic Acids Res 40, D1036–D1040 (2012).
50. Wu CH, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34, D187–D191 (2006).
51. Carvalho-Silva D, et al. Open Targets Platform: new developments and updates two years on. Nucleic Acids Res 47, D1056–D1065 (2018).
52. Buniello A, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 47, D1005–D1012 (2018).
53. Xie A, Gao J, Xu L, Meng D. Shared mechanisms of neurodegeneration in Alzheimer's disease and Parkinson's disease. Biomed Res Int 2014, 648740 (2014).
54. Dubois B. The emergence of a new conceptual framework for Alzheimer's disease. J Alzheimers Dis 62, 1059–1066 (2018).
55. Young AMH, et al. A map of transcriptional heterogeneity and regulatory variation in human microglia. bioRxiv doi: https://doi.org/10.1101/2019.12.20.874099 (2019).
56. Tansey KE, Cameron D, Hill MJ. Genetic risk for Alzheimer's disease is concentrated in specific macrophage and microglial transcriptional networks. Genome Med 10, 14 (2018).
57. McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32, W20–W25 (2004).
58. Luck K, et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).
59. Szklarczyk D, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43, D447–D452 (2015).
60. Wang M, et al. Molecular networks and key regulators of the dysregulated neuronal system in Alzheimer's Disease. bioRxiv doi: https://doi.org/10.1101/788323 (2019).
61. Scelsi MA, Napolioni V, Greicius MD, Altmann A. Network propagation of rare mutations in Alzheimer's disease reveals tissue-specific hub genes and communities. bioRxiv doi: https://doi.org/10.1101/781203 (2019).
62. Wang M, et al. The Mount Sinai cohort of large-scale genomic, transcriptomic and proteomic data in Alzheimer's disease. Sci Data 5, 180185 (2018).
63. Wang M, et al. Integrative network analysis of nineteen brain regions identifies molecular signatures and networks underlying selective regional vulnerability to Alzheimer's disease. Genome Med 8, 104 (2016).
64. Fullard JF, et al. An atlas of chromatin accessibility in the adult human brain. Genome Res 28, 1243–1252 (2018).
65. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
66. Hooper C, Meimaridou E, Tavassoli M, Melino G, Lovestone S, Killick R. p53 is upregulated in Alzheimer's disease and induces tau phosphorylation in HEK293a cells. Neurosci Lett 418, 34–37 (2007).
67. Qin W, et al. Neuronal SIRT1 activation as a novel mechanism underlying the prevention of Alzheimer disease amyloid neuropathology by calorie restriction. J Biol Chem 281, 21745–21754 (2006).
68. Feio dos Santos AC, et al. Decrease of PTEN expression levels among normal, symptomatic and asymptomatic Alzheimer's disease (Ad) subjects, measured in hippocampus, temporal and entorhinal cortices.
Alzheimers Dement 7, S701 (2011).
69. Sonoda Y, et al. Accumulation of tumor-suppressor PTEN in Alzheimer neurofibrillary tangles. Neurosci Lett 471, 20–24 (2010).

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2018YFC0910504), the National Natural Science Foundation of China (No. U1909208, 61772552, 61772557), the 111 Project (No. B18059), and the Hunan Provincial Science and Technology Program (2018WK4001).

The results published here are in part based on data obtained from the AMP-AD Knowledge Portal (https://adknowledgeportal.synapse.org/). The Mayo RNA-seq data were provided by the following sources: The Mayo Clinic Alzheimer's Disease Genetic Studies, led by Dr. Nilufer Ertekin-Taner and Dr. Steven G. Younkin, Mayo Clinic, Jacksonville, FL using samples from the Mayo Clinic Study of Aging, the Mayo Clinic Alzheimer's Disease Research Center, and the Mayo Clinic Brain Bank. Data collection was supported through funding by NIA grants P50 AG016574, R01 AG032990, U01 AG046139, R01 AG018023, U01 AG006576, U01 AG006786, R01 AG025711, R01 AG017216, R01 AG003949, NINDS grant R01 NS080820, CurePSP Foundation, and support from Mayo Foundation. Study data includes samples collected through the Sun Health Research Institute Brain and Body Donation Program of Sun City, Arizona.
The Brain and Body Donation Program is supported by the National Institute of Neurological Disorders and Stroke (U24 NS072026 National Brain and Tissue Resource for Parkinson's Disease and Related Disorders), the National Institute on Aging (P30 AG19610 Arizona Alzheimer's Disease Core Center), the Arizona Department of Health Services (contract 211002, Arizona Alzheimer's Research Center), the Arizona Biomedical Research Commission (contracts 4001, 0011, 05-901 and 1001 to the Arizona Parkinson's Disease Consortium) and the Michael J. Fox Foundation for Parkinson's Research. The MSBB data were generated from postmortem brain tissue collected through the Mount Sinai VA Medical Center Brain Bank and were provided by Dr. Eric Schadt from Mount Sinai School of Medicine.

Author contributions

C.X.L., H.D.L. and W.S.L. developed the statistical method, performed the analysis, and wrote the manuscript. D.C. and C.X.L. developed the web interface. X.M.Z., J.W., F.X.W. and D.W. provided instructions on the analysis. J.X.W. conceived and supervised the research and contributed to the manuscript.

Additional information

Supplementary Information accompanies this paper at http://www.nature.com/naturecommunications.

Competing financial interests: The authors declare no competing financial interests.

Supplementary information

Supplementary Notes

Supplementary Note 1. Description for compiling AD-associated genes.

Supplementary Figures

Supplementary Fig. 1. Comparison of model performance for two methods of selecting negative (non-AD) genes.
Supplementary Fig. 2. Comparison of the negative controls and randomly selected genes based on their association with AD.
Supplementary Fig. 3. Performance of different brain-region networks based on Random Forest (RF).
Supplementary Fig. 4. Performance of different brain-region networks based on support vector machines (SVM).
Supplementary Fig. 5. Performance of different brain-region networks based on logistic regression (LR).
Supplementary Fig. 6. Validation of the top-ranked genes based on sequence similarity with AD-associated genes.
Supplementary Fig. 7. Validation of the top-ranked genes based on their coexpression with known AD-associated genes.
Supplementary Fig. 8. Validation of the top-ranked genes based on protein-protein interaction networks in the STRING and HuRI databases.
Supplementary Fig. 9. Validation of the top-ranked genes based on miRNA-target binding networks.
Supplementary Fig. 10. The correlation with three AD traits of the eigengenes of modules 3 and 4.
Supplementary Fig. 11. The correlation with three AD traits of the eigengenes of the top-ranked genes.

Supplementary Tables

Supplementary Table 1. The top-ranked genes (excluding training set) that are likely associated with AD based on literature.
Supplementary Table 2. The top ten shared GO terms of the 147 AD-associated genes with the top 147 predicted genes.
Supplementary Table 3. Gene modules identified from the integrated gene interaction network.
Supplementary Table 4. The correlation of 84 genes with CERAD, Braak Score and CDR on the MSBB data.
Supplementary Table 5. The seven types of functional evidence for the selected 36 genes.
Supplementary Table 6. The 14 genes with cell type specific expression.
Supplementary Table 7. The seven genes with eQTLs located in the transcription factor binding site in the promoter region.

Figure captions

Fig. 1 Overview of the method for genome-wide prediction of AD-associated genes and their functional characterization. A Selection of AD-associated genes. 147 AD-associated genes were compiled from various resources, including OMIM, DisGeNET, UniProt, DistiLD, AlzBase, AlzGene, literature, Open Targets, ROSMAP-DEG and the GWAS Catalog. Genes present in at least two resources were selected. The AD-associated genes, as well as potential positive genes inferred with a functional enrichment method, were then removed from the full set of all human genes. The remaining genes were treated as non-AD genes (negatives). B Brain-specific functional gene networks (FGNs) were used for feature matrix construction. For each gene, its cofunction probabilities with the 147 positive genes in the network were extracted as features.
Thus, each gene was characterized by a 147-1003 dimensional vector. C Selection of brain FGNs. We compared the ten networks collected for their predictivity 1004 of AD-associated genes with machine learning approaches. An optimal network was selected. D Validation. 1005 Predicted AD-associated genes were validated by AD-related pathways and various gene networks, 1006 including coexpression networks, protein-protein interaction networks, miRNA-target binding networks, 1007 transcriptional regulatory networks. E Functional implication in AD. The associations of the top predicted 1008 genes with AD-related phenotypes were evaluated. Gene modules from an AD-related network were 1009 identified. 1010 1011 Fig. 2 Model performance and statistical evaluation based on AD-related pathways and various gene 1012 networks. A Comparison of ExtraTrees models built from different functional gene networks in terms of 1013 AUROC and AUPRC based on cross-validation. B Enrichment of the genes ranked in the first decile in 1014 the four AD-associated gene sets or pathways with the decile enrichment test (described in Methods). C 1015 Validation of the top-ranked genes based on their sequence similarity, the number of shared miRNAs, the 1016 number of AD-associated miRNAs they can bind to, the number of coexpressed gene pairs, the number 1017 of interactions with AD-associated genes in HuRI and STRING. In all the subplots, the red vertical line 1018 and the distribution in yellow indicate the results for our top-ranked genes and randomly selected genes, 1019 respectively. 1020 1021 1022 Fig. 3 AD-related regulatory networks. A Transcriptional regulatory network including our compiled AD-1023 associated genes and the top-ranked genes. B The interaction network between predicted genes and 1024 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 
It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.09.430536doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430536 http://creativecommons.org/licenses/by-nc/4.0/ 40 AD-relevant miRNAs. C The interaction network between the compiled AD-associated genes and AD-1025 relevant miRNAs. 1026 1027 Fig. 4 Gene modules and their association with AD traits. The network was built by aggregating the 1028 evidence from the protein-protein interaction network, coexpression network, miRNA-gene binding 1029 network, transcriptional regulatory network and the brain FGN. This network contains the top-ranked 200 1030 genes and the 147 compiled AD-associated genes. A Four gene modules, denoted by M1, M2, M3 and 1031 M4, were identified by applying the GLay algorithm to the integrated network in Cytoscape. B The 1032 association of M1 and M2 with the three AD-related phenotypes (the CERAD, Braak and CDR score) was 1033 assessed. The results for all the tests were significant (FDR < 0.05). 1034 1035 Fig. 5. Visualization of functional evidence supporting the association of the top-ranked 200 genes with 1036 AD. The seven circles show the strength of the seven types of evidence, including the three molecular 1037 interaction evidence (the number of interacting AD-associated genes in PPI, coexpression network and 1038 miRNA-target binding network, respectively) and the four phenotypic correlation evidence (the Pearson 1039 correlation with CERAD, Braak and CDR on the MSBB dataset, and the log2-transformed fold change of 1040 expression obtained from the ROSMAP study). The darker the purple color is, the stronger the functional 1041 association is. The section corresponding to the blue arc shows the enriched GO biological process 1042 terms, where each curve points the gene annotated to the term. 1043 1044 Fig. 
6 Illustration of the association of the top-ranked individual genes with AD-related phenotypes and the potential regulatory variant of the gene. A Comparison of the expression of individual genes in different sample groups. The samples were divided into groups based on the CERAD, Braak or CDR score. The comparison for FYN and PRKAR1A is shown. B Potential regulatory SNPs that may regulate the expression. For FYN, the SNP rs61202914 not only resides in the TFBS within its promoter region but is also an eQTL (upper); the SNP rs8080306 is located in the TFBS and is also an eQTL for PRKAR1A.

Tables and Figures
Table 1. Hub genes (after excluding known AD-associated genes) measured with the outdegree and indegree in the AD-related transcriptional regulatory network (TRN) and with the degree in miRNA-based regulatory networks (MRN).

| Hub gene | Gene type | Outdegree, indegree in AD-related TRN | Degree in AD-related MRN | Association with AD |
|---|---|---|---|---|
| RELA/NFKB3 | oncogenic TF | 45, 5 | 2 | RELA is associated with learning and memory [24, 25] |
| JUN/AP-1 | oncogenic TF | 38, 11 | 5 | AP1 signaling pathway is responsible for Aβ-induced neuroinflammation [26] |
| TP53/p53 | TF, tumor suppressor gene | 24, 7 | 8 | TP53 was overexpressed in AD and involved in tau phosphorylation [66] |
| SIRT1 | TF | 14, 3 | 8 | SIRT1 is associated with the production of Aβ [67] |
| CCND1 | oncogene | 1, 18 | 16 | CCND1 knockout protects against neurodegeneration in hippocampus [29] |
| CDKN1A/P21 | oncogene | 0, 24 | 15 | Increased expression [28] |
| PTEN | tumor suppressor gene | 0, 4 | 14 | Recruitment of PTEN into synapses contributed to synaptic depression in AD [68, 69] |

[Fig. 4, text recovered from the figure panels. A Module labels M1, M2, M3 and M4 with GO term annotations: learning or memory; regulation of ion transmembrane transport; cognition; regulation of synaptic plasticity; regulation of neuron death; regulation of lipid transport; cholesterol efflux; regulation of amyloid-beta formation; positive regulation of cytokine production; regulation of peptidyl-lysine acetylation; response to amyloid-beta; regulation of phosphorylation; regulation of cell death; immune system process; regulation of neurogenesis; inflammatory response; gliogenesis. B Plots of module eigengene (y-axis) against CERAD, Braak and CDR score (x-axes), in groups labeled M1 and M2.]

10_1101-2021_02_10_430367 ---- 9691071 Genome Warehouse: A Public Repository Housing Genome-scale Data

Meili Chen1,2,#, Yingke Ma1,2,#, Song Wu1,2,3, Xinchang Zheng1,2, Hongen Kang1,2,3, Jian Sang1,2,3,†, Xingjian Xu1,2,3,††, Lili Hao1,2, Zhaohua Li1,2,3, Zheng Gong1,2,3, Jingfa Xiao1,2,3, Zhang Zhang1,2,3, Wenming Zhao1,2,3, Yiming Bao1,2,3,*

1 National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, Beijing 100101, China
2 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
3 University of Chinese Academy of Sciences, Beijing 100049, China

# Equal contribution.
* Corresponding author. E-mail: baoym@big.ac.cn (Bao Y).
† Current address: Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
†† Current address: College of Computer Science Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010010, China

Running title: Chen M et al / Genome Assembly Data Repository

Total letter counts (Title): 63
Total letter counts (Running title): 46
Total word counts (Abstract): 193
Total keywords: 5
Total word counts (from “Introduction” to “Conclusions” or “Materials and methods”): 1799
Total figures: 3
Total tables: 1
Total supplementary figures: 0
Total supplementary tables: 0
Total supplementary files: 0

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430367; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract
The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/), GWH accepts both full genome and partial genome (chloroplast, mitochondrion, and plasmid) sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata, including biological project, sample, and genome assembly information, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By December 2020, GWH has received 17,264 direct submissions covering a diversity of 949 species, and has released 3370 of them.
Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://bigd.big.ac.cn/gwh/.

KEYWORDS: Genome submission; Genome sequence; Genome annotation; Genome warehouse; Quality control

Introduction
Genome sequences and annotations are fundamental information for a wide range of genome-related studies, including analyses of various omics data such as genome [1], transcriptome [2], epigenome [3,4], and genome variation [5,6] data. China, as one of the most biodiverse countries in the world, harbors more than 10% of the world’s known species [7]. In the past decades, the genomes of a large number of featured and important animals and crops in China have been sequenced and assembled [1, 8–11], and most of these assemblies were submitted to International Nucleotide Sequence Database Collaboration (INSDC) members (National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI), and DNA Data Bank of Japan (DDBJ)) [12]. As genome assembly data grow rapidly, researchers, in China for example, are hindered from efficiently submitting their data to INSDC members by large genome data sizes, slow data transfer rates caused by limited international network bandwidth, and language barriers when communicating about technical issues. All these call for a centralized genomic data repository within China to complement the INSDC.
Here, we report the Genome Warehouse (GWH, https://bigd.big.ac.cn/gwh/), a centralized resource housing genome assembly data and delivering a series of genome data services. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/) [13], GWH aims to accept data submissions worldwide and to provide an important resource for genome data quality control, archiving, rapid release, and public sharing (e.g., with INSDC) in support of research activities all over the world. To date, GWH has received a total of 12,366 genome submissions (including 14 international submissions), demonstrating its increasingly important role in global genome data management and sharing.

Data model
For compatibility with the INSDC data model, each genome assembly in GWH is linked to a BioProject (https://bigd.big.ac.cn/bioproject) and a BioSample (https://bigd.big.ac.cn/biosample), which are two fundamental resources for metadata description in CNCB-NGDC. Full or partial (chloroplast, mitochondrion, and plasmid) genome assemblies with different assembly levels (complete, draft in chromosome, scaffold, and contig) are all acceptable, and existing genome assemblies are allowed to be updated.
Accession numbers are assigned according to the following rules (Figure 1):
(1) each genome assembly has an accession number prefixed with "GWH", followed by four capital letters and eight zeros (e.g., GWHAAAA00000000);
(2) genome sequences have the same accession number format as their corresponding genome assembly, except that the eight digits start from 00000001 and increase in order (e.g., GWHAAAA00000001);
(3) genes have an accession pattern similar to that of genome sequences, with the letter "G" added between the GWH prefix and the four capital letters, and six digits at the end instead of eight (e.g., GWHGAAAA000001);
(4) transcripts use the letter "T" in place of "G" in the gene accession format (e.g., GWHTAAAA000001);
(5) proteins use the letter "P" in place of "G" in the gene accession format (e.g., GWHPAAAA000001);
(6) if the submission is an update of an existing submission in GWH, a dot and an incremental number are appended to represent the version (e.g., GWHAAAA00000000.1).

Database components

GWH is a centralized resource housing genome-scale data, with the purpose of archiving high-quality genome sequences and annotation information. GWH is equipped with a series of web services for genome data submission, release, and sharing, accordingly involving three major components, namely, data submission, quality control, and archive and release (Figure 2).

Data submission

GWH not only accepts genome assembly associated data through an on-line submission system but also allows off-line batch submissions. Users need to register first and then provide a complete description of the submitted genome sequences. Biological project and sample information should be provided (through BioProject and BioSample, respectively) together with genome assembly sequence, annotation, and associated metadata.
Metadata mainly consist of a variety of information about the submitter, general assembly, file(s), sequence assignment, and publication (if available). After submission, GWH runs an automated quality control pipeline to check the validity and consistency of the submitted genome sequence and genome annotation files. Accession numbers are assigned to assemblies and sequences once they pass quality control. Updated assembly data can also be submitted to GWH. It should be noted that, consistent with the INSDC members (e.g., NCBI GenBank), it is the responsibility of the submitters to ensure data quality, completeness, and consistency, and GWH does not warrant or assume any legal liability or responsibility for the data accuracy.

Quality control

After metadata and file(s) are received, GWH automatically runs standardized quality control (QC) to check 45 different types of errors in submitted genome sequences and annotations, and to scan for contaminated genome sequences if needed (see details at https://bigd.big.ac.cn/gwh/documents) (Figure 2). The QC roughly falls into 5 steps:
(1) The component will check the consistency of file(s) according to filename and md5 code.
(2) For genome sequences, the component will check the legality of genome sequence ID and sequence content, e.g., unique sequence ID, sequence composition (A/T/C/G or degenerate base), sequence length (≥ 200 bp).
(3) For genome annotations, the component will check gene structure completeness and consistency, e.g., unique ID, an exon/CDS/UTR coordinate falling within the corresponding gene coordinate, strand consistency for all features (including gene/transcript/exon/CDS/UTR), codon validity (e.g., valid start/stop codon, no internal stop codon).
(4) It will then check the internal consistency of genome sequence and annotation, e.g., sequence ID in genome annotation must match genome sequence ID, a feature coordinate falling within the range of the corresponding genome sequence.
(5) Finally, genome sequences will be scanned for vectors, adaptors, primers, and indices (collected from the UniVec database, ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/) using NCBI’s VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/).
If there is an error, a report will be automatically sent to the submitter by email. To finish a successful submission, the submitter needs to fix all errors and resubmit files until they pass the QC process.

Archive and release

GWH will assign a unique accession number to the submitted genome assembly once it passes quality control, allot accession numbers to each genome sequence, gene, transcript, and protein, and generate and back up downloadable files of genome sequence and annotation in FASTA, GFF3, and TSV formats. Data generation is performed with in-house scripts based on the submitted genome sequence and annotation files.
In order to ensure the security of submitted data, a backup copy is stored on a physically separate disk. GWH will release sequence data on a user-specified date, unless a paper citing the sequence or accession number is published prior to the specified release date, in which case the sequence will be released immediately. For the released data, GWH will generate web pages containing two primary tables: genome and assembly. The former shows species taxonomy information and genome assemblies, and the latter contains general information on the assembly (including external links to other related resources) and statistics of the genome assembly and its corresponding annotation. All released data are publicly available at the GWH FTP site (ftp://download.big.ac.cn/gwh/). GWH provides data visualization for both genome sequence and genome annotation using JBrowse [14]. It offers statistics and charts on total holdings, assembly levels, genome representations, citing articles, submitting organizations, sequencing platforms, assembly methods, and downloads. GWH provides user-friendly web interfaces for data browsing and querying using BIG Search [13], in order to help users find any released data of interest. For a released genome assembly, GWH also provides machine-readable APIs (Application Programming Interfaces) for publicly sharing and automatically obtaining information on its associated BioProject, BioSample, genome, and assembly metadata and file paths.
Global sharing of SARS-CoV-2 and coronavirus genomes

During the COVID-19 outbreak, GWH, in support of the 2019 Novel Coronavirus Resource (2019nCoVR) [15, 16], has received worldwide submissions of more than a thousand SARS-CoV-2 genome assemblies with standardized genome annotations [17], and has released 134 of them. To expand the international influence of the data, 62 of the released sequences have been shared, with the submitters’ permission, in GenBank [18] through a data exchange mechanism established with NCBI. In this model, GWH accessions are represented as secondary accessions in NCBI GenBank records, which are retrievable by the NCBI Entrez system. This model sets a good example for data sharing among different data centers.

In addition, GWH offers sequences of the Coronaviridae family so that researchers can conveniently access the data and study the relationship between SARS-CoV-2 and other coronaviruses. To promote data sharing and make all relevant information on the Coronaviridae readily available, GWH integrates genomic and proteomic sequences as well as their metadata from NCBI [19], China National GeneBank Database (CNGBdb) [20], National Microbiology Data Center (NMDC) [21], and CNCB-NGDC. Duplicated records from different sources are identified and removed to obtain a non-redundant dataset. As of December 31, 2020, the dataset has 83,095 nucleotide and 575,438 protein sequences of the Coronaviridae. Filters are implemented to narrow down the required Coronaviridae sequences using multiple conditions, including country/region, host, isolation source, length, and collection date. Both the metadata and sequences of the filtered results can be selected and downloaded as a separate file.
The daily updated sequences and all sequences can also be downloaded from FTP (ftp://download.big.ac.cn/Genome/Viruses/Coronaviridae/).

Data statistics

By December 2020, GWH had received 17,264 direct submissions covering a broad diversity of species (Table 1) with different assembly levels (Figure 3). These genome assemblies link to 301 BioProjects and 16,538 BioSamples, and were submitted by 231 submitters from 61 institutions (including 5 international submitters from 2 countries). There are a total of 3370 released submissions, which were reported in 83 articles from 44 journals. GWH has over 135,000 visits from 153 countries/regions, with ~891,000 downloads. The amount of data, visits, and downloads in GWH has increased dramatically over the past years, clearly showing its great utility in genome-scale data management.

Summary and future directions

Collectively, GWH is a user-friendly portal for genome data submission, release, and sharing, associated with a matched series of services. The rapid growth of genome assembly submissions demonstrates the great potential of GWH as an important resource for accelerating worldwide genomic research.
With the aim of fully realizing the findability, accessibility, interoperability, and reusability (FAIR) of genome data [22], GWH has made ongoing efforts, including but not limited to improvement of web interfaces for data submission, presentation, and visualization, continuous integration of newly sequenced genomes, and development of useful online tools to help users analyse genome data (such as BLAST [23]). Going forward, we will put more effort into providing genome annotation services, especially for bacterial and archaeal genomes, with the particular consideration that uniform standardized annotation determines the accuracy of downstream data analysis. Besides, we will expand the Coronaviridae dataset to other important pathogens to improve the ability to respond to public health emergencies. Finally, we plan to share and exchange all public genome assembly data with the INSDC members to provide comprehensive data for researchers globally.

CRediT author statement

Meili Chen: Methodology, Software, Investigation, Data Curation, Writing - Original Draft, Project administration. Yingke Ma: Software, Writing - Original Draft. Song Wu: Software, Data Curation. Xinchang Zheng: Data Curation. Hongen Kang: Software. Jian Sang: Investigation, Data Curation. Xingjian Xu: Software. Lili Hao: Investigation. Zhaohua Li: Data Curation. Zheng Gong: Data Curation. Jingfa Xiao: Writing - Review & Editing. Zhang Zhang: Writing - Review & Editing.
Wenming Zhao: Writing - Review & Editing. Yiming Bao: Conceptualization, Writing - Review & Editing, Supervision.

Competing interests

The authors have declared no competing interests.

Acknowledgments

We thank Profs. Jingchu Luo and Weimin Zhu for their helpful suggestions and a number of users for reporting bugs and sending comments. We also thank the NCBI GenBank group, especially Ilene Mizrachi, Karen Clark, Mark Cavanaugh, and Linda Yankie, for their valuable advice on sequence contamination scanning and SARS-CoV-2 sequence exchange. This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences [XDB38060100 and XDB38030200 to YB; XDB38050300 to WZ; XDB38030400 to JX; XDA19050302 to ZZ]; National Key Research and Development Program of China [2016YFE0206600 to YB; 2020YFC0847000, 2018YFD1000505, 2017YFC1201202, and 2016YFC0901603 to WZ; 2017YFC0907502 to ZZ]; The 13th Five-year Informatization Plan of Chinese Academy of Sciences [XXH13505-05 to YB]; Genomics Data Center Construction of Chinese Academy of Sciences [XXH-13514-0202 to YB]; Open Biodiversity and Health Big Data Initiative of IUBS [to YB]; The Professional Association of the Alliance of International Science Organizations [ANSO-PA-2020-07 to YB]; National Natural Science Foundation of China [32030021 and 31871328 to ZZ]; International Partnership Program of the Chinese Academy of Sciences [153F11KYSB20160008 to ZZ].

ORCID

ORCID: 0000-0003-0102-0292 (Chen Meili)
ORCID: 0000-0002-9460-4117 (Ma Yingke)
ORCID: 0000-0002-0923-639X (Wu Song)
ORCID: 0000-0001-5739-861X (Zheng Xinchang)
ORCID: 0000-0002-9581-1329 (Kang Hongen)
ORCID: 0000-0003-4953-3417 (Sang Jian)
ORCID: 0000-0002-4466-3821 (Xu Xingjian)
ORCID: 0000-0003-3432-7151 (Hao Lili)
ORCID: 0000-0002-2673-0103 (Li Zhaohua)
ORCID: 0000-0001-7285-2630 (Gong Zheng)
ORCID: 0000-0002-2835-4340 (Xiao Jingfa)
ORCID: 0000-0001-6603-5060 (Zhang Zhang)
ORCID: 0000-0002-4396-8287 (Zhao Wenming)
ORCID: 0000-0002-9922-9723 (Bao Yiming)

References

[1] Liu Y, Du H, Li P, Shen Y, Peng H, Liu S, et al. Pan-genome of wild and cultivated soybeans. Cell 2020;182:162-76.e13.
[2] Guan Y, Chen M, Ma Y, Du Z, Yuan N, Li Y, et al. Whole-genome and time-course dual RNA-Seq analyses reveal chronic pathogenicity-related gene dynamics in the ginseng rusty root rot pathogen Ilyonectria robusta. Sci Rep 2020;10:1586.
[3] Li R, Liang F, Li M, Zou D, Sun S, Zhao Y, et al. MethBank 3.0: a database of DNA methylomes across a variety of species. Nucleic Acids Res 2018;46:D288–D95.
[4] Xiong Z, Li M, Yang F, Ma Y, Sang J, Li R, et al. EWAS Data Hub: a resource of DNA methylation array data and metadata. Nucleic Acids Res 2020;48:D890–D5.
[5] Song S, Tian D, Li C, Tang B, Dong L, Xiao J, et al. Genome Variation Map: a data repository of genome variations in BIG Data Center. Nucleic Acids Res 2018;46:D944–D9.
[6] Tang B, Zhou Q, Dong L, Li W, Zhang X, Lan L, et al. iDog: an integrated resource for domestic dogs and wild canids. Nucleic Acids Res 2019;47:D793–D800.
[7] McBeath J, McBeath JH. Biodiversity conservation in China: policies and practice. Journal of International Wildlife Law & Policy 2006;9:293–317.
[8] Fan H, Wu Q, Wei F, Yang F, Ng BL, Hu Y. Chromosome-level genome assembly for giant panda provides novel insights into Carnivora chromosome evolution. Genome Biol 2019;20:267.
[9] Xia Q, Zhou Z, Lu C, Cheng D, Dai F, Li B, et al. A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 2004;306:1937–40.
[10] Lin T, Xu X, Ruan J, Liu SZ, Wu SG, Shao XJ, et al. Genome analysis of Taraxacum kok-saghyz Rodin provides new insights into rubber biosynthesis. Natl Sci Rev 2018;5:78–87.
[11] Li C, Song W, Luo Y, Gao S, Zhang R, Shi Z, et al. The HuangZaoSi maize genome provides insights into genomic variation and improvement history of maize. Mol Plant 2019;12:402–9.
[12] Arita M, Karsch-Mizrachi I, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res 2021;49:D121–D4.
[13] Members C-N, Partners. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res 2021;49:D18–D28.
[14] Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 2016;17:66.
[15] Zhao WM, Song SH, Chen ML, Zou D, Ma LN, Ma YK, et al. The 2019 novel coronavirus resource. Yi Chuan 2020;42:212–21.
[16] Song S, Ma L, Zou D, Tian D, Li C, Zhu J, et al.
The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR. Genomics, Proteomics & Bioinformatics 2020. [DOI: https://doi.org/10.1016/j.gpb.2020.09.001]
[17] Shean RC, Makhsous N, Stoddard GD, Lin MJ, Greninger AL. VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank. BMC Bioinformatics 2019;20:48.
[18] Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res 2020;48:D84–D6.
[19] Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2021;49:D10–D7.
[20] Chen FZ, You LJ, Yang F, Wang LN, Guo XQ, Gao F, et al. CNGBdb: China National GeneBank DataBase. Yi Chuan 2020;42:799–809.
[21] Wu L, Sun Q, Desmeth P, Sugawara H, Xu Z, McCluskey K, et al. World data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide. Nucleic Acids Res 2017;45:D611–D8.
[22] Zhang Z, Song S, Yu J, Zhao W, Xiao J, Bao Y. The elements of data sharing. Genomics Proteomics Bioinformatics 2020;18:1–4.
[23] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
Figure legends

Figure 1 Data model in GWH
A genome assembly accession number is prefixed with "GWH", followed by four capital letters (represented by XXXX) and 8 zeros. For genome sequence accessions, the eight digits increase in order. For gene sequence, transcript sequence, and protein sequence accessions, the letters G, T, and P, respectively, follow the GWH prefix, with six digits at the end that increase in order.

Figure 2 Major components in GWH data processing workflow

Figure 3 Statistics of genome assembly in GWH (as of December 31, 2020)
Tables

Table 1 Total data holdings in GWH

Status    Type      Animals        Plants        Fungi       Bacteria      Archaea     Viruses       Metagenomes    Others       Total
Released  Assembly  187 (5.55%)    210 (6.23%)   13 (0.39%)  220 (6.53%)   73 (2.17%)  701 (20.80%)  1957 (58.07%)  9 (0.27%)    3370
Released  Species   72 (19.41%)    139 (37.47%)  12 (3.23%)  106 (28.57%)  11 (2.96%)  19 (5.12%)    3 (0.81%)      9 (2.43%)    371
Unpublic  Assembly  6783 (48.82%)  926 (6.66%)   5 (0.04%)   68 (0.49%)    13 (0.09%)  939 (6.76%)   4702 (33.84%)  458 (3.30%)  13,894
Unpublic  Species   22 (3.67%)     549 (91.50%)  5 (0.83%)   7 (1.17%)     2 (0.33%)   6 (1.00%)     5 (0.83%)      4 (0.67%)    600
Total     Assembly  6970 (40.37%)  1136 (6.58%)  18 (0.10%)  288 (1.67%)   86 (0.50%)  1640 (9.50%)  6659 (38.57%)  467 (2.71%)  17,264
Total     Species   92 (9.69%)     675 (71.13%)  16 (1.69%)  110 (11.59%)  13 (1.37%)  24 (2.53%)    7 (0.74%)      12 (1.26%)   949
10_1101-2021_02_10_430512 ---- Prediction of adverse drug reactions associated with drug-drug interactions using hierarchical classification

Prediction of adverse drug reactions associated with drug-drug interactions using hierarchical classification

Catherine Kim1* and Nicholas Tatonetti2

1. Jericho Senior High School, 99 Cedar Swamp Rd, Jericho, NY 11753
2. Department of Biomedical Informatics, Department of Systems Biology, & Department of Medicine, Columbia University, 622 West 168th St. PH20 New York, NY 10032

*Corresponding author: cathy.kim@jerichoapps.org

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430512; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

ABSTRACT

Adverse drug reactions (ADRs) associated with drug-drug interactions (DDIs) represent a significant threat to public health. Unfortunately, most conventional methods for prediction of DDI-associated ADRs suffer from limited applicability and/or provide no mechanistic insight into DDIs. In this study, a hierarchical machine learning model was created to predict DDI-associated ADRs and pharmacological insight thereof for any drug pair. Briefly, the model takes drugs’ chemical structures as inputs to predict their target, enzyme, and transporter (TET) profiles, which are subsequently utilized to assess occurrences of ADRs, with an overall accuracy of ~91%. The robustness of the model for ADR classification was validated with DDIs involving three widely prescribed drugs. The model was then applied for interstitial lung disease (ILD) associated with DDIs involving atorvastatin, identifying the involvement of multiple targets, enzymes, and transporters in ILD. The model presented here is anticipated to serve as a versatile tool for enhancing drug safety.

INTRODUCTION

Adverse drug reactions (ADRs) represent a significant threat to public health worldwide, accounting for considerable morbidity and mortality with estimated costs of ~$500 billion annually [1, 2].
As ADRs continue to present a growing concern in modern health care systems, their identification and prevention are essential for improved drug safety and patient care. While drugs are subjected to preclinical in vitro safety profiling and clinical drug safety trials to assess drug safety, many ADRs occur in small subsets of the human population, making ADRs not readily detectable in advance [3]. Moreover, ADRs are more difficult to analyze when multiple, rather than single, drugs are administered, which has become common amongst a growing elderly population [4]. Drug-drug interactions (DDIs) between co-administered drugs appear in various forms of ADRs by different mechanisms, adding additional complexity [3].

To better address DDI-associated ADRs, an understanding of their pharmacological mechanisms is required. DDIs can occur when drugs compete for the same target [5]. DDIs also involve drug metabolizing enzymes (e.g., cytochrome P450 (CYP) enzymes) and influx and efflux drug transporters, all of which determine the absorption, distribution, metabolism, and excretion (ADME) of drugs [6]. Thus, interference with target binding, enzyme-mediated metabolism, and/or uptake and excretion of drugs may cause DDIs [5, 7-9]. Moreover, the comprehensive evaluation of entire target, enzyme, and transporter (TET) profiles, many of which are dependent on the chemical structures of drugs, and their interplay between the drugs is critical for an enhanced understanding of DDI-associated ADRs [10].

With the current inability to reliably assess DDIs in preclinical testing and clinical trials and the complex nature of DDI-associated ADRs, a data-driven computational approach is well-suited for predicting such ADRs. This approach may benefit from extensive ADR databases, such as the FDA Adverse Event Reporting System, where data representative of a large population are collected from patients, clinicians, and pharmaceutical companies [11].
While various machine learning models have been previously developed for predicting DDI-associated ADRs with considerable accuracy, they suffer from major limitations. Most currently available models are based on drug similarity, providing accurate prediction only when the drug in question is similar to existing drugs with known TET profiles and/or ADR information [12-15]. This requirement makes these models not readily applicable when such information is unavailable, for example, when a drug is still under development. Moreover, conventional models provide no pharmacological insight into DDI-associated ADRs. The availability of such a priori mechanistic understanding can lay the theoretical foundations on which a novel, effective pharmacological strategy can be developed. Overall, a novel computational approach to evaluate associations between DDIs and ADRs and to determine their molecular basis is urgently needed for better drug design and enhanced drug safety.

This study reports the development of a hierarchical machine learning model to predict risks of various DDI-associated ADRs and their underlying pharmacological mechanisms. This model consists of two layers of classifiers for the prediction of TET profiles and occurrences of ADRs from chemical structures of a drug pair, requiring no drug similarity. The model was tested for its robustness with three case studies and then employed to elucidate the origin of a rare ADR, interstitial lung disease (ILD), associated with DDIs.
METHODS

All computations were performed with Python 3.8.3 on Jupyter Notebook 6.0.1 (Anaconda Navigator 1.9.12), unless otherwise noted.

Statistical Analyses of ADRs

Proportional reporting ratios (PRRs) were calculated for each drug pair and corresponding adverse drug reactions (ADRs) from the TWOSIDES v0.1 database [3], as described previously [3, 16], using the equation

$$\mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)}$$

where a = the number of patients who were administered the drug pair and were reported for the ADR, b = the number of patients who were administered the drug pair and were not reported for the ADR, c = the number of patients who were not administered the drug pair and were reported for the ADR, and d = the number of patients who were not administered the drug pair and were not reported for the ADR. The TWOSIDES v0.1 database was created by applying propensity score matching to the FDA Adverse Event Reporting System [11] in order to account for covariates in the dataset and eliminate potential bias [3], and was used directly for the PRR calculations in this study. The numbers of unique drug pairs and ADRs used in this study were 211,990 and 12,726, respectively. For drug pairs containing one of three widely prescribed drugs (levothyroxine, omeprazole, and atorvastatin), all of their reported ADRs were extracted from the TWOSIDES v0.1 database using the Python pandas 1.1.2 library [17].
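The PRR definition in this section can be made concrete with a minimal sketch. The function name and the counts are illustrative assumptions, not values taken from TWOSIDES, and returning `None` for an undefined ratio is also an assumption, since the text does not say how empty cells were handled.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Return PRR = (a / (a + b)) / (c / (c + d)), or None if undefined.

    a: patients given the drug pair, ADR reported
    b: patients given the drug pair, ADR not reported
    c: patients not given the drug pair, ADR reported
    d: patients not given the drug pair, ADR not reported
    """
    exposed = a + b
    unexposed = c + d
    # Assumption: an empty group or a zero comparator rate leaves the PRR undefined.
    if exposed == 0 or unexposed == 0 or c == 0:
        return None
    return (a / exposed) / (c / unexposed)


# Hypothetical counts, chosen only so the arithmetic is easy to follow:
# (25 / 100) / (125 / 1000) = 0.25 / 0.125
example_prr = proportional_reporting_ratio(a=25, b=75, c=125, d=875)
print(example_prr)
```

A PRR above 1 indicates that the ADR is reported proportionally more often among patients given the drug pair than among those who were not.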
Determination of Chemical Fingerprints of Drugs

For all the drugs listed in the DrugBank 5.1.7 database [18], chemical structures in the simplified molecular-input line-entry system (SMILES) format were obtained directly from the database or from PubChem v1.6.3.b [18, 19]. The SMILES were stored in a 2D representation with the Python RDKit 2020.03.1 library and used to produce a chemical fingerprint for each drug by calculating its Molecular Access System (MACCS) keys [20]. Binary string representations of the MACCS keys were stored in a Python pandas 1.1.2 dataframe [17].

Construction of Target, Enzyme, and Transporter Profiles of Drugs

Annotations about 4,263 unique targets, 316 unique enzymes, and 286 unique transporters were collected from DrugBank 5.1.7 [18] to create TET vectors of drugs using the Python NumPy 1.19.4 library [17]. Each unique TET was assigned a position in the TET vector, and one TET vector was created for each drug to represent its pharmacological profile. Briefly, in each position of a drug’s TET vector, the value of “1” was assigned if any action of the drug (e.g., as a ligand, substrate, inhibitor, activator, agonist, antagonist) on the corresponding target, enzyme, or transporter was noted in DrugBank 5.1.7, and the value of “0” was assigned otherwise.
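The TET vector construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses plain Python lists instead of NumPy arrays, the helper name `build_tet_vector` is hypothetical, and the vocabulary is a toy subset of seven TET names mentioned in this paper rather than the full set of 4,865 positions.

```python
from typing import Dict, Iterable, List


def build_tet_vector(drug_tets: Iterable[str], tet_index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 at the position of every annotated TET."""
    vector = [0] * len(tet_index)
    for name in drug_tets:
        vector[tet_index[name]] = 1
    return vector


# Toy vocabulary (assumption): a small subset of the TETs named in the text.
all_tets = ["ITGAV", "ITGB3", "THRA", "THRB", "CYP2C8", "UGT1A1", "ABCB1"]
tet_index = {name: position for position, name in enumerate(all_tets)}

# Levothyroxine acts on every entry of this toy vocabulary; eptifibatide only on ITGB3.
levothyroxine = build_tet_vector(all_tets, tet_index)
eptifibatide = build_tet_vector(["ITGB3"], tet_index)

# Positions set to 1 in both vectors are TETs shared by the drug pair.
shared = [name for name in all_tets
          if levothyroxine[tet_index[name]] and eptifibatide[tet_index[name]]]
print(eptifibatide)  # [0, 1, 0, 0, 0, 0, 0]
print(shared)        # ['ITGB3']
```

Keeping a fixed position for each TET means vectors of different drugs can be compared or stacked element by element.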
Development of RFCs for Prediction of Target, Enzyme, and Transporter Profiles of Drugs

Random forest classifiers (RFCs) were constructed for prediction of targets, enzymes, and transporters from the chemical structure of a drug using the Python scikit-learn 0.23.2 library [17]. These models formed the first layer of the hierarchical model. The dataset of the drugs’ MACCS keys and TET vectors was split into training (75% of dataset) and testing (25% of dataset) sets. RFCs were trained and tested to predict TET profiles from MACCS key representations (i.e., chemical fingerprints) of the drugs. During training and testing of the RFCs, model accuracies were measured and averaged.

Development of a Model for Prediction of DDI-associated ADRs from TET Profiles of Drugs

TET vectors of a drug pair were combined to form its TET matrix, which was then matched to the drug pair’s PRRs for various ADRs reported in the TWOSIDES v0.1 database. The calculated PRRs for different ADRs of each drug pair were categorically encoded with the value of “0” when 0 ≤ PRR < 1, “1” when PRR = 1, and “2” when PRR > 1. The processed PRR dataset with the matched TET matrices for the drug pairs was split into training (75% of dataset) and testing (25% of dataset) sets. Three machine learning algorithms, Random Forest Classifier (RFC) [21], Logistic Regression (LR) [22], and Support Vector Machine (SVM) [23], were constructed as classifiers for ADR prediction using the Python scikit-learn 0.23.2 library. The models were fit with the default tuning parameters of the Python scikit-learn 0.23.0 library. Model accuracies were measured using 10-fold cross-validation, as described elsewhere [24]. The SVM model was chosen as the second layer of the hierarchical model.

Pathway Analysis of ADRs

The key genes/proteins involved in ILD associated with DDIs involving atorvastatin were determined from the pathway database Reactome [25]. Various repositories on gene/protein interactions and pathways, such as BioGRID [26], ProteomicsDB [27], STRING [28], and CORUM [29], were used to identify interactions between drug targets and genes/proteins involved in ADRs. The NCBI Gene database [30] was used to determine tissue-specific gene expression levels.

RESULTS AND DISCUSSION

Model Overview

A hierarchical model to predict ADRs from the chemical structures of a drug pair was developed. In this model, the input variables were the chemical structures of drugs (Fig. 1A) and the output variables were the PRRs for various ADRs (Fig. 1B). A PRR > 1 indicates a high risk of an ADR for a given drug pair, whereas a PRR < 1 suggests that a given ADR is less commonly reported for a drug pair, relative to other drug pairs [3, 16]. A PRR of 1 indicates a statistically neutral association between a drug pair and an ADR.
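The three-level encoding of PRRs described in the Methods (0 for PRR < 1, 1 for PRR = 1, 2 for PRR > 1) maps directly onto this interpretation. A minimal sketch follows; the function name and the optional `tol` parameter are assumptions added here for illustration, since the text states an exact comparison with 1.

```python
def encode_prr(prr: float, tol: float = 0.0) -> int:
    """Map a PRR to the class labels 0 (PRR < 1), 1 (PRR = 1), or 2 (PRR > 1).

    `tol` is a hypothetical tolerance around 1; the default of 0.0 reproduces
    the exact comparison stated in the text.
    """
    if prr < 0:
        raise ValueError("PRR cannot be negative")
    if abs(prr - 1.0) <= tol:
        return 1
    return 0 if prr < 1.0 else 2


# Hypothetical PRR values, for illustration only.
labels = [encode_prr(value) for value in (0.4, 1.0, 2.5)]
print(labels)  # [0, 1, 2]
```

Because exact equality with 1 is rare for ratios of counts, a small nonzero `tol` may be a reasonable choice in practice, but that is a design decision outside what the text specifies.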
Instead of attempting to correlate chemical structures of drugs directly with PRRs for ADRs, an intermediate tier, namely the target, enzyme, and transporter (TET) profile of each drug, was introduced to connect chemical fingerprints of drugs with various ADRs (Fig. 1). The TET profiles depend on the drugs’ chemical structures [24, 31]. In turn, the TET profiles determine the drugs’ ADME processes and their ultimate pharmacological actions through on- and off-targeting, all of which play a dominant role in ADRs [5-9]. Thus, the intermediate tier of the TET profiles serves as an essential component in the hierarchical machine learning model that connects the input variables (i.e., chemical structures of drugs) and the output variables (i.e., PRRs for various ADRs), while allowing for a deeper mechanistic understanding of ADRs (Fig. 1).

Prediction of Target, Enzyme, and Transporter Profiles from Chemical Fingerprints of Drugs

To predict the pharmacological profiles (i.e., TET profiles) from the chemical fingerprints (i.e., MACCS keys) of drugs, Random Forest Classifiers (RFCs) were constructed. The RFCs achieved high (>95%) accuracy across TET profile prediction (Table 1). The accuracies of these models were higher than those of other machine learning algorithms previously developed for the classification of drugs inhibiting a specific transporter [24]. For prediction of the entire TET profile, the testing accuracy of the RFC models is estimated to be 94.63% (= 99.54% × 96.82% × 98.19%; Table 1). The TET prediction method presented in this study may address limitations of costly and often time-inefficient preclinical in vitro experiments to determine TET profiles [32]. Moreover, in vitro methods to assess a drug’s action on transporters are not well established, presenting another limitation [33].
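The whole-profile estimate of 94.63% quoted above is the product of the three testing accuracies. The short check below reproduces that arithmetic; note that multiplying accuracies treats the three classifiers' errors as independent, which is an assumption of this reading rather than something the text states explicitly.

```python
from math import prod

# The three RFC testing accuracies quoted in the text (Table 1).
rfc_test_accuracies = [0.9954, 0.9682, 0.9819]

# Product of the per-classifier accuracies (independence assumed).
whole_profile_accuracy = prod(rfc_test_accuracies)
print(f"{whole_profile_accuracy:.2%}")  # 94.63%
```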
Other computational approaches, such as molecular docking, require the 3D chemical structures of TETs [34], which are often lacking for newly identified drug targets and many transporters [24]. Requiring no such 3D structural information, the RFCs presented here allow for accurate, thorough, and inexpensive evaluations of TET profiles of drugs, even those still under development.

ADR Prediction from Target, Enzyme, and Transporter Profiles of Drug Pairs

To predict ADRs of a drug pair from its TET profiles, Random Forest Classifier (RFC), Logistic Regression (LR), and Support Vector Machine (SVM) models were developed and evaluated using 10-fold cross-validation. The RFC and LR models achieved mean classification accuracies (i.e., the fraction of ADRs correctly classified from a drug pair’s TET matrix) of 91.96% (Fig. 2A) and 86.63% (Fig. 2B), respectively; the SVM model outperformed both, with a mean classification accuracy of 95.73% (Fig. 2C).

Application of the SVM Model for DDI-associated ADRs Involving Three Major Drugs

The SVM model was further tested for its robustness with DDIs involving three commonly prescribed drugs: levothyroxine, a synthetic hormone to treat hypothyroidism [35], omeprazole, a proton pump inhibitor for gastric acid-related disorders [36], and atorvastatin, an inhibitor of 3-hydroxy-3-methyl-glutaryl-CoA (HMG-CoA) reductase used for lowering lipid concentrations to treat hypercholesterolemia [37].
(1) Case study 1: Levothyroxine

To apply the SVM model to DDIs associated with levothyroxine, eptifibatide was chosen as the concomitant drug, since the co-administration of levothyroxine and a blood thinner (e.g., eptifibatide) was previously found to cause a bleeding-related ADR [38, 39] via inhibition of platelet aggregation [38, 40]. Levothyroxine has four major targets (integrin subunit αV (ITGAV), integrin subunit βIII (ITGB3), thyroid hormone receptor α (THRA), and thyroid hormone receptor β (THRB)), two metabolizing enzymes (cytochrome P450 (CYP) 2C8 (CYP2C8) and UDP-glucuronosyltransferase 1A1 (UGT1A1)), and nine transporters (ATP-binding cassette sub-family B member 1 (ABCB1), solute carrier (SLC) family 7 member 5 (SLC7A5), SLC16A2, solute carrier organic anion transporter (SLCO) 1A2 (SLCO1A2), SLCO1B1, SLCO1B3, SLCO1C1, SLCO2B1, and SLCO4A1; Fig. 3A). THRA and THRB are nuclear receptors for levothyroxine, regulating transcription of hormone-responsive genes (referred to as genomic actions) [41, 42]. The heterodimeric complex ITGAV-ITGB3 is another receptor for levothyroxine [41, 42], mediating the drug’s nongenomic actions, such as the proliferation of endothelial cells [41, 42]. Eptifibatide’s TET profile contains only one target, ITGB3 (Fig. 3A), which is complexed with integrin subunit αIIb [43] to form a heterodimeric complex, ITGB3-ITGA2B, mediating platelet aggregation [40, 43].
The co-administration of eptifibatide, which inhibits ITGB3-ITGA2B’s binding to fibrinogen, can reduce platelet aggregation [38, 40]. Thus, the sharing of ITGB3 by levothyroxine and eptifibatide (Fig. 3A) may be responsible for ADRs, as is the case with other combinations of drugs binding to the same pharmacological targets [44]. The predictive power of the SVM model, particularly regarding the role of the shared target (i.e., ITGB3) in DDI-associated ADRs, was evaluated through comparisons with statistical results. Briefly, the PRRs for various ADRs associated with the co-administration of levothyroxine and eptifibatide were calculated from statistical analyses of TWOSIDES v0.1. The PRRs for levothyroxine alone and eptifibatide alone were calculated similarly using OFFSIDES v0.1 [3]. Then, for a given ADR, its PRR for the co-administration of levothyroxine and eptifibatide was subtracted from the average PRR for the single administrations of levothyroxine and eptifibatide (i.e., Δ PRR = average PRR for single administrations of levothyroxine and eptifibatide − PRR for their co-administration). Highly negative values of this difference (e.g., Δ PRR below the 99% confidence interval of Δ PRRs for the “No Change” group) are indicative of strong DDIs. The calculated PRR differences were then compared with prediction results, which were obtained using the SVM model upon removal of ITGB3 as a target from the TET profile of eptifibatide. The comparison indicates that the risks of most ADRs associated with strong DDIs between levothyroxine and eptifibatide are predicted to decrease if the TET profile of eptifibatide lacks ITGB3 as a target, suggesting the critical role of shared ITGB3 in the DDIs (Fig. 3B).

(2) Case study 2: Omeprazole

For a subsequent analysis, clopidogrel, an antiplatelet drug for the treatment of cardiovascular diseases [45], was chosen as the concomitant drug with omeprazole.
Omeprazole has two major targets (aryl hydrocarbon receptor (AHR) and potassium-transporting ATPase α chain 1 (ATP4A)), nine metabolizing enzymes (CYP1A1, CYP1A2, CYP1B1, CYP2C8, CYP2C9, CYP2C18, CYP2C19, CYP2D6, and CYP3A4), and three transporters (ABCB1, ABC subfamily C member 3 (ABCC3), and ABC subfamily G member 2 (ABCG2); Fig. 4A). While omeprazole and clopidogrel share no pharmacological targets, they have multiple enzymes and a single transporter in common (Fig. 4A). The concomitant use of omeprazole was found to lower the platelet inhibitory effects of clopidogrel [46, 47], increasing the risk of reinfarction [48, 49] and major cardiovascular events [50], compared with patients receiving clopidogrel alone. Accordingly, the FDA has recommended not using omeprazole together with clopidogrel unless absolutely required [47]. To identify the major pharmacological determinants responsible for DDIs between omeprazole and clopidogrel, each of the targets, enzymes and transporters was removed one at a time from the TET profile of omeprazole and the effects of each removal on the DDI-associated ADRs were predicted using the SVM model developed in this study. The result from this analysis showed that CYP1A2, CYP2C8, CYP2C9, CYP2C19, CYP3A4, ABCB1, and ABCC3 may play key roles in DDIs between omeprazole and clopidogrel (Fig. 4B). Consistent with this result, CYP2C19 was previously identified as a key enzyme mediating DDIs between omeprazole and clopidogrel [47].
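The one-at-a-time removal analysis described above can be sketched as a generic ablation loop. Everything here is a hedged illustration: `rank_by_removal` is a hypothetical helper, and `toy_predict` with its arbitrary weights is a stand-in for the trained SVM, which is not reproduced in this sketch.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def rank_by_removal(
    profile: Iterable[str],
    predict: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Remove each TET in turn and record the change in the predicted score."""
    full = frozenset(profile)
    baseline = predict(full)
    changes = {}
    for tet in sorted(full):
        # Negative change: the score drops when this TET is removed.
        changes[tet] = predict(full - {tet}) - baseline
    return changes


def toy_predict(profile: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a trained model; the weights are arbitrary."""
    weights = {"CYP2C19": 0.6, "CYP3A4": 0.3}
    return 1.0 + sum(weights.get(tet, 0.0) for tet in profile)


changes = rank_by_removal({"CYP2C19", "CYP3A4", "CYP2D6"}, toy_predict)
most_influential = min(changes, key=changes.get)
print(most_influential)  # CYP2C19
```

TETs whose removal produces the largest change are the candidates for mediating the interaction; in this toy setup the predictor gives CYP2D6 no weight, so its removal changes nothing.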
For its anti-platelet aggregation effect, clopidogrel needs to be converted by CYP2C19 to an active metabolite [51, 52], which prevents activation of P2RY12 required for platelet activation and aggregation [53]. Thus, omeprazole, an inhibitor of CYP2C19 [47, 54], can prevent the biotransformation of clopidogrel required for efficacy, causing DDI-associated ADRs [46-49]. The analysis also suggests the possible involvement of other enzymes and transporters in DDIs between omeprazole and clopidogrel (Fig. 4B), as supported by previous reports. For example, the metabolic activation of clopidogrel was also found to be mediated by CYP3A4 [55, 56]. In addition, the high likelihood of CYP1A2 mediating DDIs involving omeprazole was previously proposed based on omeprazole’s ability to induce CYP1A2 activity [57, 58], though this remains under debate [47, 59, 60]. Omeprazole is a weak inhibitor of CYP2D6 relative to CYP2C19 and CYP3A4 [47], making CYP2D6-mediated DDIs less likely. An efflux transporter, ABCB1, may be an active player in these DDIs, as omeprazole interferes with the efflux of other drugs (e.g., digoxin [61] and nifedipine [62]) by ABCB1 [47]. Out of these enzymes and transporters, CYP2C19 was chosen as a key enzyme for subsequent comparative analyses. For this examination, Δ PRR (= the average PRR of omeprazole alone and clopidogrel alone − the PRR for the co-administration of omeprazole and clopidogrel) was calculated as a DDI index for each ADR, as described above.
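The Δ PRR index used in these case studies is a simple difference, sketched below. The function names and the example values are hypothetical. In particular, the lower bound of the “No Change” interval is passed in as a parameter because the text does not spell out how that interval is computed.

```python
def delta_prr(prr_drug_a: float, prr_drug_b: float, prr_pair: float) -> float:
    """Average PRR of the two single administrations minus the PRR of the pair."""
    return (prr_drug_a + prr_drug_b) / 2 - prr_pair


def is_strong_ddi(delta: float, no_change_lower_bound: float) -> bool:
    """Flag a strong DDI when the index falls below a caller-supplied bound.

    `no_change_lower_bound` is an assumed input: the lower end of the 99%
    interval of the index for the "No Change" group.
    """
    return delta < no_change_lower_bound


# Hypothetical PRRs, for illustration only.
index = delta_prr(prr_drug_a=1.2, prr_drug_b=0.8, prr_pair=3.0)
print(round(index, 2))                                     # -2.0
print(is_strong_ddi(index, no_change_lower_bound=-0.5))    # True
```

A strongly negative index means the pair is reported with the ADR far more often than either drug alone would suggest.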
Δ PRR values were then compared with the PRR changes predicted by the SVM model when CYP2C19 was removed from the TET profile of omeprazole. Overall, the predictions and the calculations were in good agreement for ADRs significantly associated with DDIs (either negatively or positively, as judged by Δ PRR relative to the 99% confidence interval of Δ PRRs for the “No Change” group) between omeprazole and clopidogrel (Fig. 4C), supporting the role of CYP2C19 in their DDIs, as described elsewhere [47]. The comparison also indicates that correct predictions for drug pairs with little to no DDI (that is, Δ PRR ~ 0) are difficult with this model.

(3) Case study 3: Atorvastatin

To further validate the SVM model, a similar computational approach was applied to atorvastatin for its well-known DDI-associated ADR, myopathy [63-66]. Atorvastatin has five major targets (AHR, dipeptidyl peptidase 4 (DPP4), histone deacetylase 2 (HDAC2), 3-hydroxy-3-methylglutaryl-coenzyme A reductase (HMGCR), and nuclear receptor subfamily 1 group I member 3 (NR1I3)), ten metabolizing enzymes (CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP3A7, UGT1A1, and UDP-glucuronosyltransferase 1A3 (UGT1A3)), and ten transporters (ABCB1, ATP-binding cassette sub-family B member 11 (ABCB11), ATP-binding cassette sub-family C member (ABCC) 1 (ABCC1), ABCC2, ABCC4, ABCC5, SLCO1A2, SLCO1B1, SLCO1B3, and SLCO2B1; Fig. 5A). Atorvastatin’s actions on multiple TETs indicate its pharmacological complexity. The SVM model was used to derive TETs important in atorvastatin-induced myopathy through predicted PRR changes of drug pairs upon removal of each of the targets, enzymes, and transporters from atorvastatin’s TET profile. The SVM model predicted the importance of CYP2C9, CYP2C19, CYP3A4, UGT1A1, ABCB1, ABCB11, ABCC1, ABCC2, ABCC4, SLCO1A2, and SLCO1B1 in atorvastatin DDI-associated myopathy (Fig. 5B).
Consistent with this result, the co-administration of drugs that are either inhibitors or substrates of CYP3A4 was found to decrease the metabolism of atorvastatin [67]. As a result, the plasma concentration of atorvastatin increases, leading to the onset of ADRs [67], including myopathy [66]. In addition, polymorphisms in the CYP2C19, UGT1A1, ABCB1 and SLCO1B1 genes are associated with systemic exposure to atorvastatin, an important risk factor for myopathy [68, 69]. Drugs inhibiting SLCO1B1 and ABCB1, most of which are CYP3A4 inhibitors [70], can cause DDIs with atorvastatin [70]. While the crucial role of CYP2C8 in DDIs involving other statins (e.g., simvastatin and lovastatin) has been documented [70, 71], the involvement of this enzyme in atorvastatin-mediated DDIs remains unclear.

The model was then tested through comparisons between the predicted and calculated PRR changes of drug combinations for myopathy. Out of the identified key molecules, CYP3A4 was chosen for further analyses due to its direct involvement in the onset of myopathy associated with DDIs involving atorvastatin, as reported previously [72, 73]. PRRs of drugs with higher degrees of DDIs (as judged by Δ PRR relative to the 99% confidence interval of Δ PRRs for the “No Change” group) with atorvastatin were predicted to decrease upon removal of atorvastatin’s CYP3A4 interaction (Fig. 5C), consistent with the literature reports on the importance of this CYP enzyme in atorvastatin-mediated myopathy [72, 73].
Similar to the results with omeprazole (case study 2), the accurate prediction of PRR changes for drug combinations with Δ PRR ~ 0 was difficult (Fig. 5C).

Overall, the results obtained with levothyroxine, omeprazole and atorvastatin in this study demonstrate the high applicability of the machine learning model for predicting DDI-associated ADRs and providing underlying pharmacological insight.

Model Application for Interstitial Lung Disease Involving DDIs with Atorvastatin

Motivated by its high prediction power, the model developed in this study was applied to a rare yet life-threatening ADR, interstitial lung disease (ILD), associated with DDIs involving atorvastatin. The PRR of the single administration of atorvastatin for ILD was calculated from OFFSIDES v0.1 to analyze its statistical association with ILD. A similar statistical analysis was extended to drug pairs containing atorvastatin for ILD. Δ PRRs (i.e., DDI indices) were calculated and plotted against PRRs of the co-administration for ILD (Fig. 6). The PRR of atorvastatin alone for ILD was 1.01, a value indicative of a statistically neutral association between atorvastatin and ILD. Between Δ PRRs and PRRs for the co-administration of atorvastatin, a strong negative linear relationship was detected (Fig. 6). The implication is that the high risks of ILD reported for pairs of atorvastatin and concomitant drugs are due to DDIs between the drugs. When calculated from the linear regression line, Δ PRR is 0.0508 (~ 0, no DDI) for drug pairs showing no associations with ILD (i.e., PRR = 1; Fig. 6), as expected, validating this analysis.

To identify important TETs in ILD associated with DDIs involving atorvastatin, the PRR changes of drug pairs were predicted by the model upon removal of each of the targets, enzymes, and transporters from atorvastatin’s TET profile (Fig. 7A). Among atorvastatin’s five targets, AHR, DPP4, HDAC2, HMGCR and NR1I3, the importance of DPP4 was minimal (Fig. 7A). In this analysis, different metabolizing enzymes seemed equally important, suggesting that DDI-associated ILD involving atorvastatin may be mediated by a set of multiple enzymes, which may be responsible for previous contradictory findings on the role of metabolizing enzymes in this type of ADR [74, 75]. Among transporters, ABCB11, SLCO1B1 and SLCO1B3, all of which are primarily expressed in the liver [30], were found to be important. ABCB11, a primary transporter of bile salts [76], was found to be involved in the biliary excretion of statins [77]. SLCO1B1 and SLCO1B3 are responsible for the uptake of atorvastatin into hepatocytes [78, 79]. Thus, the removal of these three transporters from atorvastatin’s TET profile may increase its plasma concentration, increasing risks of ILD [78, 79].

To further distinguish among the five targets, a similar procedure was conducted to calculate the number of drug pairs with predicted increases in PRRs for ILD when a target was removed from atorvastatin’s TET profile (Fig. 7B). Interestingly, when atorvastatin’s action on HMGCR and DPP4 was nullified, PRRs for ILD further increased (Fig. 7B). The implication is that when its binding to HMGCR and DPP4 becomes ineffective, atorvastatin may bind to the other three targets more strongly, increasing risks of DDIs.
No such PRR increases were observed with the removal of the other three targets (Fig. 7B). Thus, AHR, HDAC2, and NR1I3 were identified as important targets for DDI-associated ILD involving atorvastatin.

To validate these computational results, two key molecules, NR1I3 and ABCB11, which were identified by the SVM model in DDI-associated ILD with atorvastatin, were used for further analyses. For this examination, Δ PRRs (the average PRR for single administrations of atorvastatin and a concomitant drug − the PRR for their co-administration) were calculated and compared with PRR predictions by the SVM model for the drug pairs upon the removal of NR1I3 (Fig. 7C) and ABCB11 (Fig. 7D) from the TET profile of atorvastatin. In these analyses, PRRs for ILD were predicted to decrease for most drug pairs involving significant DDIs, as judged by Δ PRR below the 99% confidence interval of Δ PRRs for the “No Change” group (Fig. 7C-D), supporting the critical roles of NR1I3 and ABCB11 in DDI-associated ILD involving atorvastatin.

A few potential pathological pathways underlying DDI-associated ILD involving a high plasma concentration of atorvastatin were determined around the three important targets (AHR, NR1I3, and HDAC2) identified in this study. In this analysis, only genes/proteins significantly expressed in lung, as recorded in the NCBI Gene database [30], were considered.
The analyses revealed the high likelihood that atorvastatin binding to AHR, NR1I3, and HDAC2 may cause ILD through a major ILD mechanism, the dysregulation of surfactant production and homeostasis [80, 81]. Many interactors, such as SP1 transcription factor and estrogen receptor 1 (ESR1), are highly interconnected in the pathways around AHR, NR1I3, and HDAC2 (Fig. 8A-B). Genes/proteins important in surfactant metabolism also create a strong network (Fig. 8A-B). Thus, any impact from AHR, NR1I3 and HDAC2 can be amplified, influencing one another in these pathways.

A literature survey identified a few plausible routes by which atorvastatin can cause ILD. AHR is a transcription factor inducible by aromatic hydrocarbon-based xenobiotics, such as atorvastatin [82, 83]. Upon binding to a ligand, AHR is complexed with aryl hydrocarbon receptor nuclear translocator (ARNT; Fig. 8A) [82, 83]. The AHR/ARNT complex can then induce expression of AHR’s target genes, which code for enzymes and transporters required for xenobiotic metabolism [82, 84]. Activated AHR inhibits estrogen receptor (ESR1) activity [83] by redirecting ESR1 away from ESR1 target genes [85], such as ATP-binding cassette sub-family A member 3 (ABCA3; Fig. 8A) [86]. ABCA3 plays a critical role in the formation of pulmonary surfactant by transporting phospholipids from the endoplasmic reticulum to a surfactant storage organelle in type II epithelial cells [87, 88]. Thus, the binding of atorvastatin to AHR may cause pulmonary surfactant metabolism dysfunction by downregulating the ABCA3 gene via inhibition of ESR1 activity [89]. Different interactors (e.g., histone acetyltransferase p300 (EP300)) may be involved in ILD, amplifying the effects of AHR through networks of other interactors and ILD genes/proteins.
In addition, NR1I3 is a nuclear receptor that mediates transcriptional activation of target genes required for the metabolism and elimination of xenobiotics [82, 90, 91], such as CYP2B6 and CYP3A4 [90, 92]. Upon the binding of xenobiotics, NR1I3 is dephosphorylated for nuclear translocation and transactivation [93], which requires reduced SRC kinase activity [94]. On the other hand, P2Y purinoceptor 2 (P2RY2), a G protein-coupled receptor, activates SRC [95, 96] in order to promote surfactant secretion from alveolar type II cells [97]. Thus, the binding of atorvastatin to NR1I3 [98] can dysregulate normal surfactant secretion via interference with SRC kinase activity.

Atorvastatin inhibits HDAC2 [99]. HDAC2 and ILD genes/proteins are highly interconnected, also sharing many interactors with AHR and NR1I3, suggesting the existence of many different paths that cause ILD from HDAC2 inhibition (Fig. 8B). Interestingly, the binding of atorvastatin to HDAC2 may be related to atorvastatin’s beneficial effect against cancer [100]. HDAC2 plays a key role in the epigenetic regulation of gene expression in cancer [101] and HDAC2 inhibitors (e.g., atorvastatin [99]) can display anti-cancer activities [101]. The anti-cancer effect of atorvastatin was also statistically analyzed. In this analysis, combinations of atorvastatin and a drug that have PRRs > 1 for ILD were identified and their PRRs for lung cancer were also calculated.
None of the combinations of atorvastatin and concomitant drugs that had PRRs > 1 for ILD had PRRs > 1 for lung cancer, supporting atorvastatin’s anti-cancer effects. Overall, the model-based computational examinations and pathway analyses revealed key molecules important in DDI-associated ILD involving atorvastatin and proposed underlying pathological pathways.

CONCLUSIONS

This study presented a novel computational approach to accurately predict the occurrences of ADRs using a machine learning model consisting of hierarchically structured classifiers. The hierarchical model presented here addresses the limitations of conventional models relying on drug similarity for the prediction of ADRs. The method developed here is based on TET profile-dependencies of ADRs derived from drugs’ chemical structures, requiring no high chemical similarity of drugs. Given basic structural characteristics of drugs, this hierarchical model integrating the RFCs for TET profile prediction and the SVM for specific DDI-associated ADRs can accurately predict ADRs with an overall ~91% (= 94.63% for TET prediction × 95.73% for ADR prediction) accuracy. DDIs typically manifest as various forms of ADRs, which has been another primary issue in past predictions of DDI-associated ADRs [102]; the presented model deconvolutes this complexity, as judged by its accurate prediction of various ADRs for any drug pair. In addition, pharmacological insight offered by the hierarchical model was successfully connected to pathway analyses underlying ADRs, making the described computational approach powerful for not only predicting the occurrence of DDI-associated ADRs but also enhancing mechanistic understanding.

Notably, the constructed model can accurately predict TET profiles and DDI-associated ADRs from the most basic information about drugs, their chemical structures, for any pair. Thus, the model presented in this study can also be used for drug design. For example, the MACCS keys descriptors can be manipulated and inputted into the hierarchical model to identify a drug’s key structural characteristics that increase the risk of ADRs. Once the structural hot spots are identified, an array of drug variants with different chemical moieties at those locations can be designed and evaluated for DDIs and ADRs prior to synthesis. As a result, many drugs can readily be evaluated for their potential DDIs in advance, avoiding costly preclinical and clinical tests. Thus, the hierarchical model developed here is anticipated to pave a new way to enhance drug safety and reduce drug development costs.

REFERENCES

1. Kovacevic, M., Vezmar Kovacevic, S., Radovanovic, S., Stevanovic, P. & Miljkovic, B. (2019). Adverse drug reactions caused by drug-drug interactions in cardiovascular disease patients: introduction of a simple prediction tool using electronic screening database items.
Curr Med Res Opin, 35(11), 1873-1883.
2. Luo, J., Eldredge, C., Cho, C. C. & Cisler, R. A. (2016). Population Analysis of Adverse Events in Different Age Groups Using Big Clinical Trials Data. JMIR Med Inform, 4(4), e30.
3. Tatonetti, N. P., Ye, P. P., Daneshjou, R. & Altman, R. B. (2012). Data-driven prediction of drug effects and interactions. Sci Transl Med, 4(125), 125ra131.
4. Nguyen, T., Wong, E. & Ciummo, F. (2020). Polypharmacy in older adults: practical applications alongside a patient case. The Journal for Nurse Practitioners, 16(3), 205-209.
5. Kaneko, S. & Nagashima, T. (2020). Drug Repositioning and Target Finding Based on Clinical Evidence. Biol Pharm Bull, 43(3), 362-365.
6. Neve, E. P., Artursson, P., Ingelman-Sundberg, M. & Karlgren, M. (2013). An integrated in vitro model for simultaneous assessment of drug uptake, metabolism, and efflux. Mol Pharm, 10(8), 3152-3163.
7. Benet, L. Z., Cummins, C. L. & Wu, C. Y. (2003). Transporter-enzyme interactions: implications for predicting drug-drug interactions from in vitro data. Curr Drug Metab, 4(5), 393-398.
8. Kalliokoski, A. & Niemi, M. (2009). Impact of OATP transporters on pharmacokinetics. Br J Pharmacol, 158(3), 693-705.
9. Poirier, A., Funk, C., Lave, T. & Noe, J. (2007). New strategies to address drug-drug interactions involving OATPs. Curr Opin Drug Discov Devel, 10(1), 74-83.
10. Jamal, S., Goyal, S., Shanker, A. & Grover, A. (2017). Predicting neurological Adverse Drug Reactions based on biological, chemical and phenotypic properties of drugs using machine learning models. Sci Rep, 7(1), 872.
11. Sakaeda, T., Tamon, A., Kadoyama, K. & Okuno, Y. (2013). Data mining of the public version of the FDA Adverse Event Reporting System. Int J Med Sci, 10(7), 796-803.
12. Brouwers, L., Iskar, M., Zeller, G., van Noort, V. & Bork, P. (2011). Network neighbors of drug targets contribute to drug side-effect similarity. PLoS One, 6(7), e22187.
13. Munoz, E., Novacek, V. & Vandenbussche, P. Y. (2016). Using Drug Similarities for Discovery of Possible Adverse Reactions. AMIA Annu Symp Proc, 2016, 924-933.
14. Seo, S., Lee, T., Kim, M. H. & Yoon, Y. (2020). Prediction of Side Effects Using Comprehensive Similarity Measures. Biomed Res Int, 2020, 1357630.
15. Vilar, S., Uriarte, E., Santana, L., Lorberbaum, T., Hripcsak, G., Friedman, C. & Tatonetti, N. P. (2014). Similarity-based modeling in large-scale prediction of drug-drug interactions. Nat Protoc, 9(9), 2147-2163.
16. Noguchi, Y., Ueno, A., Otsubo, M., Katsuno, H., Sugita, I., Kanematsu, Y., Yoshida, A., Esaki, H., Tachi, T. & Teramachi, H. (2018). A simple method for exploring adverse drug events in patients with different primary diseases using spontaneous reporting system. BMC Bioinformatics, 19(1), 124.
17. Stancin, I. & Jovic, A. (2019). An overview and comparison of free Python libraries for data mining and big data analysis. 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). Anal Chem, 88(7), 3539-3547.
18. Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., Assempour, N., Iynkkaran, I., Liu, Y., Maciejewski, A., Gale, N., Wilson, A., Chin, L., Cummings, R., Le, D., Pon, A., Knox, C. & Wilson, M. (2018). DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res, 46(D1), D1074-D1082.
19.
Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., Zaslavsky, L., Zhang, J. & Bolton, E. E. (2019). PubChem 2019 update: improved access to chemical data. Nucleic Acids Res, 47(D1), D1102-D1109.
20. Du, H., Cai, Y., Yang, H., Zhang, H., Xue, Y., Liu, G., Tang, Y. & Li, W. (2017). In Silico Prediction of Chemicals Binding to Aromatase with Machine Learning Methods. Chem Res Toxicol, 30(5), 1209-1218.
21. Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
22. Dreiseitl, S. & Ohno-Machado, L. (2002). Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform, 35(5-6), 352-359.
23. Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20, 273−297.
24. Khuri, N., Zur, A. A., Wittwer, M. B., Lin, L., Yee, S. W., Sali, A. & Giacomini, K. M. (2017). Computational Discovery and Experimental Validation of Inhibitors of the Human Intestinal Transporter OATP2B1. J Chem Inf Model, 57(6), 1402-1413.
25. Jassal, B., Matthews, L., Viteri, G., Gong, C., Lorente, P., Fabregat, A., Sidiropoulos, K., Cook, J., Gillespie, M., Haw, R., Loney, F., May, B., Milacic, M., Rothfels, K., Sevilla, C., Shamovsky, V., Shorser, S., Varusai, T., Weiser, J., Wu, G., Stein, L., Hermjakob, H. & D'Eustachio, P. (2020). The reactome pathway knowledgebase. Nucleic Acids Res, 48(D1), D498-D503.
26. Oughtred, R., Stark, C., Breitkreutz, B.
J., Rust, J., Boucher, L., Chang, C., Kolas, N., O'Donnell, L., Leung, G., McAdam, R., Zhang, F., Dolma, S., Willems, A., Coulombe-Huntington, J., Chatr-Aryamontri, A., Dolinski, K. & Tyers, M. (2019). The BioGRID interaction database: 2019 update. Nucleic Acids Res, 47(D1), D529-D541.
27. Schmidt, T., Samaras, P., Frejno, M., Gessulat, S., Barnert, M., Kienegger, H., Krcmar, H., Schlegl, J., Ehrlich, H. C., Aiche, S., Kuster, B. & Wilhelm, M. (2018). ProteomicsDB. Nucleic Acids Res, 46(D1), D1271-D1281.
28. Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N. T., Morris, J. H., Bork, P., Jensen, L. J. & Mering, C. V. (2019). STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res, 47(D1), D607-D613.
29. Giurgiu, M., Reinhard, J., Brauner, B., Dunger-Kaltenbach, I., Fobo, G., Frishman, G., Montrone, C. & Ruepp, A. (2019). CORUM: the comprehensive resource of mammalian protein complexes-2019. Nucleic Acids Res, 47(D1), D559-D563.
30. Brown, G. R., Hem, V., Katz, K. S., Ovetsky, M., Wallin, C., Ermolaeva, O., Tolstoy, I., Tatusova, T., Pruitt, K. D., Maglott, D. R. & Murphy, T. D. (2015). Gene: a gene-centered information resource at NCBI. Nucleic Acids Res, 43(Database issue), D36-42.
31. Yu, W. & MacKerell, A. D., Jr. (2017). Computer-Aided Drug Design Methods. Methods Mol Biol, 1520, 85-106.
32. Meibohm, B. & Derendorf, H. (2002).
Pharmacokinetic/pharmacodynamic studies in drug product development. J Pharm Sci, 91(1), 18-31.
33. Sudsakorn, S., Bahadduri, P., Fretland, J. & Lu, C. (2020). 2020 FDA Drug-drug Interaction Guidance: A Comparison Analysis and Action Plan by Pharmaceutical Industrial Scientists. Curr Drug Metab, 21(6), 403-426.
34. El-Hachem, N., Haibe-Kains, B., Khalil, A., Kobeissy, F. H. & Nemer, G. (2017). AutoDock and AutoDockTools for Protein-Ligand Docking: Beta-Site Amyloid Precursor Protein Cleaving Enzyme 1 (BACE1) as a Case Study. Methods Mol Biol, 1598, 391-403.
35. Jonklaas, J., Bianco, A. C., Bauer, A. J., Burman, K. D., Cappola, A. R., Celi, F. S., Cooper, D. S., Kim, B. W., Peeters, R. P., Rosenthal, M. S., Sawka, A. M. & American Thyroid Association Task Force on Thyroid Hormone, R. (2014). Guidelines for the treatment of hypothyroidism: prepared by the american thyroid association task force on thyroid hormone replacement. Thyroid, 24(12), 1670-1751.
36. Shi, S. & Klotz, U. (2008). Proton pump inhibitors: an update of their clinical use and pharmacokinetics. Eur J Clin Pharmacol, 64(10), 935-951.
37. Mohammadkhani, N., Gharbi, S., Rajani, H. F., Farzaneh, A., Mahjoob, G., Hoseinsalari, A. & Korsching, E. (2019). Statins: Complex outcomes but increasingly helpful treatment options for patients. Eur J Pharmacol, 863, 172704.
38. Amoroso, G., van Boven, A. J., van Veldhuisen, D. J., Tio, R. A., Balje-Volkers, C. P., Petronio, A. S. & van Oeveren, W. (2001). Eptifibatide and abciximab exhibit equivalent antiplatelet efficacy in an experimental model of stenting in both healthy volunteers and patients with coronary artery disease. J Cardiovasc Pharmacol, 38(4), 633-641.
39. Coutinho, J., Field, J. B. & Sule, A. A. (2018). Armour(R) Thyroid Rage - A Dangerous Mixture. Cureus, 10(4), e2523.
40. Schror, K. & Weber, A. A. (2003). Comparative pharmacology of GP IIb/IIIa antagonists. J Thromb Thrombolysis, 15(2), 71-80.
41. Davis, P. J., Leonard, J. L. & Davis, F. B. (2008).
Mechanisms of nongenomic actions of thyroid hormone. Front Neuroendocrinol, 29(2), 211-218.
42. Hammes, S. R. & Davis, P. J. (2015). Overlapping nongenomic and genomic actions of thyroid hormone and steroids. Best Pract Res Clin Endocrinol Metab, 29(4), 581-593.
43. Phillips, D. R., Charo, I. F. & Scarborough, R. M. (1991). GPIIb-IIIa: the responsive integrin. Cell, 65(3), 359-362.
44. Torres, N. B. & Altafini, C. (2016). Drug combinatorics and side effect estimation on the signed human drug-target network. BMC Syst Biol, 10(1), 74.
45. Lee, C. H., Franchi, F. & Angiolillo, D. J. (2020). Clopidogrel drug interactions: a review of the evidence and clinical implications. Expert Opin Drug Metab Toxicol, 16(11), 1079-1096.
46. Ho, P. M., Maddox, T. M., Wang, L., Fihn, S. D., Jesse, R. L., Peterson, E. D. & Rumsfeld, J. S. (2009). Risk of adverse outcomes associated with concomitant use of clopidogrel and proton pump inhibitors following acute coronary syndrome. JAMA, 301(9), 937-944.
47. Ogawa, R. & Echizen, H. (2010). Drug-drug interaction profiles of proton pump inhibitors. Clin Pharmacokinet, 49(8), 509-533.
48. Evanchan, J., Donnally, M. R., Binkley, P. & Mazzaferri, E. (2010). Recurrence of acute myocardial infarction in patients discharged on clopidogrel and a proton pump inhibitor after stent placement for acute myocardial infarction. Clin Cardiol, 33(3), 168-171.
49. Stockl, K. M., Le, L., Zakharyan, A., Harada, A. S., Solow, B. K., Addiego, J. E. & Ramsey, S. (2010).
Risk of rehospitalization for patients using clopidogrel with a proton pump inhibitor. Arch Intern Med, 170(8), 704-710.
50. Gaglia, M. A., Jr., Torguson, R., Hanna, N., Gonzalez, M. A., Collins, S. D., Syed, A. I., Ben-Dor, I., Maluenda, G., Delhaye, C., Wakabayashi, K., Xue, Z., Suddath, W. O., Kent, K. M., Satler, L. F., Pichard, A. D. & Waksman, R. (2010). Relation of proton pump inhibitor use after percutaneous coronary intervention with drug-eluting stents to outcomes. Am J Cardiol, 105(6), 833-838.
51. Savi, P., Pereillo, J. M., Uzabiaga, M. F., Combalbert, J., Picard, C., Maffrand, J. P., Pascal, M. & Herbert, J. M. (2000). Identification and biological activity of the active metabolite of clopidogrel. Thromb Haemost, 84(5), 891-896.
52. Umemura, K., Furuta, T. & Kondo, K. (2008). The common gene variants of CYP2C19 affect pharmacokinetics and pharmacodynamics in an active metabolite of clopidogrel in healthy subjects. J Thromb Haemost, 6(8), 1439-1441.
53. Herbert, J. M. & Savi, P. (2003). P2Y12, a new platelet ADP receptor, target of clopidogrel. Semin Vasc Med, 3(2), 113-122.
54. Li, X. Q., Andersson, T. B., Ahlstrom, M. & Weidolf, L. (2004). Comparison of inhibitory effects of the proton pump-inhibiting drugs omeprazole, esomeprazole, lansoprazole, pantoprazole, and rabeprazole on human cytochrome P450 activities. Drug Metab Dispos, 32(8), 821-827.
55. Clarke, T. A. & Waskell, L. A. (2003). The metabolism of clopidogrel is catalyzed by human cytochrome P450 3A and is inhibited by atorvastatin.
Drug Metab Dispos, 31(1), 53-59.
56. Farid, N. A., Payne, C. D., Small, D. S., Winters, K. J., Ernest, C. S., 2nd, Brandt, J. T., Darstein, C., Jakubowski, J. A. & Salazar, D. E. (2007). Cytochrome P450 3A inhibition by ketoconazole affects prasugrel and clopidogrel pharmacokinetics and pharmacodynamics differently. Clin Pharmacol Ther, 81(5), 735-741.
57. Diaz, D., Fabre, I., Daujat, M., Saint Aubert, B., Bories, P., Michel, H. & Maurel, P. (1990). Omeprazole is an aryl hydrocarbon-like inducer of human hepatic cytochrome P450. Gastroenterology, 99(3), 737-747.
58. Rost, K. L., Brosicke, H., Brockmoller, J., Scheffler, M., Helge, H. & Roots, I. (1992). Increase of cytochrome P450IA2 activity by omeprazole: evidence by the 13C-[N-3-methyl]-caffeine breath test in poor and extensive metabolizers of S-mephenytoin. Clin Pharmacol Ther, 52(2), 170-180.
59. Rizzo, N., Padoin, C., Palombo, S., Scherrmann, J. M. & Girre, C. (1996). Omeprazole and lansoprazole are not inducers of cytochrome P4501A2 under conventional therapeutic conditions. Eur J Clin Pharmacol, 49(6), 491-495.
60. Xiaodong, S., Gatti, G., Bartoli, A., Cipolla, G., Crema, F. & Perucca, E. (1994). Omeprazole does not enhance the metabolism of phenacetin, a marker of CYP1A2 activity, in healthy volunteers. Ther Drug Monit, 16(3), 248-250.
61. Oosterhuis, B., Jonkman, J. H., Andersson, T., Zuiderwijk, P. B. & Jedema, J. N. (1991). Minor effect of multiple dose omeprazole on the pharmacokinetics of digoxin after a single oral dose. Br J Clin Pharmacol, 32(5), 569-572.
62.
Soons, P. A., van den Berg, G., Danhof, M., van Brummelen, P., Jansen, J. B., Lamers, C. B. & Breimer, D. D. (1992). Influence of single- and multiple-dose omeprazole treatment on nifedipine pharmacokinetics and effects in healthy subjects. Eur J Clin Pharmacol, 42(3), 319-324.
63. Bataillard, M., Beyens, M. N., Mounier, G., Vergnon-Miszczycha, D., Bagheri, H. & Cathebras, P. (2019). Muscle Damage Due to Fusidic Acid-Statin Interaction: Review of 75 Cases From the French Pharmacovigilance Database and Literature Reports. Am J Ther, 26(3), e375-e379.
64. Boonmuang, P., Nathisuwan, S., Chaiyakunapruk, N., Suwankesawong, W., Pokhagul, P., Teerawattanapong, N. & Supsongserm, P. (2013). Characterization of Statin-Associated Myopathy Case Reports in Thailand Using the Health Product Vigilance Center Database. Drug Saf, 36(9), 779-787.
65. Brahmachari, B. & Chatterjee, S. (2015). Myopathy induced by statin-ezetimibe combination: Evaluation of potential risk factors. Indian J Pharmacol, 47(5), 563-564.
66. du Souich, P., Roederer, G. & Dufour, R. (2017). Myotoxicity of statins: Mechanism of action. Pharmacol Ther, 175, 1-16.
67. Hirota, T. & Ieiri, I. (2015). Drug-drug interactions that interfere with statin metabolism. Expert Opin Drug Metab Toxicol, 11(9), 1435-1447.
68. Marusic, S., Lisicic, A., Horvatic, I., Bacic-Vrca, V. & Bozina, N. (2012). Atorvastatin-related rhabdomyolysis and acute renal failure in a genetically predisposed patient with potential drug-drug interaction. Int J Clin Pharm, 34(6), 825-827.
69. Stormo, C., Bogsrud, M. P., Hermann, M., Asberg, A., Piehler, A. P., Retterstol, K. & Kringen, M. K. (2013). UGT1A1*28 is associated with decreased systemic exposure of atorvastatin lactone. Mol Diagn Ther, 17(4), 233-237.
70. Neuvonen, P. J., Niemi, M. & Backman, J. T. (2006). Drug interactions with lipid-lowering drugs: mechanisms and clinical relevance. Clin Pharmacol Ther, 80(6), 565-581.
71. Ogilvie, B. W., Zhang, D., Li, W., Rodrigues, A. D., Gipson, A. E., Holsapple, J., Toren, P. & Parkinson, A. (2006). Glucuronidation converts gemfibrozil to a potent, metabolism-dependent inhibitor of CYP2C8: implications for drug-drug interactions. Drug Metab Dispos, 34(1), 191-197.
72. Canestaro, W. J., Austin, M. A. & Thummel, K. E. (2014). Genetic factors affecting statin concentrations and subsequent myopathy: a HuGENet systematic review. Genet Med, 16(11), 810-819.
73. Gluba-Brzozka, A., Franczyk, B., Toth, P. P., Rysz, J. & Banach, M. (2016). Molecular mechanisms of statin intolerance. Arch Med Sci, 12(3), 645-658.
74. Fernandez, A. B., Karas, R. H., Alsheikh-Ali, A. A. & Thompson, P. D. (2008). Statins and interstitial lung disease: a systematic review of the literature and of food and drug administration adverse event reports. Chest, 134(4), 824-830.
75. Zanger, U. M. & Schwab, M. (2013). Cytochrome P450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol Ther, 138(1), 103-141.
76. Pedersen, J. M., Matsson, P., Bergstrom, C. A., Hoogstraate, J., Noren, A., LeCluyse, E. L. & Artursson, P. (2013). Early identification of clinically relevant drug interactions with the human bile salt export pump (BSEP/ABCB11). Toxicol Sci, 136(2), 328-343.
77. Hirano, M., Maeda, K., Hayashi, H., Kusuhara, H. & Sugiyama, Y. (2005). Bile salt export pump (BSEP/ABCB11) can transport a nonbile acid substrate, pravastatin. J Pharmacol Exp Ther, 314(2), 876-882.
78.
Hirota, T., Fujita, Y. & Ieiri, I. (2020). An updated review of pharmacokinetic drug interactions and pharmacogenetics of statins. Expert Opin Drug Metab Toxicol, 16(9), 809-822.
79. Zhang, L., Lv, H., Zhang, Q., Wang, D., Kang, X., Zhang, G. & Li, X. (2019). Association of SLCO1B1 and ABCB1 Genetic Variants with Atorvastatin-induced Myopathy in Patients with Acute Ischemic Stroke. Curr Pharm Des, 25(14), 1663-1670.
80. Akella, A. & Deshpande, S. B. (2013). Pulmonary surfactants and their role in pathophysiology of lung disorders. Indian J Exp Biol, 51(1), 5-22.
81. Whitsett, J. A., Wert, S. E. & Weaver, T. E. (2010). Alveolar surfactant homeostasis and the pathogenesis of pulmonary disease. Annu Rev Med, 61, 105-119.
82. Larigot, L., Juricek, L., Dairou, J. & Coumoul, X. (2018). AhR signaling pathways and regulatory functions. Biochim Open, 7, 1-9.
83. Matthews, J. & Gustafsson, J. A. (2006). Estrogen receptor and aryl hydrocarbon receptor signaling pathways. Nucl Recept Signal, 4, e016.
84. Hankinson, O. (1995). The aryl hydrocarbon receptor complex. Annu Rev Pharmacol Toxicol, 35, 307-340.
85. Matthews, J., Wihlen, B., Thomsen, J. & Gustafsson, J. A. (2005). Aryl hydrocarbon receptor-mediated transcription: ligand-dependent recruitment of estrogen receptor alpha to 2,3,7,8-tetrachlorodibenzo-p-dioxin-responsive promoters. Mol Cell Biol, 25(13), 5317-5328.
86. Lin, C. Y., Strom, A., Vega, V. B., Kong, S. L., Yeo, A. L., Thomsen, J. S., Chan, W. C., Doray, B., Bangarusamy, D. K., Ramasamy, A., Vergara, L.
A., Tang, S., Chong, A., Bajic, V. B., Miller, L. D., Gustafsson, J. A. & Liu, E. T. (2004). Discovery of estrogen receptor alpha target genes and response elements in breast tumor cells. Genome Biol, 5(9), R66.
87. Klugbauer, N. & Hofmann, F. (1996). Primary structure of a novel ABC transporter with a chromosomal localization on the band encoding the multidrug resistance-associated protein. FEBS Lett, 391(1-2), 61-65.
88. Yamano, G., Funahashi, H., Kawanami, O., Zhao, L. X., Ban, N., Uchida, Y., Morohoshi, T., Ogawa, J., Shioda, S. & Inagaki, N. (2001). ABCA3 is a lamellar body membrane protein in human lung alveolar type II cells. FEBS Lett, 508(2), 221-225.
89. Shulenin, S., Nogee, L. M., Annilo, T., Wert, S. E., Whitsett, J. A. & Dean, M. (2004). ABCA3 gene mutations in newborns with fatal surfactant deficiency. N Engl J Med, 350(13), 1296-1303.
90. Auerbach, S. S., Dekeyser, J. G., Stoner, M. A. & Omiecinski, C. J. (2007). CAR2 displays unique ligand binding and RXRalpha heterodimerization characteristics. Drug Metab Dispos, 35(3), 428-439.
91. Qatanani, M. & Moore, D. D. (2005). CAR, the continuously advancing receptor, in drug metabolism and disease. Curr Drug Metab, 6(4), 329-339.
92. Goodwin, B., Hodgson, E., D'Costa, D. J., Robertson, G. R. & Liddle, C. (2002). Transcriptional regulation of the human CYP3A4 gene by the constitutive androstane receptor. Mol Pharmacol, 62(2), 359-365.
93. Mutoh, S., Osabe, M., Inoue, K., Moore, R., Pedersen, L., Perera, L., Rebolloso, Y., Sueyoshi, T. & Negishi, M. (2009).
Dephosphorylation of threonine 38 is required for nuclear translocation and activation of human xenobiotic receptor CAR (NR1I3). J Biol Chem, 284(50), 34785-34792.
94. Groll, N., Petrikat, T., Vetter, S., Wenz, C., Dengjel, J., Gretzmeier, C., Weiss, F., Poetz, O., Joos, T. O., Schwarz, M. & Braeuning, A. (2016). Inhibition of beta-catenin signaling by phenobarbital in hepatoma cells in vitro. Toxicology, 370, 94-105.
95. Liu, J., Liao, Z., Camden, J., Griffin, K. D., Garrad, R. C., Santiago-Perez, L. I., Gonzalez, F. A., Seye, C. I., Weisman, G. A. & Erb, L. (2004). Src homology 3 binding sites in the P2Y2 nucleotide receptor interact with Src and regulate activities of Src, proline-rich tyrosine kinase 2, and growth factor receptors. J Biol Chem, 279(9), 8212-8218.
96. Woods, L. T., Jasmer, K. J., Munoz Forti, K., Shanbhag, V. C., Camden, J. M., Erb, L., Petris, M. J. & Weisman, G. A. (2020). P2Y2 receptors mediate nucleotide-induced EGFR phosphorylation and stimulate proliferation and tumorigenesis of head and neck squamous cell carcinoma cell lines. Oral Oncol, 109, 104808.
97. Rice, W. R. & Singleton, F. M. (1986). P2-purinoceptors regulate surfactant secretion from rat isolated alveolar type II cells. Br J Pharmacol, 89(3), 485-491.
98. Rezen, T., Hafner, M., Kortagere, S., Ekins, S., Hodnik, V. & Rozman, D. (2017). Rosuvastatin and Atorvastatin Are Ligands of the Human Constitutive Androstane Receptor/Retinoid X Receptor alpha Complex. Drug Metab Dispos, 45(8), 974-976.
99. Lin, Y. C., Lin, J. H., Chou, C. W., Chang, Y. F., Yeh, S. H. & Chen, C. C. (2008). Statins increase p21 through inhibition of histone deacetylase activity and release of promoter-associated HDAC1/2. Cancer Res, 68(7), 2375-2383.
100. Archibugi, L., Arcidiacono, P. G. & Capurso, G. (2019). Statin use is associated to a reduced risk of pancreatic cancer: A meta-analysis. Dig Liver Dis, 51(1), 28-37.
101. Bolden, J. E., Peart, M. J. & Johnstone, R. W. (2006).
Anticancer activities of histone deacetylase inhibitors. Nat Rev Drug Discov, 5(9), 769-784.
102. Noguchi, Y., Tachi, T. & Teramachi, H. (2020). Subset Analysis for Screening Drug-Drug Interaction Signal Using Pharmacovigilance Database. Pharmaceutics, 12(8),

Table 1. Average training and testing accuracies for target, enzyme, and transporter prediction by the RFC models.

              Training Accuracy   Testing Accuracy
Target        99.69%              99.54%
Enzyme        98.74%              96.82%
Transporter   99.39%              98.19%

FIGURE CAPTIONS

Fig. 1.
Hierarchical classification model overview for the prediction of DDI-associated ADRs from drugs’ chemical structures through predictions of (A) TET profiles from chemical fingerprints and (B) ADRs from TET matrices of drug pairs. (A) Drugs’ chemical structures were represented with MACCS keys and used as features to predict TET profiles of drugs using a Random Forest Classifier (RFC). (B) TET profiles of a drug pair were combined into a TET matrix, which was then used as a feature to predict encoded PRRs for all ADRs in RFC, Logistic Regression (LR), and Support Vector Machine (SVM) models.

Fig. 2. Repeated 10-fold cross-validation for (A) Random Forest Classifier (RFC), (B) Logistic Regression (LR), and (C) Support Vector Machine (SVM) models.

Fig. 3. Adverse drug reactions associated with drug-drug interactions (DDIs) between levothyroxine and eptifibatide. (A) Target, enzyme, and transporter (TET) profiles of levothyroxine and eptifibatide. Y: the presence of a drug’s action on TETs. N: the absence of a drug’s action on TETs. (B) Comparisons between ΔPRRs (as DDI indices) and PRR predictions upon removal of ITGB3 from eptifibatide’s TET profile for ADRs associated with the co-administration of levothyroxine and eptifibatide. For a given ADR, ΔPRR (= the average PRR of single administrations of levothyroxine and eptifibatide – the PRR of their co-administration) was calculated, and the PRR change for co-administration of levothyroxine and eptifibatide was predicted upon alteration of the TET profile of eptifibatide for integrin β-3 (ITGB3) from Y to N. Y: Inclusion of ITGB3 in eptifibatide’s TET profile. N: Removal of ITGB3 from eptifibatide’s TET profile. **: Outside of the 99% confidence interval of the “No change” group.

Fig. 4. Adverse drug reactions associated with drug-drug interactions (DDIs) between omeprazole and clopidogrel. (A) Target, enzyme, and transporter (TET) profiles of omeprazole and clopidogrel. Y: the presence of a drug’s action on TETs.
N: the absence of a drug’s action on TETs. (B) The impacts of omeprazole’s TET profile on its DDI-associated ADRs with clopidogrel. The PRR changes for ADRs associated with co-administration of omeprazole and clopidogrel were calculated using the SVM model when each of omeprazole’s TETs was removed. (C) Comparisons between ΔPRRs (as DDI indices) and PRR predictions upon removal of CYP2C19 from omeprazole’s TET profile for ADRs associated with the co-administration of omeprazole and clopidogrel. For a given ADR, ΔPRR (= the average PRR of single administrations of omeprazole and clopidogrel – the PRR of their co-administration) was calculated, and the PRR change for co-administration of omeprazole and clopidogrel was predicted using the SVM model upon alteration of the TET profile of omeprazole for CYP2C19 from Y to N. Y: Inclusion of CYP2C19 in omeprazole’s TET profile. N: Removal of CYP2C19 from omeprazole’s TET profile. **: Outside of the 99% confidence interval of the “No change” group.

Fig. 5. Myopathy associated with drug-drug interactions (DDIs) involving atorvastatin. (A) Target, enzyme, and transporter (TET) profiles of atorvastatin and concomitant drugs, such as ramipril and warfarin. Y: the presence of a drug’s action on TETs. N: the absence of a drug’s action on TETs. (B) The impacts of atorvastatin’s TET profile on its DDI-associated ADR of myopathy with various concomitant drugs.
The PRR changes for myopathy associated with co-administration of atorvastatin and other drugs were calculated using the SVM model when each of atorvastatin’s TETs was removed. (C) Comparisons between ΔPRRs (as DDI indices) and PRR predictions upon removal of CYP3A4 from atorvastatin’s TET profile for myopathy associated with the co-administration of atorvastatin and a concomitant drug. For myopathy, ΔPRR (= the average PRR of single administrations of atorvastatin and a concomitant drug – the PRR of their co-administration) was calculated, and PRR changes for the co-administration of atorvastatin and the drug were predicted using the SVM model upon alteration of the TET profile of atorvastatin for cytochrome P450 3A4 (CYP3A4) from Y to N. Y: Inclusion of CYP3A4 in atorvastatin’s TET profile. N: Removal of CYP3A4 from atorvastatin’s TET profile. **: Outside of the 99% confidence interval of the “No change” group.

Fig. 6. Drug-drug interactions between atorvastatin and concomitant drugs for interstitial lung disease (ILD). For ILD, ΔPRR (= the average PRR of single administrations of atorvastatin and the concomitant drug – the PRR of their co-administration) was calculated and plotted against the PRR for their co-administration.

Fig. 7. Interstitial lung disease (ILD) associated with drug-drug interactions (DDIs) involving atorvastatin. (A) The impacts of atorvastatin’s TET profile on its DDI-associated ILD with various concomitant drugs.
The PRR changes for ILD associated with the co-administration of atorvastatin and other drugs were calculated using the SVM model when each of atorvastatin’s TETs was removed. (B) The number of drug pairs containing atorvastatin with a predicted increase in PRRs for ILDs when each of atorvastatin’s targets was removed. (C) Comparisons between ΔPRRs (as DDI indices) and PRR predictions upon removal of (A) NR1I3 and (B) ABCB11 from atorvastatin’s TET profile for ILD associated with the co-administration of atorvastatin and a concomitant drug. For ILD, ΔPRR (the average PRR of single administrations of atorvastatin and a concomitant drug minus the PRR of their co-administration) was calculated, and the PRR changes for the co-administration of atorvastatin and the drug were predicted using the SVM model upon alteration of the TET profile of atorvastatin for (A) NR1I3 and (B) ABCB11 from Y to N. Y: Inclusion of (A) NR1I3 and (B) ABCB11 in atorvastatin’s TET profile. N: Removal of (A) NR1I3 and (B) ABCB11 from atorvastatin’s TET profile. **: Outside of the 99% confidence interval of the “No change” group.
Fig. 8. Pathway analyses for the enhanced risk of ILD associated with DDIs involving atorvastatin, created around (A) AHR and NR1I3 and (B) HDAC2. Interactions among genes/proteins were determined using an array of bioinformatics databases, including BioGRID, ProteomicsDB, STRING, CORUM and Reactome.
[Figs. 1 to 8: figure images, one per page]
KEYWORDS adverse drug reactions; drug-drug interaction; drug safety; hierarchical classification; machine learning; prediction; metabolizing enzyme; target; transporter.
10_1101-2021_02_10_430563 ---- SPARC Data Structure: Rationale and Design of a FAIR Standard for Biomedical Research Data
Anita Bandrowski^a, Jeffrey S. Grethe^a, Anna Pilko^a, Tom Gillespie^a, Gabi Pine^a, Bhavesh Patel^b, Monique Surles-Zeigler^a, and Maryann E. Martone^a,*
^a University of California, San Diego, CA
^b California Medical Innovations Institute, San Diego, CA
* Correspondence should be addressed to mmartone@ucsd.edu
Abstract
The NIH Common Fund’s Stimulating Peripheral Activity to Relieve Conditions (SPARC) initiative is a large-scale program that seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. Integral to the SPARC program are the rich anatomical and functional datasets produced by investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. These datasets are provided to the research community through an open data platform, the SPARC Portal. To ensure SPARC datasets are Findable, Accessible, Interoperable and Reusable (FAIR), they are all submitted to the SPARC Portal following a standard scheme established by the SPARC Curation Team, called the SPARC Data Structure (SDS). Inspired by the Brain Imaging Data Structure (BIDS), the SDS has been designed to capture the large variety of data generated by SPARC investigators, who come from all fields of biomedical research.
Here we present the rationale and design of the SDS, including a description of the SPARC curation process and the automated tools for complying with the SDS, including the SDS validator and Software to Organize Data Automatically (SODA) for SPARC. The objective is to provide detailed guidelines for anyone desiring to comply with the SDS. Since the SDS is suitable for any type of biomedical research data, it can be adopted by any group desiring to follow the FAIR data principles for managing their data, even outside of the SPARC consortium. Finally, this manuscript provides a foundational framework that can be used by any organization desiring either to adapt the SDS to suit the specific needs of their data or to design their own FAIR data sharing scheme from scratch.
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430563; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
The NIH Common Fund’s SPARC project, Stimulating Peripheral Activity to Relieve Conditions, is a large-scale program whose mission is to map the peripheral nervous system across multiple species and improve our understanding of nerve-organ interactions. SPARC achieves this aim by providing access to high-value datasets, maps, and computational studies in support of bioelectronic medicine. Bioelectronic medicine is defined as “...the convergence of molecular medicine, neuroscience, engineering and computing to develop devices to diagnose and treat diseases” [1].
Integral to the SPARC program are the rich anatomical and functional datasets produced by investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. These datasets are provided to the research community through an open data platform, the SPARC Portal, available at sparc.science. SPARC is also developing new tools and technologies to support modeling and simulation of nerve-end organ interactions. The data produced by the SPARC project is highly heterogeneous, deriving from multiple species, spatial and temporal scales, and anatomical, physiological and molecular techniques. To ensure that SPARC data adhere to the principles for making data Findable, Accessible, Interoperable and Reusable (FAIR) [2], the SPARC curation team is charged with identifying and implementing community standards and annotating SPARC data with rich metadata. Standards are integral to FAIR because they make it easier to combine data across datasets, ensure that necessary metadata is provided, and make it possible to write automated tools to promote reuse of data. Community standards are either adopted from other domains or developed by SPARC to serve its needs. To date, SPARC has been curating data to two primary standards developed by the SPARC consortium: 1) the Minimal Information Specification (MIS), a semantic metadata scheme capturing key experimental and dataset details; 2) the SPARC Dataset Structure (SDS), a file and metadata organizational scheme based on the Brain Imaging Data Structure (BIDS), developed by the neuroimaging community [3]. SPARC investigators are required to organize their data files and metadata according to the SDS; SPARC curators then align the submitted metadata and file pointers to the MIS using automated and semi-automated workflows.
In this paper, we explain the rationale behind the design of the SDS and give a detailed description of the associated guidelines. This provides a full overview and instructions for anyone wanting to follow these FAIR data standards for any field of biomedical research. Because it is agnostic to data type, the SDS may be useful for fields where FAIR data standards are yet to be established. We also present automated validation and curation tools that have been developed for SPARC, which could facilitate use of the SDS beyond SPARC. This paper also provides a foundational framework that could be used for adapting the SDS to suit the specific needs of data from a particular field of research.
2. Overview of SPARC Curation Process
Data and curation services and infrastructure for SPARC are provided by the SPARC Data and Resource Center. Currently, SPARC data is uploaded to the Blackfynn data platform [4], which provides a private, password-protected space for researchers to store and organize their data. Data are uploaded by individual investigators in the SPARC consortium according to timelines and milestones negotiated with the US National Institutes of Health (NIH). Investigators are required to upload their data within 30 days of completing a particular milestone. Each batch of data uploaded to complete a milestone is considered a SPARC dataset. Investigators are given instructions and templates for organizing their data according to the SDS and are expected to upload their data in this format. Once uploaded, data are curated by SPARC curators, who review them for compliance with the SDS, completeness of data and metadata, and overall quality. Certain types of data, e.g., 2D and 3D images, undergo spatial registration using the TissueMaker software developed by MBF Biosciences, with organ-specific 3D scaffolds and data visualizations being created by the Auckland Bioengineering Institute (ABI). A more detailed curation workflow is described in Section 7.
When complete, a dataset in SPARC comprises the following:
1. Data files uploaded to the Blackfynn platform, organized according to the SPARC Data Structure, including all required metadata
2. A complete, detailed experimental protocol in Protocols.io describing any procedures used to obtain the uploaded data
3. If applicable, a set of fiducial markups of 2D images for spatial registration of images to scaffolds, and image files converted to required formats (performed by MBF Biosciences)
4. If applicable, data registered to 3D spatial scaffolds, which includes creating visualizations of certain types of data, e.g., RNAseq (performed by ABI)
5. A set of curator’s notes that accompanies the data files to summarize key parameters of the dataset
In this paper, we outline the rationale and structure of the SDS and some of the tooling that has been developed to support it. A separate paper will be prepared for the MIS.
3. Development of the SDS
To capture data across diverse types of biological data, the SPARC Consortium has adopted the Brain Imaging Data Structure (BIDS, RRID:SCR_016124) format for research objects as a foundation for the SDS (see Fig. 1). The BIDS format is a simple file folder organization and metadata scheme. At the top level, the BIDS format functions as a series of folders representing a dataset, consisting of a set of specified files and subfolders containing different types of metadata and data.
3.1. Rationale
Formal data structures like BIDS aim to increase the integrity of scientific research through the active encouragement and facilitation of FAIR. “Findability” is improved when the names of organisms and organs are standardized to established community ontologies. “Accessibility” and “Interoperability” are improved as files are organized in more predictable locations across different datasets and when they use common and open formats, such as csv or tiff. “Reusability” is improved by ensuring that all contributed data are well annotated, conform to community standards (e.g., minimal information models, when such are available), and are made available under a clear license. For SPARC, all datasets are released under the CC-BY-4.0 license. BIDS was deliberately and carefully designed to complement likely research practices in the laboratory to ensure accurate capture of complex imaging experiments. Towards this end, BIDS can be used by laboratories with minimal bioinformatics experience or support to manage, exchange, and submit well-annotated data in a human- and machine-readable format. The resulting structure is sufficiently standardized to support the creation of validation code (e.g., BIDS Validator, RRID:SCR_017255). The BIDS validator is an application that checks for the presence of required files and the completion of required fields within those files. The BIDS format and BIDS validation code are already used in several repositories that store imaging data, including OpenNeuro.org (RRID:SCR_005031). BIDS was developed and refined over many years, through many meetings and by many contributors. This standard has become relatively well accepted in the neuroimaging community as a means to package and describe neuroimaging studies and has been endorsed by the International Neuroinformatics Coordinating Facility (INCF) through its standards review process [5].
The curation team joined SPARC in 2018, just as the first deadlines for data submission by consortium members were approaching. Based on the recent INCF endorsement of BIDS, we recommended the project adopt a modification of BIDS as an initial effort to coordinate data across different laboratories. Although BIDS was developed originally for neuroimaging, its basic structure is adaptable to various experimental paradigms. Because of the diversity of data in SPARC, the large number of files and the complex structure of the datasets, we felt that without a consistent structure, data in SPARC would be very difficult for end users to work with and very difficult to curate, as each dataset would be organized and documented differently. As BIDS had already gone through multiple rounds of community review, including the independent INCF review, and is a recognized standard for the OpenNeuro Data Archive supported by the US BRAIN Initiative, we felt confident that it provided a solid foundation for SPARC in the early stages of data sharing.
3.2. SDS overview
The BIDS structure was modified to remove neuroimaging-specific aspects and to accommodate the fact that most data in SPARC are derived from animals and animal tissue. Thus, unlike in non-invasive neuroimaging studies, data may be acquired at the subject level, e.g., in vivo physiological recordings, or at the specimen level (from an ex vivo tissue specimen or in vitro cell culture) (Fig. 2). The proposed modifications to BIDS were accepted by the SPARC Data Standards Committee, and we moved forward with working with investigators to organize their data according to the SPARC Dataset Structure. Version 1.0 was put in place to organize the first data submitted from January 2019 to July 2019, in anticipation of the debut of the SPARC data portal at the 11th Congress of the International Society for Autonomic Neuroscience (ISAN 2019). The overall structure is shown in Fig. 2B.
It defined a set of high-level folders, including one for Subjects and one for Specimens, and included various spreadsheets into which investigators could enter metadata for the dataset as a whole (dataset_description), subjects and samples. Note that the file format chosen for these spreadsheets is .xlsx, rather than an open format like .tsv or .csv. Although .csv is the preferred file format for tabular data in SPARC, the curation team wanted to make data entry easier for both investigators and curators by including features such as drop-down value sets for certain metadata fields, which are not supported by these basic formats. In addition, the Blackfynn data platform did not have a viewer available for .csv files, but did support on-line viewing of .xlsx through the Microsoft Open Office suite. As with BIDS, the SDS follows the inheritance principle, which requires any metadata files in the root directory to apply to all folders and files below it, except when explicitly overridden by a metadata file contained in a lower-order folder. After a review of datasets submitted for ISAN and interviews with investigators, the curation team simplified the folder structure (Fig. 2C), collapsing the subject and samples folders into a single folder named primary. Samples may now be nested under their respective subjects. The current release is version 1.2 (Fig. 2C). The required folders and files are provided to investigators as a downloadable versioned template via GitHub (https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-1.2.3). All datasets are now curated according to version 1.2, including those that were released for the ISAN 2019 meeting. Although the SDS is modeled on the approach taken by BIDS, i.e., file folder organization, naming scheme and provision of critical metadata, it is sufficiently distinct from BIDS that we do not consider it an extension, but rather a derivative (see Fig. 2 for a comparison between BIDS and SDS). We now describe SDS V1.2 in more detail.
Figure 1. Transformation between DICOM and BIDS [3].
Figure 2. A comparison of high-level details of BIDS (A), SDS 1.0 (B) and SDS 1.2 (C).
4. SPARC Data Structure V1.2
The SPARC Dataset Structure includes the following components (Fig. 3):
• A set of organized data files in a hierarchical set of predictably named folders and subfolders. Folders/subfolders may contain supplementary and additional documentation, i.e. manifest files that describe the files and/or folders contained therein.
• A set of descriptive top-level files that contain information on subjects, experimental information, and dataset descriptions. These descriptive files include both spreadsheets containing structured metadata and text files with additional information.
• A set of file manifests associated with each folder that provides descriptions of the contents.
4.1. Top-level structure
Data files are organized into 3 different top-level folders, depending on the type of data:
• primary: a required, dataset-dependent folder that contains all folders and files for experimental subjects and/or samples, e.g., time-series data, tabular data, clinical imaging data, genomic, metabolomic, microscopy data. The data generally have been minimally processed so they are in a form ready for analysis.
Within the primary folder, data is organized by subjects or samples (see Section 5). All subjects and samples will have a unique folder with a standardized name corresponding to the exact names or IDs as referenced in the subjects and samples metadata files (see Fig. 3).
• source: an optional folder containing unaltered, raw files from an experiment, if they are included in the data. For example, this folder may include the “truly” raw k-space data for a Magnetic Resonance (MR) image that has not yet been reconstructed, or a set of microscopic images that have not yet been assembled into a mosaic. The reconstructed DICOM or NIFTI files and the image mosaic, for example, would be found within the primary folder.
• derivative: a required folder if derivative data exists. This folder contains derived data files, for example, processed image stacks that are annotated via the MicroBrightField (MBF Biosciences) tools, segmentation files, or smoothed overlays of current and voltage that demonstrate a particular effect. If files are converted into a format other than what was submitted, these files are included in the derivative folder. Derived data should be organized into subject and sample folders, using the subject and sample IDs as the folder names, as with the primary data.
Other files are organized in three different (optional) folders:
• code: a folder required only if code is used in generation of the data; the folder contains all the source code used in the study, e.g., MATLAB.
• protocol: an optional folder that contains supplementary files to accompany the experimental protocols submitted to Protocols.io. The additional files in this folder are not a substitute for the experimental protocol, which should have been submitted to Protocols.io/sparc.
• docs: an optional folder that contains all the supporting documents for the dataset, including but not limited to, a representative image for the dataset.
Unlike the readme file, which is necessarily a text document, docs can contain documents in multiple formats, including images.
Figure 3. The organization structure of the files and folders for a SPARC dataset.
4.2. Descriptive top-level files
A set of descriptive, top-level files contain information on subjects, samples, dataset descriptions and administrative data. These files contain required metadata fields that are aligned to the DataCite schema (dataset description) and the HBP’s Minimal Information about a Neuroscience Data Set [6] (subjects and samples). Additional recommended fields are included for each (see Supplementary Material A). Investigators are encouraged to add columns beyond this core set to thoroughly describe the dataset. While there is a great deal of flexibility built into the metadata templates in order to accommodate the diversity of experimental paradigms and data, for the effective functioning of the validator (described in Section 6) it is important that data wranglers do not add, edit or delete required columns in the mandatory descriptive files (these are color-coded; see green and blue in Appendix A).
If there is information that doesn't correspond with available columns, the information should be added to a new column on the right-hand side (subjects and samples) or a new row at the bottom of the sheet (dataset description). If there is information not available to the researcher at the time of submission, fields should be left empty or marked “unknown”. An overview of the spreadsheet metadata templates is provided below:
• dataset_description (xlsx, csv or json): Required file containing basic metadata about a dataset, derived largely from the DataCite Schema [7]. A full list of metadata and definitions is provided in Supplementary Table A1. Investigators provide basic metadata such as title, description, contributors, funding and contact person, which provide provenance for the dataset and also support formal data citation. The version 1.2 release includes an additional field specifying the metadata version. This field is not to be changed by data submitters. It allows proper alignment between different metadata releases, securing data integrity across multiple batches of submissions. We also encourage researchers to state whether they plan to submit more data for new subjects or for the same subjects, i.e., whether this dataset is part of a larger study. This will help determine when all the primary data has been deposited and help with mapping across the different parts of the dataset.
• submission (xlsx, csv or json): Required file containing information relevant to internal SPARC bookkeeping, relating milestones negotiated with NIH to datasets submitted. According to the SPARC Material Sharing Policy [8], data is to be deposited within 30 days of milestone completion and will become public no later than 1 year after milestone completion. This file is for internal use only; it must not be released when the data are published.
• subjects (xlsx, csv or json): Required file if subjects are used in the experiment producing the dataset.
Contains required and optional metadata fields providing information about subjects (model organisms or animals) involved in data collection. The file contains fields specifying provenance for the subject, e.g., subject_id, pool_id and experimental group (blue fields in Appendix A2). Each subject, and each pool of subjects, must be assigned a unique ID, as this ID is used to name the data folders for individual subjects. For proper mapping of the data, the names of folders containing experimental data need to exactly match the subject ID. All subject identifiers must be unique within a dataset and must not contain any sensitive, identifiable information (for human subjects). Having each lab use unique subject identifiers across datasets is highly desirable to aid in connecting multiple experiments using the same subjects. In the future, we plan to connect subjects across datasets and projects; however, we currently do not map subjects across multiple data submissions. The subjects.xlsx file contains several mandatory fields (green in Appendix A1), including species, age, strain and Research Resource Identifier (RRID). Additional columns containing descriptive metadata, demographic assessment data, etc., largely derived from OpenMINDS, are provided for investigators in the template. In the download template, these are highlighted in yellow (see Appendix A1) and serve as exemplars of the types of metadata that are important for providing scientific context. According to the FAIR principles, data should be described by a “plurality of relevant attributes”, but we leave it to the investigators’ discretion to decide what is sufficient for others to understand and reuse the data. Investigators are free to add as many fields as they deem necessary. Currently, all metadata for subjects and samples is provided in free text, which is then mapped to the SPARC vocabularies by the curation team (see Section 6.1).
However, we are actively working with investigators on lists of controlled vocabularies for certain fields.
• samples (xlsx, csv or json): Conditional file, required if measurements are obtained from samples, e.g., tissue slices, derived from individual or pooled subjects. This file contains information about samples used to generate the data. Investigators must provide a unique ID for each sample, which will be used to name the data folders. The sample ID must match the folder name exactly. Each sample should also reference a subject from the subjects file; a single subject (a research animal/donor) may be linked to multiple biological samples derived from that subject. If the samples are pooled from multiple subjects, the complete provenance must be specified in the subjects file. The metadata present in the samples file should also explicitly note whether a sample was collected directly or was derived from another sample. Required metadata includes the subject or tissue from which the sample was derived and the anatomical location (green in Appendix A3). Additional fields may be added by the investigator. The template provides some suggested fields derived from the Minimal Information about a Neuroscience Dataset (OpenMINDS). Investigators should only use columns that are relevant to their type of study.
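The requirement that subject and sample IDs match folder names exactly lends itself to an automated check. The following is a minimal sketch, not the SPARC validator: it assumes the metadata are saved as `subjects.csv` and `samples.csv` at the dataset root, that the sample ID column is called `sample_id` (a hypothetical name; `subject_id` is the field named in the text), and that sample folders are nested under their subject's folder inside `primary`.

```python
from __future__ import annotations

import csv
from pathlib import Path


def read_rows(metadata_csv: Path) -> list[dict[str, str]]:
    """Read a metadata table saved as CSV into a list of row dictionaries."""
    with metadata_csv.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def check_primary_folders(dataset_root: Path) -> dict[str, set[str]]:
    """Compare IDs in the metadata files with folder names under primary/.

    Assumed layout (illustrative only): primary/<subject_id>/<sample_id>/
    """
    primary = dataset_root / "primary"
    subject_ids = {
        row["subject_id"].strip() for row in read_rows(dataset_root / "subjects.csv")
    }
    subject_folders = {p.name for p in primary.iterdir() if p.is_dir()}

    # The samples file is conditional, so only check it when present.
    samples_without_folder: set[str] = set()
    samples_csv = dataset_root / "samples.csv"
    if samples_csv.exists():
        for row in read_rows(samples_csv):
            sample_id = row["sample_id"].strip()  # hypothetical column name
            sample_dir = primary / row["subject_id"].strip() / sample_id
            if not sample_dir.is_dir():
                samples_without_folder.add(sample_id)

    return {
        "subjects_without_folder": subject_ids - subject_folders,
        "folders_without_subject": subject_folders - subject_ids,
        "samples_without_folder": samples_without_folder,
    }
```

An empty set under every key means the folder names and the metadata agree. Because the comparison is an exact string match, differences in case or stray characters are reported as mismatches, which mirrors the exact-match rule stated above.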
An overview of the descriptive text files is provided below:
• README (txt): Required file provided by Investigators that contains necessary details for reuse of the data, beyond that which is captured in structured metadata. Some information that should be included:
o How would a user use the files that are provided? E.g., first open file X and then look at file Y.
o What additional details do they need to know? Are some subjects missing data?
o Are there warnings about how to use the data or code?
o Are there appropriate/inappropriate uses for this data?
o Are there other places that users can go for more information? E.g., did you provide a GitHub repository or are there additional papers beyond what was provided in the metadata form?
• Plog (xlsx or txt): Optional performance log file, which can be used to attach information about individual performances of the experiment, e.g., how long they took, what the average room temperature was, or who performed them. There is currently no other place in the data model to attach that kind of information.
• CHANGES (txt): Conditional file required if a new version of the dataset is uploaded to document any changes from the previous version.
4.3. Manifests
Manifest (xlsx, csv or json): Required file that must be in all folders containing data files (Fig. 3, Fig. 4) [9] and in folders with subfolders whose meaning is not clear. This file contains information and metadata about the files and folders that are expected in the folder where they sit. Required fields include file name (or file name pattern for folders with many related files), description and file type, although investigators have a lot of flexibility by adding additional columns, including notes about pertinent aspects of each file that differentiate the files (e.g., data collection specific protocol, stimulation condition, microscope filter applied, drug applied, etc.).
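A file name pattern in a manifest row can be thought of as a wildcard that stands for many related files. The sketch below is only an illustration under stated assumptions: the SDS text does not define the pattern syntax, so `?` is assumed here to match exactly one character (as in shell-style globbing), and the helper name and file names are hypothetical.

```python
from fnmatch import fnmatchcase


def files_for_manifest_row(pattern, file_names):
    """Return the file names covered by one manifest row.

    Assumption: "?" matches exactly one character, shell-style.
    fnmatchcase is used so matching stays case-sensitive on every platform.
    """
    return sorted(name for name in file_names if fnmatchcase(name, pattern))


# Hypothetical folder listing, for illustration only.
listing = [
    "sub01-task1-run01",
    "sub01-task1-run02",
    "sub01-task2-run01",
    "sub1-task1-run01",
]
task1_files = files_for_manifest_row("sub??-task1-run??", listing)
```

With this reading, a single row with the pattern `sub??-task1-run??` covers every task1 run for two-character subject numbers, while files for other tasks or with differently shaped names need their own rows.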
The manifest file can list specific files or apply to collections of files through the use of a file pattern (e.g., sub??-task1-run?? can specify all the files related to task1 in the protocol). If investigators include folders that organize data along a particular dimension, e.g., datatype or time point, a manifest file should be generated that describes the content of the folders.
5. Folder hierarchy principles
The folders and files pictured in Fig. 3 are required and invariant for each SPARC dataset. This invariance imposes a standard structure for SPARC datasets that allows a user to reliably navigate the often complex experimental data (Fig. 5). However, given the variety of different experimental protocols and the way in which subjects and samples are treated across different types of experiments, the folder and file structure can vary among different datasets within the primary data folder. Some examples are provided with the template download (Fig. 6). For the majority of SPARC datasets, data in the primary data folder are organized into subject folders, with the folder names corresponding to the subject IDs provided in the subjects.xlsx file. If samples are derived from these subjects, data files are organized within sample subfolders under the appropriate subject, according to Pattern 1. The inheritance principle applies, so that if sample 3 (sam-3) appears as a subfolder of subject 1 (sub-1), then it is assumed that sample 3 was obtained from subject 1 (Fig. 5). In some cases, no data may be derived from the subjects directly, i.e., no data files are generated at the subject level. In this case, investigators could omit the subject folder (although the subjects.xlsx file must still be included to provide the appropriate metadata).
Figure 4. Example of a complete manifest. From Morris et al. (2020).
Figure 5. Relationships between metadata files and folder structure. Example taken from Morris et al. (2020).
Time series data: For functional studies where measurements are obtained at different time points, the data for the different time points should be organized into folders labeled perf-1, perf-2, etc., where the numbering indicates the temporal ordering, under either the sample or subject folder. An example is shown in Fig. 6 (Pattern 2). Note that in this case the manifest file would specify information about the data that will be found in the perf-1 and perf-2 folders.
Pooled samples or subjects: Although the majority of data are organized with subjects nested under the primary data folder and samples nested under subjects, this simple hierarchical arrangement does not apply to all datasets. In some cases, samples may be pooled from multiple subjects, in which case the sample folder sits alongside the subject folder rather than nested within it, according to Pattern 3. The SDS also accommodates subject pools, where the samples folder is replaced with the pooled folder (Pattern 3; Fig. 6). Note that pool_ids and characteristics must be provided in the subjects.xlsx file.
6. Tooling to support SPARC Dataset Structure
6.1. SDS Validator
To enforce the SDS structure and required metadata fields, the curation team developed a SPARC Dataset Structure validator [10] that is used for frequent checks to ensure the integrity of the data across the platform and provide valuable feedback to the curation team.
The validator is written in Python and uses JSON Schema [11] to specify the expected structure of the dataset files and folders, as well as the structure and contents of the 4 types of metadata files (dataset_description, subjects/samples, submission, and manifest). Tabular metadata files are transformed into JSON and validated against the schemas. The validator first checks that all the required metadata files are present, after which the content of the individual metadata files is validated. For example, in the subjects.xlsx file, checks are performed to ensure that all subject IDs are unique, that no text values appear in columns that expect numbers (e.g., 'adult' in the 'age' column is an error), and that the files in the primary data folder match the names and number of subjects and samples provided in the metadata files. The validator also checks that organism and anatomical entities are present in the appropriate columns, by matching the content of these columns against the SPARC vocabularies (see section 6.1). This is not an exhaustive list of the checks that are performed, but it gives a flavor of the types of checks that are done (Fig. 7). Some of the most common mistakes detected by the validator arise when investigators remove headers or cells for which they do not have information. The validator then looks for information in the wrong place. For example, if species information is expected in cell F2 (Fig. 5) but the investigator deleted column E, the strain information will land in the species field. This produces an error because C57BL/6J is not a valid species name in the ontology, and all other information to the right of the deleted column or below the deleted row will also be incorrect. Errors are noted per dataset and categorized by type so that curators can act on them. For simple alignment errors, the curators usually replace the affected files by pasting misaligned data into a fresh template.
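The kinds of checks described above can be sketched in a few lines. This is not the SPARC validator, which relies on JSON Schema; it is a simplified plain-Python stand-in, and the function name, the dictionary keys `subject_id` and `age`, the error wording, and the assumption that age is a bare number are all illustrative choices.

```python
def validate_subjects(rows, primary_folders):
    """Run three illustrative checks on subjects metadata.

    rows            -- dicts with assumed keys "subject_id" and "age"
    primary_folders -- subject folder names found in the primary data folder
    """
    errors = []
    ids = [row.get("subject_id", "") for row in rows]

    # 1. Subject IDs must be unique.
    for subject_id in sorted({i for i in ids if ids.count(i) > 1}):
        errors.append(f"subject_id not unique: {subject_id}")

    # 2. Columns that expect numbers must not contain names such as "adult".
    for row in rows:
        age = str(row.get("age", "")).strip()
        try:
            float(age)
        except ValueError:
            errors.append(f"{row.get('subject_id')}: age {age!r} is not a number")

    # 3. Folder names must match the subjects listed in the metadata.
    for subject_id in sorted(set(ids) - set(primary_folders)):
        errors.append(f"no data folder for subject: {subject_id}")
    for folder in sorted(set(primary_folders) - set(ids)):
        errors.append(f"folder without subject metadata: {folder}")
    return errors
```

Collecting every error in a list, rather than stopping at the first one, mirrors the way errors are reported per dataset so that curators can review them together.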
With the newly released data organization tool, SODA (see Section 6.2), these sorts of errors will be less of a problem because at least some of the metadata files will be replaced by a form that asks investigators questions and produces a properly formatted file. Validation is run automatically on each dataset, but is only meaningful for datasets that are undergoing curation, where these errors are read and acted upon. While the data are being prepared by the investigator for submission, it is not uncommon for datasets to have very large numbers of errors, as none of the files may be in the right location and metadata fields may be incomplete. The complete curation workflow is described in Section 7.
Figure 6. Dataset-template 1.2.3 folder hierarchy.
Figure 7. Workflow for the SDS Data validator.
In addition to running the validation of the required metadata, the validation code also extracts metadata from the SDS and maps them to the MIS. During this process, certain metadata fields, e.g., anatomical structure, are mapped to the NIF Standard Ontology [12] (RRID:SCR_005414), which in turn imports multiple community ontologies such as NCBI Taxonomy, UBERON and ChEBI. Additional ontologies, e.g., FMA, are used as necessary. A list of identifiers used to map SPARC data is provided in Table 1. The validator produces a set of files from the contents of the required metadata files, the ontologies and other data sources such as protocols.io.
These are made available in several formats, including the MIS “ttl” file (also JSON and CSV), to Blackfynn, curation systems, and the DRC staff. With each run of the validator code, the metadata in the ttl file will therefore change to reflect the current state of the dataset. The curation team has created a private searchable and sortable table using UCSD’s SciCrunch.org infrastructure, https://scicrunch.org/sparc, which allows curators to quickly see which elements are missing in each dataset and determine whether the error can be fixed by curators or whether the investigator needs to help resolve issues.
6.2. Software to Organize Data Automatically (SODA) for SPARC
Complying with the above-described guidelines requires additional time investment from the researchers (as with any data curation and submission standards), and the data curation process can become progressively overwhelming as additional data is submitted. If researchers are not currently using any standard way of organizing their data within the laboratory, in the long run, this work will benefit the laboratory. However, if researchers already use a formal method for organizing their data, complying with SPARC requirements could prove even more burdensome as they must organize their data according to additional rules. To address this issue, a tool named Software to Organize Data Automatically (SODA) for SPARC has been developed to assist SPARC investigators in easily curating and annotating their datasets. Distributed as an open-source (MIT license) and cross-platform (Windows, macOS, Linux) desktop application, SODA for SPARC aims to bridge a long-standing, overlooked gap between comprehensive data standards and their convenient application by researchers.
SODA for SPARC provides an interactive interface that, without requiring any coding knowledge, walks SPARC investigators step-by-step through the SPARC data curation process, all the while automating repetitive, complex, and time-consuming tasks. Besides being time-efficient, SODA for SPARC also lets SPARC investigators organize their datasets following a custom workflow (e.g., based on personal preferences or to comply with internal guidelines applicable in their labs) and rapidly organize their data according to the SDS only when they are ready to submit the dataset for review by the SPARC Curation Team. The SODA for SPARC installers as well as the source code are accessible via the dedicated GitHub repository [13]. During the first phase of development (May 2019-August 2020), the following features were integrated into SODA for SPARC (Fig. 8):
1. Prepare submission and dataset_description metadata files through an intuitive interface and with assistance from the program, which provides access to standard values/terminologies and makes automated suggestions based on previously saved information.
2. Prepare datasets step-by-step via a convenient interface:
• Specify desired local data files to be included in each of the SPARC folders.
• Specify metadata files to be attached.
• Request manifest files to be generated automatically.
• Check that information provided during the previous steps will generate a SPARC-approved dataset using an automated validator (before a thorough validation by the SPARC Curation Team).
• Generate a dataset based on information specified during the previous steps either locally or directly on the Blackfynn platform (to avoid duplicating files on the user’s computer).
3. Manage datasets by connecting to Blackfynn with SODA for SPARC and then conveniently creating datasets, adding metadata to Blackfynn datasets, managing dataset permissions, uploading local files/folders, and sharing datasets with the SPARC Curation Team for review.
Table 1. List of ontologies and controlled vocabularies used to map SPARC metadata.
Entity | Identifier system / Controlled vocabularies
Author | ORCID
Contributor roles | DataCite
Species | NCBI Taxonomy
Strains | RRIDs
Antibodies | RRIDs
Cell Lines | RRIDs
Software tools and instruments | RRIDs
Anatomical structures | UBERON and FMA
Small molecules | ChEBI
Techniques | NIFSTD
Experimental modalities | Controlled list (see Appendix B, Table B1)
Diseases or conditions | MONDO or Disease Ontology
During the second phase of development (starting September 2020), more features are being added to the software, including a virtual interface for organizing data, support for collaborative data curation, assistance for preparing samples and subjects metadata files, and file-level curation support. The user interface is also being upgraded to make use of the software more intuitive. A screenshot of the user interface from the current version (3.0.1) is provided in Fig. 9. A team of 10 beta testers, all of whom receive funding from the SPARC program, is reviewing and providing feedback frequently to ensure that SODA for SPARC meets the needs of the SPARC investigators.
Preliminary testing by the beta testers has shown that computer-assisted curation with SODA not only reduces the time required by investigators to organize and submit their data, but also minimizes human errors [14]. More features will be included in the future to further enhance the curation workflow and ensure that SPARC datasets are disseminated efficiently. Even beyond the SPARC consortium, quality data curation is a critical concern. SODA for SPARC could impact the broader research community by providing an exemplar, foundational tool for convenient and time-efficient data curation, which could then be adopted by other projects. In the future, we expect to modify the BIDS-inspired SPARC SDS for computational studies that are undertaken as part of the SPARC project (the changes as they currently stand are in a draft version and will need to be approved by the data sharing committee before being acted on); this will likely involve corresponding changes to the SODA for SPARC tool.
Figure 8. Overview of the major features included in SODA during the first development phase.
Figure 9. User interface (on a Windows computer) from version 3.0.1 of SODA for SPARC.
7. SPARC data submission workflow
All Investigators in SPARC have 1 year from the time a milestone is completed (Fig. 10) and a draft dataset is submitted (step 1) to publish the resulting dataset (step 4). A dataset is published when it has been assigned a digital object identifier (DOI) and is available for viewing and download by the public. During that year, the dataset will move through several curatorial stages and possibly an embargo period. Investigators will have 30 days from the completion of a milestone to formally submit their data to the SPARC Data Repository. Data are considered completely submitted only when they are shared with the Data Curation Team. Once curation is complete, the dataset moves into an embargo phase or is published. During the embargo phase, the dataset is visible only to members of the SPARC consortium who have signed a data use agreement. The submission, curation and embargo periods add up to 1 year; that is, the length of the embargo period depends on how long it takes to curate the data to the above standards. Curation is a collaborative process that involves a back and forth between the investigator and the curation team, and so the time to completion is difficult to predict. However, if investigators wish to publish before the end of the embargo period, they are encouraged to do so. Creating a SPARC dataset in the SDS structure involves multiple steps (instructions for creating a dataset with detailed steps can be found at https://sparc.science/help/7k8nEPuw3FjOq2HuS8OVsd):
• INVESTIGATOR: Create and name a draft dataset in their private space on the SPARC Data Repository hosted by Blackfynn (within the “SPARC Consortium” organization on Blackfynn).
• INVESTIGATOR: Organize and upload files to the dataset within this space according to the requirements of the SPARC Data Structure, using the template provided by SPARC.
• INVESTIGATOR: Request a publication review. This step initiates the curation process and locks the dataset so that changes can only be made by the curation team.
• INVESTIGATOR: Upload the experimental protocol to Protocols.io and share this protocol with the SPARC group.
• SCRIPT: Downloads all SPARC data from Blackfynn.
• CURATOR: Logs all SPARC datasets into the master spreadsheet, along with their status and any communication tickets with the investigator.
• SCRIPT: Runs weekly to find new datasets by matching the dataset IDs in the data dump with those on the master spreadsheet.
• CURATOR: Sends an email acknowledgment within 5 working days when a new dataset is detected.
• SCRIPT: Runs all datasets through the validator.
• CURATOR and INVESTIGATOR: Curators work with the automated validator report and investigators to ensure that required fields are complete and the folder structure is appropriate.
• CURATOR: Finds MIS data elements in the protocol using semi-automated tools, adding these to the structured metadata package that will be sent back to Blackfynn as a .ttl file.
• CURATOR: Hands off image datasets to MBF Biosciences curators for segmentation assistance, spatial registration and conversion to SPARC-approved formats, and to transform the banner image.
• CURATOR: If genetic or physiology data are present, hands off the data to the Auckland curation team to create appropriate data visualizations for those data types.
• CURATOR: Finalizes the dataset within Blackfynn, adding the finalized description once the data are aligned to the SPARC standards and annotated and sign-off is received from the MBF Biosciences and Auckland teams, adding license information and provisioning a DOI if the data are to be published immediately.
• INVESTIGATOR: Final check by PI of dataset after curators sign off.
• INVESTIGATOR: Request dataset to be published.
• CURATOR: Publishes the dataset (Principal Investigator), or allows it to be published automatically after the embargo period ends.
These steps can be viewed within the private data portal using the “dataset status”, a feature implemented in Blackfynn in December 2019. The steps that each dataset goes through are formalized, numbered, and color coded (Fig. 11). Each label is associated with the party that is responsible for setting the particular status.
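The weekly matching step in the list above amounts to comparing two sets of dataset IDs. The sketch below is a hypothetical illustration of that comparison only; the function name and the ID strings are assumptions, and nothing here reflects how the actual script reads the Blackfynn data dump or the master spreadsheet.

```python
def find_new_datasets(dump_ids, tracked_ids):
    """Return dataset IDs present in the data dump but not yet tracked.

    dump_ids    -- dataset IDs found in the latest data dump
    tracked_ids -- dataset IDs already logged in the master spreadsheet
    The order of first appearance in the dump is kept and repeats are dropped.
    """
    tracked = set(tracked_ids)
    seen = set()
    new_ids = []
    for dataset_id in dump_ids:
        if dataset_id not in tracked and dataset_id not in seen:
            new_ids.append(dataset_id)
            seen.add(dataset_id)
    return new_ids
```

Each ID returned by such a comparison would correspond to a dataset that still needs to be logged and acknowledged by a curator.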
Please note that the teams at MBF Biosciences and ABI are considered curators for this workflow. These teams are responsible for ensuring that SPARC data are aligned to common spatial frameworks, as described in the introduction. These steps are not necessarily performed in sequential order. For example, the image registration, conversion and segmentation performed by MBF Biosciences may be performed before the imaging files are uploaded to Blackfynn. Researchers do, however, create the necessary dataset descriptors in Blackfynn and often upload the necessary metadata files. This means that in some cases the order will be 1-2-5-6 (MBF Biosciences)-3-4 (UCSD Curation). Fig. 12 is a schematic representation of the workflow described above. It highlights how data are generated by individual investigators, curated by the Data Curation Team, and shared as an embargoed dataset with the SPARC Embargoed Data Sharing Group. It also shows how the data are made available to the public over time.
Figure 10. Data submission milestones.
Figure 11. Ordered status types set by Investigators or set by Curators.
8. Quality Control Metrics for SPARC Datasets
The SPARC curation team has developed a set of QC guidelines that are used to check for errors and to ensure consistency in the descriptions of SPARC data. SPARC datasets are checked by the curation team for the following:
1. They conform to the requirements of the SPARC Data Set Structure.
2. The files are appropriately organized into the primary, derivative, docs, code, protocol folders.
3. Manifests are included at each level of the folder tree and contain a sufficient description of the files present.
4. Title and description are clear, appropriate and detailed [15].
5. If the data are part of a larger dataset, the relationships are specified.
6. The species is appropriately identified and referred to consistently across protocol, dataset description and experimental data.
7. All types of experimental data referred to in the abstract or protocol are contained in the dataset.
8. All abbreviations used to describe the dataset across the different documents are defined.
9. All experimental or sample groups referenced in the metadata are defined.
10. All file types submitted conform to approved file types (upcoming).
11. As metadata standards are defined, they are used appropriately.
A checklist has been developed, which also includes questions that can be asked of the investigator [16]. Some of these checks will be incorporated into future versions of SODA.
9. Discussion
The establishment of the SDS has proven to be essential to curating the complex and large datasets submitted for SPARC. With a common structure, curators can take advantage of tools such as the validator to help with the curation process, allowing them to focus on the scientific aspects of the dataset (e.g., are the description and protocol clear?) rather than simple mechanical tasks such as checking whether the number of subjects listed matches the number of subject folders. We realize that the SDS presents extra work for SPARC investigators, who must adapt their local lab practices to comply with a new structure. However, with the launch of SODA, investigators should find it easier to walk through the curation process. Finally, as the SPARC portal evolves, the user interface can take advantage of the regular structure to make it easier to browse SPARC data in a consistent manner.
In order for the SPARC project to meet its deliverables, the first round of standards needed to be implemented relatively quickly. The first public data release for SPARC occurred in July 2019 at the ISAN 2019 meeting. At that time, curators were curating to SDS 1.0, but many of the datasets released were demo datasets and were not fully structured. The SDS was revised in October of 2019 in response to the July release and through discussions with investigators. Data for the February 2020 release were curated to SDS 1.2.3. At that time, all of the original datasets were also recurated. As the SDS continues to evolve (version 2.0 is scheduled to be released in spring of 2021), we are not planning on recurating older data, as it would not be feasible to constantly revise the large number of datasets available through SPARC. We are, however, extracting larger amounts of structured information from these datasets, e.g., from the experimental protocols, and mapping it to the MIS, so some re-curation of metadata does occur. This information will be used to create more powerful and nuanced search across SPARC datasets and models.
Figure 12. Overview of the entire submission-curation-publishing workflow.
Because of the Consortium’s variability in experimental methodology and primary data types, we are continuing to evaluate whether the SPARC Data Structure is sufficient for any investigator to unambiguously interpret the datasets from other research labs.
On the technical side, we are examining whether this new structure allows datasets to be exchanged and queried freely as well as understood by other scientists. For each use case, such as simulation data or physiology data, we look at the relevant SPARC protocols and current results to determine the parameters required to understand the resulting data. We are using this information as a basis for formalizing modality-specific extensions to the SDS and MIS and to develop QC guidelines, as outlined in Section 8. There are several additional areas where standardization will benefit SPARC. Within the next year, SPARC will also move to implement more consistent file formats for major data types, ensuring that SPARC data are available in non-proprietary formats. For example, all imaging data will have to be submitted as JPEG2000 and BioTIFF or in a format that can be converted to these formats. Guidelines for additional data types will be released in summer of 2021. A third driver of standards in SPARC is the requirement to be interoperable with other data repositories, particularly those being created by the US BRAIN Initiative and other large brain projects around the world. The US BRAIN Initiative is investing in the creation of standards for major data types such as neuroimaging (BIDS [17]), neurophysiology (NWB [18]) and 3D microscopy. These standards underlie the major archives established for BRAIN data: OpenNeuro, DANDI and the Brain Image Library, respectively. SPARC will be monitoring these standards for maturity and will create the means for SPARC data to be converted into these formats. The establishment of standards for SPARC also underwent a governance change after the first data release.
While in the first phase of the project, data standards were developed or recommended by the Data Standards Committee comprising SPARC investigators, after the first sets of data were released, responsibility for recommending and implementing new standards was shifted to the curation team, as they are most familiar with the breadth of SPARC data and the areas requiring standardization. The recommendations of the SPARC curation team are then put forward for review by the Data Standards Group and the SPARC community at large.
References
1. Olofsson, P. S. & Tracey, K. J. Bioelectronic medicine: technology targeting molecular mechanisms for therapy. Journal of Internal Medicine vol. 282 3–4 (2017).
2. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, (2016).
3. Gorgolewski, K. J. et al. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci. Data 3, 1–9 (2016).
4. Blackfynn. https://www.blackfynn.com/.
5. Abrams, M. B. et al. A Standards Organization for Open and FAIR Neuroscience: the International Neuroinformatics Coordinating Facility. Neuroinformatics 1–12 (2021) doi:10.1007/s12021-020-09509-0.
6. openMINDS. https://github.com/HumanBrainProject/openMINDS.
7. DataCite Metadata Working Group. DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.3. DataCite e.V. doi:https://doi.org/10.14454/7xq3-zf69.
8. SPARC Material Sharing Policy. NIH COMMONS https://commonfund.nih.gov/sparc/materialsharing (2017).
9. Morris, K. et al. Feline brainstem neuron extracellular potential recordings. (2020) doi:https://doi.org/10.26275/1upo-xvkt.
10. sparc-curation: code and files for SPARC curation workflows. https://github.com/SciCrunch/sparc-curation.
11. JSON Schema Draft-06 Release Notes | JSON Schema. https://json-schema.org/draft-06/json-schema-release-notes.html.
12. Bug, W. J. et al. The NIFSTD and BIRNLex vocabularies: Building comprehensive ontologies for neuroscience. Neuroinformatics vol. 6 175–194 (2008).
13. Patel, B. SODA for SPARC: Simplifying data curation for researchers funded by the NIH SPARC initiative. https://github.com/bvhpatel/soda.
14. Patel, B., Srivastava, H., Aghasafari, P. & Helmer, K. SPARC: SODA, an interactive software for curating SPARC datasets. FASEB J. 34, 1–1 (2020).
15. QC documentation for investigators: titles and descriptions - Google Docs. https://docs.google.com/document/d/1zo3xDKRkPfOFJqlL2F_5mlOx9qJMBzIgBpDWtI2HJW4/edit#.
16. QC Checklist for SPARC - Google Sheets. https://docs.google.com/spreadsheets/d/1EmNkgIBVefTsi-RPAtRb7KGu-v1KiMhWbm0v-0mVVUY/edit#gid=1617718411.
17. The Brain Imaging Data Structure - BIDS v1.4.0. https://bids-specification.readthedocs.io/en/stable/.
18. Ruebel, O. et al. NWB:N 2.0: An Accessible Data Standard for Neurophysiology. bioRxiv 523035 (2019) doi:10.1101/523035.
Acknowledgements
We thank funding from NIH SPARC OT2OD030541, NIH SPARC OT2OD025308, and NIH SPARC OT2OD030213.
Author contributions
A.B., J.G., A.P., T.G., G.P., M.S-Z., and M.M. form the SPARC Curation Team and have all participated in the development of the SDS. B.P. is leading the development of SODA for SPARC. All have contributed to the writing and revision of this manuscript.
Competing interest statement
AB, MM and JG have equity interest in SciCrunch.com, a tech start-up out of UCSD that develops tools and services for reproducible science, including support for RRIDs. AB is the CEO of SciCrunch.com.
Appendix

Metadata specifications for SPARC datasets V1.2.3

Table A1: Descriptive metadata for V1.2.3. Required fields are highlighted in green while conditional fields (i.e., required if present) are highlighted in yellow.

Metadata element | Description | Example
Name | Descriptive title for the data set. Equivalent to the title of a scientific paper. The metadata associated with the published version of this dataset does not currently make use of this field. | My SPARC dataset
Description | NOTE: This field is not currently used when publishing a SPARC dataset. Brief description of the study and the data set. Equivalent to the abstract of a scientific paper. Include the rationale for the approach, the types of data collected, the techniques used, formats and number of files and an approximate size. The metadata associated with the published version of this dataset does not currently make use of this field. | A really cool dataset that I collected to answer some question.
Keywords | A set of 3-5 keywords, other than those in the title, that will aid in search | spinal cord, electrophysiology, RNA-seq, mouse
Contributors | Name of any contributors to the dataset. These individuals need not have been authors on any publications describing the data, but should be acknowledged for their role in producing and publishing the data set. If more than one, add each contributor in a new column. | Last, First Middle
Contributor ORCID ID | ORCID ID. If you don't have an ORCID, we suggest you sign up for one. | https://orcid.org/0000-0002-5497-0243
Contributor Affiliation | Institutional affiliation for contributors | https://ror.org/0168r3w48
Contributor Role | Contributor role, e.g., PrincipleInvestigator, Creator, CoInvestigator, ContactPerson, DataCollector, DataCurator, DataManager, Distributor, Editor, Producer, ProjectLeader, ProjectManager, ProjectMember, RelatedPerson, Researcher, ResearchGroup, Sponsor, Supervisor, WorkPackageLeader, Other. These roles are provided by the DataCite schema. If more than one, add additional columns | Data Collector
Is Contact Person | Yes or No if the contributor is a contact person for the dataset | Yes
Acknowledgements | Acknowledgements beyond funding and contributors | Thank you everyone!
Funding | Funding sources | OT2OD025349
Originating Article DOI | DOIs of published articles that were generated from this dataset | https://doi.org/10.13003/5jchdy
Protocol URL or DOI | URLs (if still private) / DOIs (if public) of protocols from protocols.io related to this dataset |
Additional Links | URLs of additional resources used by this dataset (e.g., a link to a code repository) | https://github.com/myuser/code-for-really-cool-data
Link Description | Short description of URL content; you do not need to fill this in for Originating Article DOI or Protocol URL or DOI | link to GitHub repository for code used in this study
Number of subjects | Number of unique subjects in this dataset, should match subjects metadata file. | 1
Number of samples | Number of unique samples in this dataset, should match samples metadata file. Set to zero if there are no samples. | 0
Completeness of data set | Is the data set as uploaded complete or is it part of an ongoing study? Use "hasNext" to indicate that you expect more data on different subjects as a continuation of this study. Use "hasChildren" to indicate that you expect more data on the same subjects or samples derived from those subjects. | hasNext, hasChildren
Parent dataset ID | If this is a part of a larger data set, or references subjects or samples from a parent dataset, what was the accession number of the prior batch? You need only give us the number of the last batch, not all batches. If samples and subjects are from multiple parent datasets, please create a comma-separated list of all parent ids. | N:dataset:c5c2f40f-76be-4979-bfc4-b9f9947231cf
Title for complete data set | Please give us a provisional title for the entire data set. |
Metadata Version DO NOT CHANGE | 1.2.3 | 1.2.3

Table A2: Subject metadata. Required fields are highlighted in green while recommended fields are highlighted in yellow. Blue fields are required (pool_id only if pooled subjects were used) and provide the necessary fields for providing provenance of subjects and subject pools within experiments.

Attribute | Description | Example
subject_id | Lab-based schema for identifying each subject, should match folder names | sub-1
pool_id | If data is collected on multiple subjects at the same time, include the identifier of the pool where the data file will be found. If this is included, it should be the name of the top-level folder inside primary. | pool-1
experimental group | Experimental group subject is assigned to in research project | Control
age | Age of the subject (e.g., hours, days, weeks, years old) or if unknown fill in with "unknown" | 4 weeks
sex | Sex of the subject, or if unknown fill in with "Unknown" | Female
species | Subject species | Rattus norvegicus
strain | Organism strain of the subject | Sprague-Dawley
RRID for strain | Research Resource Identifier (RRID) for the strain. For this field | RRID:RGD_10395233
Additional Fields (e.g. MINDS); MINDS = minimal information about a neuroscience dataset
age category | Description of age category derived from UBERON life cycle stage | prime adult stage
age range (min) | The minimal age (youngest) of the research subjects. The format for this field: numerical value + space + unit (spelled out) | 10 days
age range (max) | The maximal age (oldest) of the research subjects. The format for this field: numerical value + space + unit (spelled out) | 20 days
handedness | Preference of the subject to use the right or left hand, if applicable | right
genotype | Ignore if RRID is filled in. Genetic makeup of genetically modified alleles in transgenic animals belonging to the same subject group | MGI:3851780
reference atlas | The reference atlas and organ | Paxinos and Watson, The Rat Brain In Stereotaxic Coordinates, 7th Ed, 2013
protocol title | Once the research protocol is uploaded to Protocols.io, the title of the protocol within Protocols.io must be noted in this field. | Spinal Cord extraction
protocol.io location | The Protocol.io URL for the protocol. Once the protocol is uploaded to Protocols.io, the protocol must be shared with the SPARC group and the Protocol.io URL is noted in this field. Please share with the SPARC group. | https://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe6pe
experimental log file name | A file containing experimental records for each sample. |

Table A3: Sample metadata. The color key is the same as for subjects (Table A2).

Attribute | Description | Example
subject_id | Lab-based schema for identifying each subject | sub-1
sample_id | Lab-based schema for identifying each sample, must be unique | sub-1_sam-2
wasDerivedFromSample | sample_id of the sample from which the current sample was derived (e.g., slice, tissue punch, biopsy, etc.) | sub-1_sam-1
pool_id | If data is collected on multiple samples at the same time, include the identifier of the pool where the data file will be found. | pool-1
experimental group | Experimental group subject is assigned to in research project. If you have experimental groups for samples, please add another column. | Control
specimen type | Physical type of the specimen from which the data were extracted | tissue
specimen anatomical location | The organ, or subregion of organ, from which the data were extracted | dentate gyrus
Additional Fields (e.g. MINDS)
species | Subject species | Rattus norvegicus
sex | Sex of the subject, or if unknown fill in with "Unknown" | Female
age | Age of the subject (e.g., hours, days, weeks, years old) or if unknown fill in with "unknown" | 4 weeks
age category | Qualitative description of age category derived from UBERON life cycle stage | prime adult stage
age range (min) | The minimal age (youngest) of the research subjects. The format for this field: numerical value + space + unit (spelled out) | 10 days
age range (max) | The maximal age (oldest) of the research subjects. The format for this field: numerical value + space + unit (spelled out) | 20 days
handedness | Preference of the subject to use the right or left hand, if applicable | right
strain | Organism strain of the subject | Sprague-Dawley
RRID for strain | RRID for the strain. For this field | RRID:RGD_10395233
genotype | Ignore if RRID is filled in. Genetic makeup of genetically modified alleles in transgenic animals belonging to the same subject group | MGI:3851780
reference atlas | The reference atlas and organ | Paxinos Rat V3
protocol title | Once the research protocol is uploaded to Protocols.io, the title of the protocol within Protocols.io must be noted in this field. | Spinal Cord extraction
protocol.io location | The Protocol.io URL for the protocol. Once the protocol is uploaded to Protocols.io, the protocol must be shared with the SPARC group and the Protocol.io URL is noted in this field. Please share with the SPARC group. | https://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe6pe
experimental log file name | A file containing experimental records for each sample. |

Table A4: Controlled vocabulary for experimental modes used in SPARC. These terms are in the process of being added to the NIFSTD ontology techniques branch.

name | Definition | NIFSTD ID
anatomy | Study that aims to understand the structure of organisms or their parts. |
behavioral | Study that induces and/or measures the behavior of the subject |
cell counting | Study that is designed to quantify cell populations |
cell culture | Study that employs cells isolated from the organism or tissue that are kept alive and studied in vitro |
cell morphology | Study that specifically seeks to understand the shape and structure of individual cells |
cell population characterization | Study that measures biochemical, molecular and/or physiological characteristics of populations of cells as opposed to individual cells |
connectivity | Study that maps or measures functional and/or anatomical connections between nerve cells and their targets or connections between populations of neurons in defined anatomical regions. |
electrophysiology | Study that measures electrical impulses within an organism, cell or tissue or the effects of direct electrical stimulation |
epigenomics | Study that measures modifications of genetic material that affect transcription but do not alter the organism's DNA |
expression | Study that measures or visualizes gene or protein expression within cells or tissues. Focuses on the gene. |
expression characterization | Study that characterizes the cellular, anatomical, or morphological distribution of gene expression. Focuses on population. |
genomics | Study that measures aspects related to the complete DNA genome of an organism |
histology | Study that investigates the microscopic structure of tissues |
microscopy | Study that primarily uses light or electron microscopic imaging |
models | Study that creates or characterizes computational models or simulations of other experimentally observed phenomena |
morphology | Study designed to determine the shape and structure of tissues and body parts |
multimodal | Study that employs multiple modalities in significant ways |
optical | Study that makes measurements using photons in the visible spectrum. |
physiology | Study that measures the function or behavior of organs and tissues in living systems. |
radiology | Study that uses at least one of a variety of minimally invasive probes such as x-rays, ultrasound, or nuclear magnetic resonance signals to capture data about the internal structure of intact subjects. |
spatial transcriptomics | Study used to spatially resolve RNA-seq data, and thereby all mRNAs, in individual tissue sections (Wikipedia). |
transcriptomics | Study that measures RNA transcription in the organism or cell population of interest |
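
Several of the field rules stated in Tables A2 and A3 (subject_id matching folder names, the "numerical value + space + unit" age range format, and unique sample_id values) are simple enough to check automatically. The following is a minimal sketch of such checks, not the validator used by the SPARC curation team; the function names, the lowercase-unit pattern, and the choice to check age ranges only when they are present are all assumptions made for illustration.

```python
import re

# Format stated in Tables A2/A3: numerical value + space + unit (spelled out).
# Assumption: the unit is written in lowercase letters, e.g. "10 days".
AGE_RANGE = re.compile(r"^(\d+(?:\.\d+)?) ([a-z]+)$")


def check_age_range(value):
    """Return (number, unit) if value looks like '10 days', else None."""
    match = AGE_RANGE.match(value)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def check_subjects(rows, folder_names):
    """Collect human-readable problems for a list of subject rows (dicts)."""
    problems = []
    for row in rows:
        sid = row.get("subject_id", "")
        # Table A2: subject_id should match folder names.
        if sid not in folder_names:
            problems.append(f"{sid}: no matching folder")
        for key in ("age range (min)", "age range (max)"):
            value = row.get(key)
            # Assumption: these fields are only checked when they are filled in.
            if value is not None and check_age_range(value) is None:
                problems.append(f"{sid}: bad {key} {value!r}")
    return problems


def duplicate_sample_ids(rows):
    """Table A3: sample_id must be unique; return ids that appear more than once."""
    seen, dupes = set(), []
    for row in rows:
        sample_id = row["sample_id"]
        if sample_id in seen and sample_id not in dupes:
            dupes.append(sample_id)
        seen.add(sample_id)
    return dupes


# Hypothetical rows for illustration only.
subjects = [
    {"subject_id": "sub-1", "age range (min)": "10 days", "age range (max)": "20 days"},
    {"subject_id": "sub-2", "age range (min)": "10d"},
]
print(check_subjects(subjects, folder_names={"sub-1"}))
```

In this toy input, sub-2 is reported twice: once because no folder carries its name and once because "10d" does not follow the stated age range format.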
10_1101-2021_02_10_430604 ---- Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets

Nicholas D. Youngblut*,1, Ruth E. Ley1

1 Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring 5, 72076 Tübingen, Germany
* Corresponding author: Nicholas Youngblut (nicholas.youngblut@tuebingen.mpg.de)
Running title: Struo2 builds databases faster
Key words: metagenome, database, profiling, GTDB

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430604; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract
Mapping metagenome reads to reference databases is the standard approach for assessing microbial taxonomic and functional diversity from metagenomic data. However, public reference databases often lack recently generated genomic data such as metagenome-assembled genomes (MAGs), which can limit the sensitivity of read-mapping approaches. We previously developed the Struo pipeline in order to provide a straightforward method for constructing custom databases; however, the pipeline does not scale well with the ever-increasing number of publicly available microbial genomes. Moreover, the pipeline does not allow for efficient database updating as new data are generated. To address these issues, we developed Struo2, which is >3.5-fold faster than Struo at database generation and can also efficiently update existing databases. We also provide custom Kraken2, Bracken, and HUMAnN3 databases that can be easily updated with new genomes and/or individual gene sequences. Struo2 enables feasible database generation for continually increasing large-scale genomic datasets.

Availability:
● Struo2: https://github.com/leylabmpi/Struo2
● Pre-built databases: http://ftp.tue.mpg.de/ebio/projects/struo2/
● Utility tools: https://github.com/nick-youngblut/gtdb_to_taxdump

Results
Metagenome profiling involves mapping reads to reference sequence databases and is the standard approach for assessing microbial community taxonomic and functional composition via metagenomic sequencing. Most metagenome profiling software includes “standard” reference databases. For instance, the popular HUMAnN pipeline includes multiple databases for assessing both taxonomy and function from read data (Franzosa et al., 2018). Similarly, Kraken2 includes a set of standard databases for taxonomic classification of specific clades (e.g., fungi or plants) or all taxa (Wood et al., 2019). While such standard reference databases provide a crucial resource for metagenomic data analysis, they may not be optimal for the needs of researchers. For example, a custom database that includes newly generated MAGs can increase the percent of reads mapped to references (Youngblut et al., 2020). The process of making custom reference databases is often complicated and requires substantial computational resources, which led us to create Struo for straightforward custom metagenome profiling database generation (de la Cuesta-Zuluaga et al., 2020). However, Struo requires ~2.4 CPU hours per genome, which would necessitate >77,900 CPU hours (>9.1 years) if including one genome for each of the 31,911 species in Release 95 of the Genome Taxonomy Database (GTDB) (Parks et al., 2018).

Struo2 generates Kraken2 and Bracken databases similarly to Struo (Lu et al., 2017; Wood et al., 2019), but the algorithms diverge substantially for the time-consuming step of gene annotation required for HUMAnN database construction. Struo2 performs gene annotation by clustering all gene sequences of all genomes using the mmseqs2 linclust algorithm, and then each gene cluster representative is annotated via mmseqs2 search (Figure 1A; Supplemental Methods) (Steinegger and Söding, 2017, 2018). In contrast, Struo annotates all non-redundant genes of each genome with DIAMOND (Buchfink et al., 2015). Struo2 utilizes snakemake and conda, which allows for easy installation of all dependencies and simplified scaling to high performance computing systems (Köster and Rahmann, 2012).

Benchmarking on genome subsets from the GTDB showed that Struo2 requires ~0.67 CPU hours per genome versus ~2.4 for Struo (Figure 1B). Notably, Struo2 annotates slightly more genes than Struo, possibly due to the sensitivity of the iterative mmseqs search algorithm (Figure 1C). The use of mmseqs2 allows for efficient database updating with new genomes and/or individual gene sequences via mmseqs clusterupdate (Figure S1); we show that this approach saves 15-19% of the CPU hours relative to generating a database from scratch (Figure 1D).

We used Struo2 to create publicly available Kraken2, Bracken, and HUMAnN3 custom databases from Release 95 of the GTDB (see Supplemental Methods). We will continue to publish these custom databases as new GTDB versions are released. The databases are available at http://ftp.tue.mpg.de/ebio/projects/struo2/. We also created a set of utility tools for generating NCBI taxdump files from the GTDB taxonomy and mapping between the NCBI and GTDB taxonomies. The taxdump files are utilized by Struo2, but these tools can be used more generally to integrate the GTDB taxonomy into existing pipelines designed for the NCBI taxonomy (available at https://github.com/nick-youngblut/gtdb_to_taxdump).

Figure 1. Struo2 can build databases faster than Struo and can efficiently update the databases. A) A general outline of the Struo2 database creation algorithm. Cylinders are input or output files, squares are processes, and right-tilted rhomboids are intermediate files. The largest change from Struo is the utilization of mmseqs2 for clustering and annotation of genes. B) Benchmarking the number of CPU hours required for Struo and Struo2, depending on the number of input genomes. C) The number of genes annotated with a UniRef90 identifier. D) The percent of CPU hours saved via the Struo2 database updating algorithm versus de novo database generation. The original database was constructed from 1000 genomes. For B) and D), the grey regions represent 95% confidence intervals.

Data availability
Struo2 is available at https://github.com/leylabmpi/Struo2, the pre-built databases can be found at http://ftp.tue.mpg.de/ebio/projects/struo2/, and utility tools are located at https://github.com/nick-youngblut/gtdb_to_taxdump.
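
The cluster-then-annotate idea described in the Results (cluster all genes across genomes, annotate one representative per cluster, and reuse that annotation for the other members) can be illustrated with a toy sketch. This is not Struo2 code and it does not call mmseqs2: exact sequence identity stands in for linclust clustering, a dictionary lookup stands in for the search step, and every gene name, sequence, and identifier below is made up.

```python
from collections import defaultdict


def cluster_genes(genes):
    """Group gene ids by identical sequence (toy stand-in for mmseqs2 linclust).

    Returns {representative_id: [member_ids]}; the first gene seen is the representative.
    """
    by_sequence = defaultdict(list)
    for gene_id, sequence in genes.items():
        by_sequence[sequence].append(gene_id)
    return {members[0]: members for members in by_sequence.values()}


def annotate_by_cluster(genes, reference):
    """Annotate one representative per cluster and copy the label to all members."""
    clusters = cluster_genes(genes)
    annotations = {}
    lookups = 0
    for representative, members in clusters.items():
        lookups += 1  # one search per cluster, not one per gene
        # Toy stand-in for a sequence search against a reference database.
        label = reference.get(genes[representative])
        if label is None:
            continue  # clusters without a hit stay unannotated
        for member in members:
            annotations[member] = label
    return annotations, lookups


# Hypothetical genes from three genomes; two of them share an identical protein.
genes = {
    "genomeA_gene1": "MKTAYIAK",
    "genomeB_gene7": "MKTAYIAK",
    "genomeC_gene2": "MSLLNDQW",
}
reference = {"MKTAYIAK": "UniRef90_example1"}  # made-up identifier

annotations, lookups = annotate_by_cluster(genes, reference)
print(annotations)
print(lookups)
```

Here three genes need only two lookups because the shared protein is searched once. Real clustering groups similar rather than identical sequences, so the saving depends on how redundant the input genomes are.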
Acknowledgements
This study was supported by the Max Planck Society. We thank Albane Ruaud, Liam Fitzstevens, Jacobo de la Cuesta-Zuluaga, and Jillian Waters for providing helpful comments on an earlier version of this manuscript.

References
Buchfink, B. et al. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12, 59–60.
de la Cuesta-Zuluaga, J. et al. (2020) Struo: a pipeline for building custom databases for common metagenome profilers. Bioinformatics, 36, 2314–2315.
Franzosa, E.A. et al. (2018) Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods, 15, 962–968.
Köster, J. and Rahmann, S. (2012) Snakemake--a scalable bioinformatics workflow engine. Bioinformatics, 28, 2520–2522.
Lu, J. et al. (2017) Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci., 3, e104.
Parks, D.H. et al. (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol., 36, 996–1004.
Steinegger, M. and Söding, J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol., 35, 1026–1028.
Steinegger, M. and Söding, J. (2018) Clustering huge protein sequence sets in linear time. Nat. Commun., 9, 2542.
Wood, D.E. et al. (2019) Improved metagenomic analysis with Kraken 2. Genome Biol., 20, 257.
Youngblut, N.D. et al. (2020) Large-Scale Metagenome Assembly Reveals Novel Animal-Associated Microbial Genomes, Biosynthetic Gene Clusters, and Other Genetic Diversity. mSystems, 5.
10_1101-2021_02_10_430606 ---- NeuronMotif: Deciphering transcriptional cis-regulatory codes from deep neural networks

Zheng Wei1, Kui Hua1, Lei Wei1, Shining Ma2, Rui Jiang1, Yanda Li1, Wing Hung Wong2, Xiaowo Wang1, *

1. Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, 100084, China
2. Department of Statistics, Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA

Abstract
Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Such complicated motif grammars are difficult to summarize with shallow models. Although Deep Convolutional Neural Networks (DCNNs) have achieved great success in annotating cis-regulatory elements, few combinatorial motif grammars have been accurately interpreted, owing to the mixed signals in DCNNs. To address this problem, we proposed NeuronMotif, a general backward decoupling algorithm, to reveal the homo-/hetero-typic motif combinations and arrangements embedded in convolutional neurons. We applied NeuronMotif to several widely used DCNN models. Many of the uncovered motif grammars of deep convolutional neurons are supported by the literature or by ATAC-seq footprinting. We further diagnosed the sick neurons that are sensitive to adversarial noise, which can guide DCNN architecture optimization for better prediction performance and motif feature extraction. Overall, NeuronMotif enables decoding cis-regulatory codes from deep convolutional neurons and understanding DCNNs from a novel perspective.
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430606; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Fig. 1. Overview of NeuronMotif and existing methods. a, A trained DCNN model can annotate genome function with the corresponding genome sequence as input. Interpreting regulatory grammar from a DCNN includes discovering the motif glossary and syntax. A motif is similar to the different word-forms of a lexeme, the smallest isolatable meaningful unit. Soft/hard hetero-/homo-multimer motifs are organized by a motif syntax tree. b,c, Max activation and saliency map methods adapted from CV. d,e, How NeuronMotif decouples a layer-4 neuron based on the mechanism of the DCNN. d, Eight sequences $\boldsymbol{x}_{1-8}$, matched by two CTCF-6N-DDIT3::CEBPA motifs at four different relative positions, are sampled by an adapted genetic algorithm. In each layer, the masked subsequences are detected by the neurons of the corresponding colors. A convolutional neuron combines the motif sequences recognized by previous layers (rectangles with black border) and fills the gap between them. The max-pooling operation aligns the recognized regions by extending their length. The chaotic signals of nucleotide bases in $\boldsymbol{x}_{1-8}$ with similar function are unified layer by layer into the similar signal $y_{1-8} = y^{(4)}_{1-8}$ (feature map $\boldsymbol{y}^{(l)}$; only the key components of the feature maps in layers $l = 2, 3, 4$ are shown in the figure). $\boldsymbol{y}^{(3)}_{1-8}$ and $y^{(4)}_{1-8}$ ($y_{1-8}$) are independent of the different motif sequences and of the shift diversity. e, From layer 4 to layer 1, the feature maps of the sequences can first be distinguished at layer 2. To reverse the max-pooling operation of size 2, k-means (k = 2) is applied twice, recursively, on the feature maps $\boldsymbol{y}^{(2)}_{1-8}$. $\boldsymbol{x}_{1-8}$ are divided into 4 groups, and a PPM is calculated for each group. A is the max activation in each group.
Fig. 2. Details of NeuronMotif. a, In experiments like SELEX, the sequences (𝒛) bound by a TF are filtered and aligned for motif estimation 𝔼𝑿. To simulate this process, simply enumerating the valid sequences (𝒙) of a neuron to estimate its motif is not correct (given the distribution of 𝑿). The frequency/weight of the sequences should be proportional to the affinity level (given the distribution of 𝑌 in b). Within the whole DCNN structure, the DCNN sub-structure of the neuron in red is equivalent to the function 𝑦 = 𝑓(𝒙). The abbreviation s.t. means "subject to". b, Distribution of the neuron activation values (𝑦) in a. The sequence collection with a higher activation level contains more information in the sequence logo. c, The sequences are sampled during the optimization of the seed sequences. d, Two types of latent variables lead to motif mixture in the neuron model. The shifted motifs can be decoupled by controlling the shift latent variable, which determines positions 1 and 2. The synonymous latent variable determines the different replaceable motifs with similar function at the same position. The example is the original motif 1 and its reverse-complement motif 2. For some TFs, function is not sensitive to orientation. e, Comparing the neuron with (left column) and without (right column) a synonymous motif mixture. When controlling for the synonymous latent variable 𝑆 = 𝑆1, 𝑆2, the sequences and corresponding motifs are similar in the single model but different in the mixture model. The sequences with the max activation value in the two models are $\boldsymbol{x}_1$, $\boldsymbol{x}_2$, $\boldsymbol{x}_3$ ($f_s(\boldsymbol{x}_1) > f_m(\boldsymbol{x}_2) \approx f_m(\boldsymbol{x}_3)$). Both models share the consensus sequence ($\boldsymbol{x}_c$).
Fig. 3. Using NeuronMotif to annotate the Basset model. a, Four motifs of a second-layer neuron decoupled by NeuronMotif (rows 2-4). The decoupled motifs, all of the same size (the receptive field size of a neuron in the second layer is 51 bp), are aligned with 1 bp offsets. They are matched to the JASPAR motif NFIB using Tomtom (row 1). The interpretation results using the methods of Kelley et al., Alipanahi et al. and Saliency Map are shown in rows 5-7. b, Motifs of a third-layer neuron decoupled by NeuronMotif (rows 2-13). The 12 decoupled motifs, all of the same size (the receptive field size of neurons in the third layer is 132 bp), are aligned with 1 bp offsets. They are matched to the JASPAR motifs CEBPB, CTCF and DDIT3::CEBPA using Tomtom (row 1). The interpretation results using the methods of Kelley et al., Alipanahi et al. and Saliency Map are shown in rows 14-16. c, Two neurons in the second layer learn reverse-complementary motifs. They represent the motifs of AAC triplet repeats (rows 1-3) and GTT triplet repeats (rows 4-6), respectively, as decoupled by NeuronMotif.

Fig. 4. NeuronMotif diagnoses model defects and guides DCNN architecture design for better performance. a, Definition of a dead kernel. The activation value distribution of a dead neuron (pink) is entirely negative, so it is filtered out by the ReLU activation function. The output of a dead neuron is zero, and the downstream neuron output does not depend on it.
b,c, Diagnosis of motif mixture in the decoupled motifs from the Basset, BD-10, DeepSEA and DD-10 models. Each point is a decoupled motif generated from a sample set of sequences. Points for motifs generated from sample sets with fewer than 100 sequences are marked in red; all others are marked in blue. The distribution of the max activation values shows whether the relative max activation values of most motifs are too low. b, Diagnosis of models trained on the DeepSEA dataset. c, Diagnosis of models trained on the Basset dataset. Only the max activation values of the decoupled motifs in Fig. 3b are significantly higher than those of the decoupled motifs of other neurons in layer 3 of the Basset-3 model. d, The meaning of each region in the sub-plots of b,c. e, Schematic of the receptive field coupling of previous-layer neurons in the neuron sub-structure. f,g, AUPRC is used as the indicator to compare the prediction performance of models. For each model pair, a one-sided t-test of ΔAUPRC, the difference in AUPRC between the two models of the pair, is used to assess the level of the performance difference. f, Comparison between the DeepSEA and DD-10 models. g, Comparison between the Basset and BD-10 models.

Fig. 5. Motif discovery performance for different layers in different models. a, Accuracy analysis for the motifs discovered from different models.
Three columns of box plots describe the similarity between neuron motifs and JASPAR motifs in the Basset, BD-5 and BD-10 models, respectively. For each selected layer in a model, the distribution of -log10(q-value) for the top 100 neurons matched to JASPAR motifs (q-value < 0.1) is shown with a box and jittered points. The color of the box indicates the interpretation method applied. The first row shows the results for the input (first) layer of the Basset model and for the shallow layers in the BD-5 and BD-10 models whose receptive field size (18 bp) is similar to that of a Basset input-layer neuron (19 bp). The second row shows the results for the convolutional output layer, i.e. the convolutional layer in front of the dense layer, of the three models. b, The number of motifs discovered (q-value < 0.001) from the neurons in the convolutional output layer of the Basset, BD-5 and BD-10 models. c, The number of motifs discovered (q-value < 0.01) from the neurons in layer 3 of the Basset model using different interpretation methods: Kelley et al., Alipanahi et al. and NeuronMotif. d, Motifs discovered from the neurons of the top convolutional layer in the BD-10 model (q-value < 0.01). These motifs can be matched to the JASPAR database. Only the one with the smallest q-value for each JASPAR motif is shown.

Fig. 6. Verification of neuron motifs. a, Motif syntax represented by a neuron of the top convolutional layer (layer 10) in the DD-10 model applied to DeepSEA data. The neuron motifs (144 bp) are matched to the TFs CTCF and DDIT3::CEBPA. CTCF-DDIT3::CEBPA is a hard hetero-trimer. The distance between two CTCF-DDIT3::CEBPA trimers is flexible.
b, Motif syntax represented by another neuron of the top convolutional layer (layer 10) in the DD-10 model applied to DeepSEA data. The neuron motifs (144 bp) are matched to the TF NFIX. c, ATAC-seq footprinting in five different cell types (500 bp upstream and downstream of the matched motif midpoint are shown) for the motifs in a. The cut-site counts at each position are normalized by the total cut-site count within the 1000 bp window. d, Similar to c, footprinting of the neuron motif in b. e, CTCF-DDIT3::CEBPA motif match counts for each relative motif midpoint position. The soft homodimer relations of the CTCF-DDIT3::CEBPA heterotrimer are shown at the bottom. f, NFIX motif match counts for each relative motif midpoint position. The soft homotrimer relation of NFIX is shown at the bottom.

Introduction
The DNA sequence is a language of life1. To understand life processes, it is essential to decode the grammar of DNA. One of the most important problems is deciphering the transcriptional cis-regulatory code from functional DNA sequences. Deep sequencing techniques such as ChIP-seq and ATAC-seq2 have been developed to discover sequences with a specific function or characteristic, such as Transcription Factor Binding Sites (TFBSs), Histone-marks (HMs) and chromatin openness. But the logic of the sequence is difficult to summarize directly. With the development of deep learning techniques, a growing number of researchers resort to the deep convolutional neural network (DCNN) for its significant advantages, including automatic extraction of sequence motifs (Fig. 1d) and higher prediction accuracy3.
For example, the DeepSEA4 and Basset5 models successfully use DNA sequence to predict chromatin-profiling data, including TFBSs, HM profiles and DNase I sensitivity. Among these common functions, in cis-regulatory modules, Transcription Factors (TFs) regulate gene expression by binding or co-binding to specific preferred DNA sequences that occur at particular genome positions6. Accurately characterizing TF binding specificities and interpreting the relative positions of TFs from a DCNN are vital to understanding the logic of gene regulation (Fig. 1a). Unfortunately, a DCNN is a black box: it is difficult to interpret exactly what motif glossary, or even motif grammar, it has learned. Interpreting the DCNN black box is not as straightforward as annotating function. Most existing methods4,5,7,8 seek to interpret a DCNN by detecting the correlation between the predicted genome function (the model output) and the DNA sequence at single-nucleotide resolution (the model input), via different approaches adapted from Computer Vision (CV) (Fig. 1b, Fig. 1c and Fig. S1; see Supplementary Information for details). However, from the viewpoint of linguistics, the letters of nucleotide bases have no actual meaning unless they are combined into the various words of motif sequences9. Thus, interpreting the meaning of a single nucleotide base while ignoring its context dependence is polysemous or even meaningless. The average interpretation of polysemous results is a confusing mixture. Owing to the lack of interpretation methods, the design of deeper DCNN structures with better prediction performance is limited. Unlike the ever-deeper DCNNs applied in CV, such as the 16-convolutional-layer VGG-19 and the 128-convolutional-layer ResNet10, most DCNN models for studying genome functions contain at most 3 convolutional layers to guarantee clear interpretations11,12. The interpretation of the first layer in a shallow 3-convolutional-layer DCNN is more reliable with existing interpretation methods.
These shallow models avoid the serious motif mixing problems in deeper layers13 and the motif fragmentation that happens in the first layer of deeper DCNNs14, but the kernel size in the first layer has to be large enough to learn a single complete motif5. However, deeper DCNNs show better performance in genomics15,16. Hence, performance and interpretability seem to be a trade-off determined to a large extent by the DCNN architecture. Here, we proposed NeuronMotif to decipher transcriptional cis-regulatory grammar from DCNNs (Fig. 1d,e). This algorithm considers the sequences recognized by an Artificial Neuron (AN) as a mixture model depending on latent variables. Working backward from the output of the AN to the input, it automatically discovers the latent variables that reflect the neural network structure, and uses them to decouple the AN mixture model and extract the motif grammar. We applied NeuronMotif to several existing shallow DCNNs (DeepSEA4 and Basset5). A large portion of the uncovered motifs and the syntaxes of their combinations are supported by the literature or by ATAC-seq profiles, and NeuronMotif outperforms existing state-of-the-art methods. The results of NeuronMotif reveal the origin of adversarial noise in the model, which can be used to guide the design of DCNN architectures that suppress noise. With the help of NeuronMotif, we further built and interpreted deeper, 10-convolutional-layer DCNNs.
Results

The NeuronMotif algorithm for uncovering motifs and decoupling motif mixtures from a DCNN model
TFs are proteins that can recognize and bind to specific DNA sequences. The preferred sequences bound by a given TF are usually summarized as a motif. A motif model typically refers to a Position Weight Matrix (PWM), which can be converted from a Position Probability Matrix (PPM)17. At each base position in the PPM, the four scores represent the probabilities of the four bases occurring at that relative position of the TFBS. The probabilities can be estimated by collecting the DNA sequences bound by the TF through experiments such as Systematic Evolution of Ligands by Exponential Enrichment (SELEX)18 (Fig. 2a). This process is similar to sampling TFBS sequences (𝒙) of length 𝑁 bases from a 4 × 𝑁 random variable matrix $\boldsymbol{X} \sim p_{\mathrm{PPM}_{4 \times N}}(\boldsymbol{x})$ and estimating the PPM (𝔼𝑿) by the element-wise average of 𝒙 (see Methods). Here, 𝒙 is the 4 × 𝑁 one-hot code of the sequence, and each column of 𝑿 follows its own independent categorical distribution. The sampling process in the experiment reflects the binding affinities of the TF to the sequences: sequences with stronger affinities may occur at higher frequency. Inspired by the way SELEX screens TF-preferred sequences, we attempted to imitate this process by sampling AN-preferred sequences to study an AN. The sub-structure of an AN processes the sequence input (𝒙) with a non-linear function 𝑦 = 𝑓(𝒙) and then outputs an activation value (𝑦) (Fig. 2a and Fig. S4a). This process is quite similar to SELEX screening, because the sequences (𝒙) with higher activation 𝑦 are preferred by the AN for affecting downstream ANs and the final prediction result, which reflects sequence affinity. Hence, the input random variable matrix $\boldsymbol{X} \sim p_{\mathrm{PPM}_{4 \times N}}(\boldsymbol{x})$ depends on the output random variable 𝑌 ∼ 𝑝(𝑦) through 𝑌 = 𝑓(𝑿).
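The PPM estimate described above, an element-wise average of 4 × 𝑁 one-hot codes, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `one_hot` and `estimate_ppm` and the toy binding sites are assumptions made for this example.

```python
BASES = "ACGT"


def one_hot(seq):
    """Return the 4 x N one-hot code of a DNA sequence (rows ordered A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def estimate_ppm(seqs):
    """Estimate a PPM as the element-wise average of the one-hot codes."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("sequences must be aligned and of equal length")
    ppm = [[0.0] * n for _ in BASES]
    for s in seqs:
        code = one_hot(s)
        for r in range(4):
            for c in range(n):
                ppm[r][c] += code[r][c] / len(seqs)
    return ppm


# Made-up aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACTT", "CCGT"]
ppm = estimate_ppm(sites)
for base, row in zip(BASES, ppm):
    print(base, row)
```

Each column of the result sums to one, so it can be read as a categorical distribution over the four bases at that position.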
To obtain a PPM that reflects binding affinity rather than binding probability, we adopt a linear function for the distribution 𝑝(𝑦) (Fig. 2b, see Methods for a detailed explanation). In other words, the sampling weight or frequency of each unique sequence (𝒙) should be positively proportional to its activation value (𝑦) (Fig. 2a and 2b). This can be implemented by sampling 𝑿 at the same level of 𝑦 to estimate 𝔼(𝑿|𝑌 = 𝑦) (bottom of Fig. 2b) and then taking the weighted average of these estimates (𝔼𝑿 = 𝔼[𝔼(𝑿|𝑌)]) to estimate the PPM (Fig. 2b, see Methods for details). This method improves on previous studies in representing the TF binding affinity to DNA sequences. Adapted Back Propagation (BP) methods like Saliency Map and DeepLIFT do not model the sequence preference with 𝑿, and their importance scores (e.g. 𝜕𝑦/𝜕𝒙) do not directly reflect a PPM or PWM (Fig. 1c). Adapted max activation methods, like those developed by Kelley et al. and Alipanahi et al. for interpreting the Basset model5 and the DeepBind model8, try to follow the PPM model, but they estimate 𝔼𝑿 by averaging 𝒙 over $\{\boldsymbol{x} \mid f(\boldsymbol{x}) > 0\}$ or $\{\boldsymbol{x} \mid f(\boldsymbol{x}) > y_{\max}/2\}$ without depending on the level of 𝑌, and thus can hardly reflect the activation preference (Fig. 1b). Here, we assumed that TFBSs are located at the same relative position in the input sequences without shifting; we can then define the motif or PPM for the sequences recognized by an AN as 𝔼𝑿 given the distribution of 𝑌. We call it the AN motif or the PPM of the AN. However, we found that, owing to the max-pooling operation in DCNNs, TFBSs may be located at different relative positions in the input sequences and still activate the AN.
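The activation-weighted estimate 𝔼𝑿 = 𝔼[𝔼(𝑿|𝑌)] described earlier in this section can be sketched as follows. This is only a sketch under stated assumptions: the equal-width binning of activations into levels, the use of the bin midpoint as the (linear) weight of a level, and all names (`activation_weighted_ppm`, `n_levels`, the toy data) are choices made for this example, not details taken from the Methods.

```python
BASES = "ACGT"


def mean_one_hot(seqs):
    """Element-wise average of the 4 x N one-hot codes of aligned sequences."""
    n = len(seqs[0])
    return [[sum(s[c] == b for s in seqs) / len(seqs) for c in range(n)]
            for b in BASES]


def activation_weighted_ppm(seqs, activations, n_levels=4):
    """Estimate E[X] = sum_k p(level k) * E(X | level k), with p linear in y."""
    y_max = max(activations)
    if y_max <= 0:
        raise ValueError("no sequence activates the neuron")
    # Group sequences into equal-width activation levels over (0, y_max].
    levels = {}
    for s, y in zip(seqs, activations):
        if y <= 0:
            continue  # non-activating sequences carry no weight
        k = min(int(y / y_max * n_levels), n_levels - 1)
        levels.setdefault(k, []).append(s)
    # Assumed linear p(y): weight of a level is proportional to its midpoint.
    weights = {k: (k + 0.5) / n_levels * y_max for k in levels}
    total = sum(weights.values())
    n = len(seqs[0])
    ppm = [[0.0] * n for _ in BASES]
    for k, group in levels.items():
        cond = mean_one_hot(group)  # estimate of E(X | Y in level k)
        for r in range(4):
            for c in range(n):
                ppm[r][c] += weights[k] / total * cond[r][c]
    return ppm


# Toy data (made up): two strongly activating sequences and one weak one.
toy_seqs = ["AA", "AA", "CA"]
toy_acts = [4.0, 3.5, 1.0]
ppm_w = activation_weighted_ppm(toy_seqs, toy_acts, n_levels=2)
print(ppm_w)
```

In this toy case the weighted estimate gives base A a probability of 0.75 at the first position, whereas a plain average over the three sequences would give 2/3, because the weakly activating sequence is down-weighted.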
In the max-pooling layer, the key input feature maps reflecting the shifting diversity of TFBSs are unified into similar output feature maps (Fig. 1d, 1e and S4b). The downstream key ANs, including the output AN, share similar activation values (𝑦) for different sequences with shifted TFBSs. Hence, the motif sequences of TFBSs recognized by an AN can be regarded as a latent variable mixture model. To decouple motif mixtures, we have to find the shift latent variables that reflect the different positions of the motif sequence (the top part of Fig. 2d). Only by controlling these latent variables can we obtain consistent, real sequence motifs (𝔼(𝑿|𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛)). This key issue is neglected by all existing methods (Fig. 1b, Fig. 1c and Fig. S1). We further found that the TFBSs in the sequences may not share the same pattern. This indicates that we can find more than one motif by stacking the TFBSs separately within groups of consistent pattern. One such case is reverse-complementary sequences (bottom part of Fig. 2d). The mixing of these sequences can be controlled by another important type of latent variable in the mixture model, which we name synonymous latent variables, and we call the decoupled motifs synonymous motifs. The synonymous motifs represented by an AN should satisfy: (1) they are not shifted motifs; (2) all or part of the input variables 𝑿 are conditionally independent when controlling for the synonymous latent variables; (3) the sampled sequences grouped by synonymous motif should share similar maximum activation values, so that they are all preferred by the AN. If these conditionally independent positions have little effect on the activation values, then the AN can be regarded as a single model (SM, $y = f_s(\boldsymbol{x})$, the left column of Fig. 2e). Otherwise, the motif sequences recognized by the AN form a mixture model (MM, $y = f_m(\boldsymbol{x})$, the right column of Fig. 2e). Both SM and MM share a similar motif (the bottom part of Fig.
2e), but they differ in the sequences with the maximum activation value (SM: $\boldsymbol{x}_1$; MM: $\boldsymbol{x}_2$, $\boldsymbol{x}_3$) and in the consensus sequence ($\boldsymbol{x}_c$). Unlike in the SM, where $f_s(\boldsymbol{x}_c) \approx f_s(\boldsymbol{x}_1)$, in the MM $f_m(\boldsymbol{x}_c)$ usually deviates strongly from $f_m(\boldsymbol{x}_2)$ and $f_m(\boldsymbol{x}_3)$, and can even be negative (the top-right part of Fig. 2e). This is because $\boldsymbol{x}_c$ may not match any of the conditionally dependent motifs embedded in the AN (the bottom-right of Fig. 2e). Thus, base flipping at the conditionally dependent positions of the sequence is a kind of adversarial noise19, as discussed in CV, that can dramatically change the AN activation level or even destroy the prediction result. In addition, we also proved that severe mixing of synonymous motifs in an AN correlates with lower maximum activation and weights of the AN (see Methods). These results suggest that the mixture of synonymous motifs is noise rather than motif signal, owing to its vulnerable characteristics. Hence, for a well-trained model with weak noise, we only need to decouple the signal of each AN according to the max-pooling structure. One of the most widely used types of DCNN model is composed of general convolution layers and max-pooling layers. We took this type of DCNN as an instance and developed the NeuronMotif algorithm to uncover the motif combinatorial grammar from DCNNs. First, we designed a sampling algorithm, adapted from the genetic algorithm, to optimize seed samples, and recorded the intermediate valid sequences as the sampling result (Fig. 2c, see Methods for details).
Second, we used K-means (where K is the pooling size) to decouple the mixture signal from different sequences: the sequence set is split by clustering the shifted, similar sub-patterns in the input feature map of the max-pooling layer (Fig. 1d,e and Fig. S4b). The decoupling process can be performed backward and recursively from the deepest layer to the first layer. Third, the algorithm annotates an AN with motifs by estimating 𝔼[𝔼(𝑿|𝑌, 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑠𝑢𝑏𝑠𝑒𝑡)] from each subset of samples clustered by K-means (Fig. 1e, see Methods for details). The steps above can only decouple the mixture of a single hard AN motif with shifting diversity. A hard motif refers to a motif or motif combination with a fixed gap, which characterizes homodimer, heterodimer or multimer TFs that can be considered a stable molecular cluster binding to DNA. However, a large portion of co-binding TFs are separated by flexible intervals. Their sequence pattern is a soft motif, which is composed of more than one hard motif, with the space between any two adjacent hard motifs lying within a certain range. To decouple the hard motifs in a soft motif represented by an AN, users should run the decoupling algorithm in NeuronMotif (the second step) iteratively, several times, according to the number of hard motifs (Fig. 1e, see Methods for details).

NeuronMotif successfully decouples the motif mixture
To evaluate the performance of NeuronMotif in decoupling the motif mixture signal, we applied NeuronMotif to annotate two well-known models, DeepSEA4 and Basset5, both of which are DNA-sequence-based DCNN models with 3 general convolutional layers for genome function annotation. Basset annotates open chromatin regions and is trained on DNase-seq data. In addition to chromatin accessibility, DeepSEA also annotates TFBSs and HMs and is trained on ChIP-seq data. NeuronMotif successfully decoupled the shifted mixture motifs from layer 2 (L2) and layer 3 (L3) of both models (see Supplementary Information for all results).
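The K-means decoupling step (the second step above) can be sketched for a single max-pooling layer. This is a simplified illustration, not the NeuronMotif code: it assumes that each sampled sequence is summarized by the vector of activations entering one pooling window, and it uses a small deterministic K-means written here for self-containment. The names `kmeans`, `split_by_shift` and the toy inputs are hypothetical.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation."""
    centres = [list(points[0])]
    while len(centres) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centres))
        centres.append(list(far))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sqdist(p, centres[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_shift(seqs, pool_inputs, pool_size):
    """Split sequences into K = pool_size groups by clustering pooling inputs."""
    labels = kmeans(pool_inputs, pool_size)
    groups = [[] for _ in range(pool_size)]
    for s, lab in zip(seqs, labels):
        groups[lab].append(s)
    return groups


# Toy example (made up): the motif ACGT sits at offset 0 or offset 1, and the
# activation entering a size-2 pooling window peaks at the matching offset.
seqs = ["ACGTA", "ACGTC", "TACGT", "GACGT"]
pool_inputs = [[5.0, 0.0], [4.8, 0.1], [0.0, 5.0], [0.2, 4.9]]
groups = split_by_shift(seqs, pool_inputs, pool_size=2)
print(groups)
```

Each resulting group contains sequences whose motif occurs at the same offset, so a PPM can then be estimated per group; applying the same split recursively to upstream pooling layers would correspond to the backward decoupling described in the text.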
In the Basset model, the first- and second-layer pooling sizes are 3 and 4, so the numbers of shifted signals are 3 and 3 × 4 = 12 for L2 and L3 ANs, respectively (Fig. 3a-c). In the DeepSEA model, the first- and second-layer pooling sizes are both 4, so the numbers of shifted signals are 4 and 4 × 4 = 16 for L2 and L3 ANs, respectively. In Fig. 3, all adjacent AN motifs are shifted by 1 bp and are highly consistent. However, state-of-the-art methods, such as Kelley et al.5, Alipanahi et al.8 and Saliency Map20, cannot deal with the mixture signal, which leads to much lower information content and very noisy signals. The NeuronMotif-annotated results of the Basset and DeepSEA models showed that ANs extract various kinds of motifs. Here, we took Basset as an instance. Some ANs extract important TF motifs correlated with Basset's prediction targets (DNase I sensitivity). Many AN motifs can be matched with known motifs in the JASPAR21 database (Fig. 3a and 3b). Some matched TFs, like NFI and the co-binding TFs CTCF-CEBP, are highly correlated with chromatin openness22,23. In comparison, the interpretation results of existing methods can hardly be matched with any known motifs in JASPAR. Statistically, NeuronMotif found more motifs, and more accurate motifs, from the JASPAR database (Fig. 5a and 5c). Besides, some important functional sequence features and their reverse complements can also be identified from AN motifs. A typical example is the AAC triplet repeat feature extracted by the Basset model (Fig. 3c). It has been reported that repeats of the AAC triplet are enriched in introns24.
As intron regions are usually open for gene transcription, it is reasonable that the Basset model extracts this feature.

DCNN diagnosis and architecture design guidance
From the NeuronMotif results for the DeepSEA and Basset models, we found that the outputs of some ANs were always zero no matter how we changed the input sequences. We called them dead ANs (Fig. 4a). Dead ANs are redundant because they cannot affect the downstream network. During the sampling process of annotating the DeepSEA model via NeuronMotif, we found that the sampling algorithm could not sample even one sequence that activates some ANs in L2 and L3. For example, a total of 150 ANs in L2 and 120 ANs in L3 are dead ANs in the DeepSEA model. Another problem is that some ANs may recognize synonymous motifs. We diagnosed this problem with two indicators of motifs, based on the phenomenon that an AN may represent synonymous motifs. One indicator is the activation value of the motif consensus sequence, and the other is the maximum activation value of the sequences sampled for motif estimation. We found that if the two indicators deviate severely from each other, or the max activation value is close to zero, then the corresponding AN may suffer from the synonymous motifs problem. The two indicators of Basset L2 ANs are almost consistent (the first column of Fig. 4c). However, in L3, the activation values of many decoupled motifs' consensus sequences are negative and deviate severely from the maximum activation value (the second column of Fig. 4c).
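The two diagnostic indicators above (activation of the consensus sequence versus maximum activation over the sampled sequences) can be illustrated with a toy example. Everything here is hypothetical: the two toy neurons, the sample sets, and the thresholds `ratio` and `eps` used to flag an AN are assumptions for illustration and are not values given in the text.

```python
from collections import Counter


def consensus(seqs):
    """Most frequent base at each position of a set of aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def diagnose(f, seqs, ratio=0.5, eps=1e-6):
    """Return (consensus activation, max activation, suspicious flag)."""
    y_cons = f(consensus(seqs))
    y_max = max(f(s) for s in seqs)
    # Assumed rule: flag if max activation is near zero or the two deviate.
    suspicious = y_max <= eps or y_cons < ratio * y_max
    return y_cons, y_max, suspicious


def matches(seq, motif):
    """Number of positions at which seq agrees with motif."""
    return sum(a == b for a, b in zip(seq, motif))


# Toy pre-ReLU neurons: one motif versus two synonymous motifs.
def f_single(seq):
    return matches(seq, "AAGG") - 2.5


def f_mixture(seq):
    return max(matches(seq, "AAGG"), matches(seq, "CCTT")) - 2.5


single_samples = ["AAGG", "AAGT", "CAGG", "AAGG"]
mixture_samples = ["AAGG", "AAGT", "CATT", "CCTT", "ACTT"]

print(diagnose(f_single, single_samples))
print(diagnose(f_mixture, mixture_samples))
```

For the single-motif neuron the consensus sequence is itself a top-scoring sequence, so the two indicators agree. For the mixture neuron the consensus (here AATT) is a blend of the two synonymous motifs that matches neither, and its activation falls below zero while the maximum activation stays positive.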
Most problematic motifs are caused mainly by stacking sequences of different synonymous motifs whose maximum activation value is close to zero (see Methods for details and Supplementary Information for a case). To overcome these problems, the DCNN architecture should be optimized to avoid the mixed signals of synonymous motifs. Both DeepSEA and Basset use a large convolutional kernel size (> 8) and a large pooling size (3 or 4) in each layer. For an AN with a given receptive field, implementing its sub-structure with a larger kernel size and pooling size tends to cause weaker coupling among the sub-structures of the previous-layer ANs (Fig. 4e). When using the same training set and optimization method to train the model, we found that the weaker the coupling among the ANs, the more sensitive the model is to noise generated by synonymous motifs (see Methods and Fig. S6). DeepSEA adopted strong regularization methods that successfully suppress the learning of this noise (Fig. 4b), but at the cost of producing dead ANs. In the field of CV, building deeper networks with smaller kernels and pooling structures has been found to be a more robust strategy with better performance25. Thus, we built new 10-convolutional-layer models and trained them on the Basset Dataset (BD-10) and the DeepSEA Dataset (DD-10), respectively. The synonymous motif problem was significantly suppressed in BD-10 and DD-10 (the third column of Fig. 4b and Fig. 4c), and few dead kernels were found. Furthermore, both BD-10 and DD-10 show much better prediction performance (Fig. 4f and Fig. 4g) than the original models. These results demonstrate how NeuronMotif can be used to help diagnose DCNNs and guide architecture design.
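How kernel size and pooling size determine the receptive field of an AN, and the number of shifted signals it mixes, can be made concrete with a short calculation. The sketch below assumes stride-1 convolutions and non-overlapping max-pooling. The Basset kernel sizes of 11 and 7 for the second and third layers are an assumption taken from the published Basset architecture rather than from this text; with the stated first-layer receptive field (19 bp) and pooling sizes (3 and 4), they reproduce the 51 bp and 132 bp receptive fields and the 3 and 12 shifted signals quoted for Basset.

```python
def receptive_fields(layers):
    """Receptive field and shift count for a neuron in each convolutional layer.

    layers: list of (kernel_size, pool_size) pairs, one per convolutional
    layer, assuming stride-1 convolution followed by non-overlapping pooling.
    Returns a list of (receptive_field_bp, n_shifted_signals) per layer.
    """
    out = []
    rf, jump = 1, 1  # jump = input positions spanned by one step at this depth
    for kernel, pool in layers:
        rf += (kernel - 1) * jump   # stride-1 convolution widens the field
        out.append((rf, jump))      # jump = product of upstream pooling sizes
        rf += (pool - 1) * jump     # pooling window widens the field further
        jump *= pool                # pooling stride equals the pooling size
    return out


# Basset-like stack: kernels 19, 11, 7 (11 and 7 assumed), pooling 3, 4, 4.
basset = receptive_fields([(19, 3), (11, 4), (7, 4)])
for layer, (rf, shifts) in enumerate(basset, start=1):
    print(f"layer {layer}: receptive field {rf} bp, {shifts} shifted signal(s)")
```

The same function can be used to compare candidate architectures: for a fixed target receptive field, smaller kernels and pooling sizes require more layers but mix fewer shifted signals per pooling step.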
Accuracy and completeness of motif discovery in different layers of DCNN

To study which layer is better for motif discovery in a DCNN model, we used NeuronMotif to interpret the shallow convolutional layers with a receptive field of around 19 bp and the deepest convolutional layers in three models with 3, 5 and 10 convolutional layers (Basset, BD-5 and BD-10), all trained on the data from the Basset paper. To measure interpretation performance, we matched the decoupled motifs to the motifs in the JASPAR [21] database using Tomtom [26]. For each AN matched to known motifs (q-value < 0.1), we selected the best-matched motif in JASPAR and took the similarity between the discovered motif and the JASPAR motif (q-value) as the performance of the AN. As the number of ANs differs between layers, we selected only the q-values of the top 100 ANs for further analysis. Within a given DCNN model, the motifs discovered from the deepest convolutional layer outperform those from the shallow layer with a receptive field of around 19 bp (each column in Fig. 5a). We further compared the layers with similar receptive fields (the first row in Fig. 5a) and the deepest convolutional layers (the second row in Fig. 5a) among different models. The deeper models (BD-5 and BD-10) outperform the Basset model: they discover many more known motifs (Fig. 5b), and their motifs match the JASPAR database better (each row in Fig. 5a). Based on these comparisons, we recommend using deep ANs for motif discovery and representation. For example, we built a motif dictionary from layer 10 (L10) in the BD-10 model. This dictionary contains 9056 motifs.
Among them, 7974 motifs are matched to at least one of 399 JASPAR motifs by Tomtom (q-value < 0.01; one of the best-discovered motifs matched to each JASPAR motif is shown in Fig. 5d), and the remaining 1082 motifs are novel.

NeuronMotif successfully uncovers motif grammar

In previous works [21, 27], motif combination grammar is usually represented by hard motifs. They depict a soft motif by enumerating different intervals among the components of the hard motif (Fig 5a and 5b). In comparison, a DCNN structure is more powerful for describing these soft motifs when the receptive field is long enough. Here, we take the L10 ANs in DD-10 as examples to study AN soft motifs. We assume that each L10 AN represents no more than two hard motifs, so we ran the decoupling algorithm twice in NeuronMotif, which generates a total of 256 AN motifs (Fig. 6a and 6b, see Methods for details). These AN motifs enumerate the combinations of hard motifs with gaps of various sizes. From these AN motifs, we can slice the shared hard motifs to build a motif dictionary. Based on the dictionary and all AN motifs, we can summarize the interval range between any two adjacent hard motifs and build the syntax tree (Fig. 6a and 6b). Some of the soft motifs are supported by the literature. For example, an AN in DeepSEA represents the soft CTCF homodimer with an interval of around 58 bp, which plays important roles in the transcriptional process of cancer and germ cell development [28] (Fig. 6a). We also found that DDIT3::CEBPA can co-bind with CTCF, which has not been reported in the previous literature. Interestingly, CTCF-DDIT3::CEBPA appears as a conserved hard trimeric motif that also occurs in the Basset model (Fig. 3b), which supports the reliability of this discovery.

We further used ATAC-seq footprinting to validate the discovered AN motif grammars. ATAC-seq uses the Tn5 transposase to cut DNA into fragments. If TFs or other molecules are bound to the DNA, the cutting frequency is affected.
For each AN, we aligned the Tn5 transposase cutting frequencies of the top 3000 sequences (144 bp) with the highest AN activation values in the test dataset. We extended the footprinting region to 1000 bp in total. Most ANs have their own footprints in ATAC-seq data from five cell types or tissues (Fig. 6c and Fig. 6d, see Supplementary Information for other ANs). The footprints of the soft CTCF-DDIT3::CEBPA homodimer from the five cell types or tissues share a pattern of three peaks and two valleys (Fig. 6c), whereas the footprinting signals of the soft NFI homodimer are significant only in prostate tissue and LNCaP cell lines, which supports the notion that the NFI family can regulate prostate-specific gene expression [29] (Fig. 6d). These results indicate that some multimer motif grammars are cell-type specific. To further confirm that the footprints are caused by specific TF binding, we calculated the distribution of motif-matched positions for both the CTCF-DDIT3::CEBPA and NFIC motifs (Fig. 6e and Fig. 6f). The peaks of the motif-matched positions coincide with the footprint valleys. Together, these results suggest that NeuronMotif provides a novel way to discover soft multimer motif grammar on the genome and to better depict multimeric TF motifs.

Discussion

In summary, we presented NeuronMotif as an effective algorithm to reveal the cis-regulatory motif grammar learned by DCNN models that use DNA sequence to annotate genome function. We proposed a statistical form of AN motif representation and a latent variable mixture model to understand each convolutional neuron.
Taking the max-pooling-convolutional structure as an instance, we uncovered the signal mixing mechanism, which includes shifting latent variables and synonymous latent variables. NeuronMotif uses a K-means-based algorithm to decouple the latent variable mixture and a sampling strategy adapted from the genetic algorithm for motif estimation. We evaluated the interpretation performance of NeuronMotif on DeepSEA, Basset and some in-house deeper models. Many of the uncovered motif combinatorial grammars are supported by the literature and by ATAC-seq data. Finally, we showed that NeuronMotif results can be used for model diagnosis and to guide model structure design for better prediction performance and motif extraction.

Beyond interpreting cis-regulatory motif grammar from DCNNs, the application of NeuronMotif may be extended to many other problems. A DNA sequence is a special kind of one-dimensional discrete data with four elements. It is possible to apply NeuronMotif to DCNNs for amino acid sequences of proteins or for continuous sequences such as different kinds of sequencing profiles.

Some issues still need to be addressed to further expand the application of NeuronMotif. For instance, NeuronMotif currently focuses only on the max-pooling CNN structure. Many new DCNN structures such as ResNet and DenseNet have been put forward in recent years. As these structures show better performance in CV, it is valuable to adapt the NeuronMotif method to these more general and complex DCNN structures in genomics studies. In the future, we envision that DCNN models interpreted by NeuronMotif will advance our ability to discover and summarize complicated regulatory rules, model the transcriptional cis-regulatory process and understand the DCNN black box itself.

Acknowledgements

We thank Z. Duren and H. Fang for valuable suggestions on motif discovery and related biological issues. This work was supported by the National Natural Science Foundation of China (No. 62050152, 61721003) and the National Key R&D Program of China (No. 2020YFA0906900).

Competing interests

Tsinghua University has a patent pending for NeuronMotif.

Author contributions

Z.W., W.H.W. and X.W. conceived the main idea of the study. Z.W. completed the theorem proof and formula derivation. K.H. repeated and checked the proof and inference. Z.W. developed the algorithm, trained the DCNN models, designed experiments and implemented all the experiments. R.J. provided and maintained the computing cluster. W.H.W. and X.W. designed some experiments and supervised the study. All authors wrote and revised the manuscript.

Methods

Statistical definition and estimation of PPM represented by a convolutional neuron

An AN has its own sub-structure in the DCNN model (Fig. 2a and S4a). The sub-structure includes an input $\mathbf{x}$ and an output $y$. The relation between $\mathbf{x}$ and $y$ is defined by a non-linear function $y = f(\mathbf{x})$ that depends on the sub-structure of the AN, because all upstream ANs in the sub-structure can affect the characteristics of the AN. The input of each AN is the output of the ANs in the previous layer. The sub-structure also determines the receptive field size $N$ (the length of $\mathbf{x}$). For each valid DNA input sequence ($\mathbf{x}$ such that $f(\mathbf{x}) > 0$), $\mathbf{x}$ is a $4 \times N$ one-hot matrix. It can be sampled from a random variable matrix $\mathbf{X} \sim p(\mathbf{x})$.
$\mathbf{X}$ contains $4 \times N$ random variables $X_{b,j}$ ($b = \mathrm{A, C, G, T}$; $j = 1, 2, \ldots, N$). Each column can be modeled as an independent multinomial distribution, $\mathbf{X}_{\cdot,j} \sim \mathrm{Multi}[1, \boldsymbol{\pi}_{\cdot,j}]$, where the $4 \times N$ probability matrix $\boldsymbol{\pi}$ is the PPM that characterizes the nucleotide preference of the sequence motif. By the properties of the multinomial distribution, $\boldsymbol{\pi}_{\cdot,j} = \mathbb{E}\,\mathbf{X}_{\cdot,j}$, so the PPM can be estimated by sampling $\mathbf{X}_{\cdot,j}$ and calculating the element-wise average $\bar{\mathbf{x}}_{\cdot,j}$.

As the unknown distribution $p(\mathbf{x})$ is to be estimated, we cannot sample $\mathbf{X}$ directly. $\mathbf{X}$ is not a free random variable but depends on the free output random variable $Y \sim p(y)$ through $Y = f(\mathbf{X})$. Based on the identity $\mathbb{E}\,\mathbf{X} = \mathbb{E}[\mathbb{E}(\mathbf{X} \mid Y)]$, we can first sample $\mathbf{X} \mid Y = y$ to estimate $\mathbb{E}(\mathbf{X} \mid Y = y)$, which represents the PPM for a specific activation value or affinity $y$. Given an arbitrary distribution $p(y)$, we can obtain the PPM by taking a weighted average of these PPMs with different affinities:

\[
\mathbb{E}(\mathbf{X}) = \mathbb{E}_Y[\mathbb{E}_{\mathbf{X}}(\mathbf{X} \mid Y)]
= \int_0^A \mathbb{E}[\mathbf{X} \mid Y = y]\, p(y)\, dy
= \lim_{m \to +\infty} \sum_{i=1}^{m} \mathbb{E}\left[\mathbf{X} \,\middle|\, Y = \tfrac{i}{m}A\right] p\left(\tfrac{i}{m}A\right) \Delta y
\]
\[
= \lim_{m \to +\infty} \sum_{i=1}^{m} \mathbb{E}\left[\mathbf{X} \,\middle|\, Y = y,\ y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right] P\left(y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right)
\]

Numeric estimation of the PPM for an AN needs enough valid sequence samples $\mathbf{x}$. In this work, we set $\mathrm{ReLU}[f(\mathbf{x})] = \max\{f(\mathbf{x}), 0\}$ as the activation function of each convolutional neuron. Therefore, the valid sequence dataset is $\mathbf{X}^+ = \{\mathbf{x} \mid f(\mathbf{x}) > 0 \wedge |\mathbf{x}|^2 = N\}$. Here, $f(\mathbf{x}) > 0$ constrains the activation value of a valid sequence to be positive so that it can activate the AN, and $|\mathbf{x}|^2 = N$ constrains the length of $\mathbf{x}$ to match the AN receptive field size $N$. For convenience, we rewrite the sequence dataset as $\mathbf{X}^+ = \{\mathbf{x}_i\}_{i=1}^{|\mathbf{X}^+|}$ and the corresponding activation value set as $V = \{f(\mathbf{x}_i)\}_{i=1}^{|\mathbf{X}^+|}$. The maximum activation value is $A = \max(V)$.

The probability of a TF binding to a DNA sequence depends on the binding affinity [30]. To sample sequences in a way that reflects their affinity levels $y$, sequences with high affinity should be sampled at higher frequency. Here, for ease of calculation, we set the probability density function of $Y$ to a linear function, $p(y) = \frac{2}{A^2}\, y,\ y \in [0, A]$ (Fig. 2b). In practice, we split the interval $[0, A]$ into $m$ bins ($m = 20$) to merge sequences with similar activation values into PPMs (Fig. 2b). In this way, we obtain the average of PPMs weighted by activation values. For each bin $i$ ($i = 1, 2, \ldots, m$), the sequence index set is

\[
J_i = \left\{ j \,\middle|\, \tfrac{i-1}{m}A < f(\mathbf{x}_j) \le \tfrac{i}{m}A \ \wedge\ \mathbf{x}_j \in \mathbf{X}^+ \right\}.
\]

Sequences in bin $i$ share similar activation values. Thus, their average activation value and PPM can be calculated by

\[
\bar{V}_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]} = \frac{\sum_{j \in J_i} f(\mathbf{x}_j)}{|J_i|}
\]
\[
\mathbb{E}\left[\mathbf{X} \,\middle|\, Y = y,\ y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right] \approx \mathbf{PPM}_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]} = \frac{\sum_{j \in J_i} \mathbf{x}_j}{|J_i|}
\]

where $|J_i|$ is the number of sequences in sequence index set $i$. The probability or weight of each bin can be estimated by

\[
P\left(y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right) \approx P_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]} = \frac{\bar{V}_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]}}{\sum_{i=1}^{m} \bar{V}_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]}}
\]

Finally, $\mathbf{PPM}_{4 \times N}$ of the AN can be estimated by the average of the per-bin PPMs weighted by the activation value:

\[
\mathbb{E}(\mathbf{X}) = \mathbf{PPM}_{4 \times N} \approx \sum_{i=1}^{m} \mathbf{PPM}_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]}\, P_{\left(\frac{i-1}{m}A,\, \frac{i}{m}A\right]}
\]

The estimation above assumes that the TFBSs are at the same relative position in all input sequences and that all of them share the same motif. In other words, it only works for SM neurons. However, this assumption does not hold for most ANs, especially ANs in deeper layers, where the estimation result is a mixture of different motifs. The random variable matrix $\mathbf{X}$ can then be considered as an MM.
Hence, we first need to find the latent variables that split the dataset $\mathbf{X}^+$ into subsets $\mathbf{X}_1^+, \mathbf{X}_2^+, \mathbf{X}_3^+, \ldots$, each of which can be considered as an SM. The estimation should be applied to each subset separately. It generates several motifs $\mathbf{PPM}_1, \mathbf{PPM}_2, \mathbf{PPM}_3, \ldots$, which are controlled by different conditions of the latent variables.

Discovering latent variables in a mixture model of neuron

The activation value of the AN is the only indicator of how well a sequence matches. Because of the powerful representation ability of the neural network, sequences with high activation values for an AN may be composed of completely different key TFBSs at various relative positions. This characteristic shows that the sequences recognized by an AN can be considered as a latent variable mixture model. Sequences matched by different sub-models in the MM can activate the AN at the same level. Hence, the activation value of the AN is a unified or mixed signal that cannot distinguish sequences with different TFBSs. To find the mixing mechanism of an AN, we can investigate the activation values of all upstream ANs (feature maps), which reflect the TFBS diversity in the sequences and define the latent variables. When these latent variables are controlled, the sampled sequences share the same pattern (hard motif), and the sequences are matched by only one sub-model in the MM. In this way, the sampled sequences share similar TFBSs located at the same relative position. Only by obtaining these sampled sequences can we estimate the AN motif.
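For one such subset of sequences, the activation-weighted PPM estimate defined in the previous subsection (bins over $[0, A]$, per-bin PPMs, weights proportional to the mean activation of each bin) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function name `estimate_ppm` is hypothetical, sequences are plain strings over A, C, G, T of equal length, and empty bins are simply skipped.

```python
import math
from typing import List, Sequence

BASES = "ACGT"  # row order of the returned PPM


def estimate_ppm(seqs: Sequence[str], acts: Sequence[float],
                 m: int = 20) -> List[List[float]]:
    """Return a 4 x N PPM weighted by activation value (hypothetical helper)."""
    # Keep only valid sequences, i.e. those with positive activation.
    valid = [(s, a) for s, a in zip(seqs, acts) if a > 0]
    if not valid:
        raise ValueError("no sequence activates the AN")
    n = len(valid[0][0])
    a_max = max(a for _, a in valid)

    # 0-based bin i holds activations in (i/m * A, (i+1)/m * A].
    bins = [[] for _ in range(m)]
    for s, a in valid:
        i = min(m - 1, max(0, math.ceil(a / a_max * m) - 1))
        bins[i].append((s, a))

    # Mean activation per bin; the bin weight is its share of the total.
    mean_act = [sum(a for _, a in b) / len(b) if b else 0.0 for b in bins]
    total = sum(mean_act)

    ppm = [[0.0] * n for _ in BASES]
    for b, v in zip(bins, mean_act):
        if not b:
            continue
        weight = v / total
        for s, _ in b:
            for j, ch in enumerate(s):
                # Each sequence contributes 1/|J_i| to its bin's PPM.
                ppm[BASES.index(ch)][j] += weight / len(b)
    return ppm
```

Because every per-bin PPM has columns that sum to one and the bin weights sum to one, each column of the result is again a probability distribution over the four bases.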
In practice, when analyzing the feature maps of the sampled sequences, we used K-means to cluster the feature maps of the convolutional layer and found shifted signals among the clusters. However, these clusters cannot be rebuilt from the feature map of the downstream max-pooling layer. So, the max-pooling operation unifies the shifted signals of various sequences, which removes the difference among clusters. In other words, the AN only tries to detect whether a TFBS exists in the sequence; the position of the TFBS in the sequence is not important to the final AN output. Subsequently, we found that the best cluster number K (the maximum shifting offset) is the same as the max-pooling size. Each offset within K defines a shifting latent variable.

The side-effect for the ANs representing different synonymous motifs

Following the definition of synonymous motifs for an AN, if an AN ($y = f(\mathbf{x})$) represents the mixture of two synonymous motifs, let $\mathbf{x}_1, \mathbf{x}_2$ be the flattened one-hot vectors of the maximum-activation sequences of the two motifs, respectively. They should then satisfy $f(\mathbf{x}_1) \approx f(\mathbf{x}_2)$, i.e. $f(\mathbf{x}_1) - f(\mathbf{x}_2) \to 0$. First, we studied an AN in the first layer ($y = f(\mathbf{x}) = \mathbf{k}\mathbf{x}^T + b$). The activation values of $\mathbf{x}_1, \mathbf{x}_2$ are

\[
\begin{cases}
y_1 = \mathbf{k}\mathbf{x}_1^T + b \\
y_2 = \mathbf{k}\mathbf{x}_2^T + b
\end{cases}
\]

where $\mathbf{k}$ is the weight of the AN and $b$ is the bias or intercept. Based on these two equations, we obtain

\[
(y_1 - y_2)^2 = |\mathbf{k}|^2 \left[ |\mathbf{x}_1|^2 - 2\,\mathbf{x}_1\mathbf{x}_2^T + |\mathbf{x}_2|^2 \right]
\]

where both $|\mathbf{x}_1|^2$ and $|\mathbf{x}_2|^2$ are equal to the length of the sequence. If the two synonymous motifs differ greatly, the sequences $\mathbf{x}_1, \mathbf{x}_2$ matched by them should share far fewer bases ($\mathbf{x}_1\mathbf{x}_2^T \to 0$).
In the extreme case, the two sequences are totally different ($\mathbf{x}_1 \perp \mathbf{x}_2$, $\mathbf{x}_1\mathbf{x}_2^T = 0$). From the conditions above, namely $(y_1 - y_2)^2 \to 0$, $\mathbf{x}_1\mathbf{x}_2^T \to 0$ and the constant value of $|\mathbf{x}_1|^2 + |\mathbf{x}_2|^2$, we can infer that $|\mathbf{k}|^2 \to 0$. Thus, the maximum activation value follows $y = \mathbf{k}\mathbf{x}_1^T + b \to b$. The AN becomes a dead AN if $b \le 0$. So, compared with an SM AN, which cannot represent two synonymous motifs, an MM AN representing a mixture of synonymous motifs exhibits a lower maximum activation value and smaller weights. The importance of this kind of AN for downstream ANs will be suppressed.

In a DCNN without pooling layers, we further investigated an AN representing two synonymous motifs in a deeper convolutional layer $i$. We assumed that no AN represents a mixture of synonymous motifs in layers 1 to $i - 1$. Under this assumption, the feature maps $\mathbf{x}_1^{(i-1)}, \mathbf{x}_2^{(i-1)}$ of $\mathbf{x}_1, \mathbf{x}_2$ at layer $i - 1$ differ greatly, especially for the key features with high activation values. Negative values of the feature map are set to 0 by the ReLU activation function ($\mathbf{x}_1^{(i-1)} \ge 0$, $\mathbf{x}_2^{(i-1)} \ge 0$). A key feature that is highly activated in $\mathbf{x}_1^{(i-1)}$ may be weakly activated or 0 in $\mathbf{x}_2^{(i-1)}$ (for a key feature $j$, $\left(x_{1j}^{(i-1)} - x_{2j}^{(i-1)}\right)^2$ will be larger than for similar sequences). This indicates that the sequences matched to the two synonymous motifs can be distinguished by the feature map of layer $i - 1$. The activation of the AN in layer $i$ is a linear combination of the feature map of the previous layer ($y = f(\mathbf{x}) = g[\mathbf{x}^{(i-1)}] = \mathbf{k}[\mathbf{x}^{(i-1)}]^T + b$). Similarly, for an AN in layer $i$, we obtain

\[
(y_1 - y_2)^2 = \sum_j k_j^2 \left(x_{1j}^{(i-1)} - x_{2j}^{(i-1)}\right)^2
\]

where $j$ indexes the features in layer $i - 1$. If this AN mixes the signals of $\mathbf{x}_1^{(i-1)}, \mathbf{x}_2^{(i-1)}$ ($(y_1 - y_2)^2 \to 0$), the result is the same as in the first layer ($\sum_j \left(x_{1j}^{(i-1)} - x_{2j}^{(i-1)}\right)^2 \uparrow\ \Rightarrow\ |\mathbf{k}|^2 \to 0$).
However, an AN representing a mixture of synonymous motifs usually also represents a strong, consistent main motif (Fig. 2e). In layer $i - 1$ of this AN, a feature $s$ representing the strong, consistent main motif and a feature $j$ representing synonymous motifs may satisfy $x_{1s}^{(i-1)} \approx x_{2s}^{(i-1)} \gg x_{1j}^{(i-1)} \ne x_{2j}^{(i-1)}$ or $k_s \gg k_j$. Although $|\mathbf{k}|^2$ of this kind of AN is smaller than that of an AN that cannot recognize synonymous motifs, it can still reach a similar activation value $y = \mathbf{k}[\mathbf{x}^{(i-1)}]^T + b$ in two ways rather than becoming a dead kernel. One way is to increase $x_{1s}^{(i-1)}$ and $x_{2s}^{(i-1)}$ through the weight $\mathbf{k}^{(i-2)}$ of the neurons in the previous layer ($[\mathbf{x}_{\cdot}^{(i-1)}]^T = \mathrm{ReLU}(\mathbf{k}^{(i-2)}[\mathbf{x}^{(i-2)}]^T + b)$). The other way is to increase $k_s$ and decrease $k_j$, which may greatly reduce the side-effect of the mixture of synonymous motifs. In a well-trained model, compared with the large weight on the high activations of the subsequence matched by the main motif, the signal generated by the subsequence matched by a synonymous motif can be neglected. Otherwise, an AN that represents only strong synonymous motifs loses its robustness (Fig. 2e).

Weighted sampling algorithm adapted from the genetic algorithm

The sequence sampling process is necessary to estimate the AN motifs. The first operation is the initialization of seed sequences. We randomly generated 5000 seed sequences that match the receptive field size. For each sequence, we randomly replaced a specific sub-sequence with one motif sequence of the ANs in the previous layer.
The position and the previous-layer AN were randomly selected according to the normalized maximum contribution score:

\[
c_{ij} = \frac{\max\{0,\ w_{ij} A_j\}}{\sum_{i,j} \max\{0,\ w_{ij} A_j\}}
\]

where $i$ is the position number, $j$ is the previous-layer AN number and $A_j$ is the maximum activation value of the previous-layer AN $j$.

The second operation is sequence optimization. The sequence $\mathbf{x}$ is discrete, so we cannot use the gradient descent method directly; instead, we adopted and adjusted the genetic algorithm. In one generation, we used the normalized gradient value $\mathbf{g} = \partial f(\mathbf{x}) / \partial \mathbf{x}$ as the probability that guides the random selection of better mutation bases:

\[
g'_{ij} =
\begin{cases}
g_{ij}, & g_{ij} > 0 \\
e^{g_{ij}}, & g_{ij} < 0
\end{cases}
\qquad
p_{ij} = \frac{g'_{ij}}{\sum_i g'_{ij}}
\]

where $i$ is the base (A, C, G or T) and $j$ is the position. We kept the 10% of samples with the top activation values in each generation. We randomly shifted 20% of the sequence samples based on the DCNN structure. The remaining samples were generated by roulette wheel selection and the crossover operation. The total number of sequences did not change between generations. The optimization stopped once the best mean activation value per generation had not increased for 10 iterations.

The third operation is sampling. At the end of each iteration of the genetic algorithm, sequences with positive activations were collected as samples. Duplicated sequences were removed. Based on the maximum activation value of the existing samples, we split the activation value interval into 20 bins. We kept no more than 5000 samples in each bin. If a bin overflowed, we randomly selected 5000 samples from it. See Supplementary Information for the pseudo code of this algorithm.
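The gradient-guided mutation step described above can be illustrated with a short sketch. It is not the authors' code: the function names `mutation_probabilities` and `mutate` are hypothetical, the gradient is assumed to be already normalized (the text does not say how), a gradient of exactly zero is treated like the negative case (an assumption, since the text defines only $g_{ij} > 0$ and $g_{ij} < 0$), and mutating one position per call is also an assumption.

```python
import math
import random
from typing import List, Sequence

BASES = "ACGT"  # row order of the gradient and probability matrices


def mutation_probabilities(grad: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn a 4 x N gradient into per-position base probabilities.

    g'_ij = g_ij if g_ij > 0, else exp(g_ij); p_ij = g'_ij / sum_i g'_ij.
    """
    g_prime = [[g if g > 0 else math.exp(g) for g in row] for row in grad]
    n_pos = len(grad[0])
    col_sums = [sum(g_prime[i][j] for i in range(4)) for j in range(n_pos)]
    return [[g_prime[i][j] / col_sums[j] for j in range(n_pos)]
            for i in range(4)]


def mutate(seq: str, probs: Sequence[Sequence[float]],
           rng: random.Random) -> str:
    """Replace one randomly chosen position with a base drawn from probs."""
    j = rng.randrange(len(seq))
    weights = [probs[i][j] for i in range(4)]
    base = rng.choices(BASES, weights=weights, k=1)[0]
    return seq[:j] + base + seq[j + 1:]
```

Since every $g'_{ij}$ is strictly positive, each base keeps a non-zero chance of being drawn, so the search can still explore bases that the gradient currently disfavors.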
Shifting latent variable discovery and decoupling algorithm

For one AN, we need an algorithm that splits the sample set according to the latent variables, which depend on the DCNN structure. Going from the deep layers to the shallow layers of the DCNN model, whenever the result of a convolutional layer was the input of a max-pooling layer, the algorithm calculated the feature map of the convolutional layer and used K-means (K is the max-pooling size) to cluster the sequence samples into K subsets according to the features in the feature maps. The algorithm continued to cluster and split each subset recursively every time it found that the result of a convolutional layer was the input of a max-pooling layer. Finally, the number of subsets is $\prod_{l=1}^{L-1} k_l$, where $L$ is the layer number of the AN and $k_l$ is the pooling size of the pooling operation applied to convolutional layer $l$. Based on each subset, we obtained the numerical estimation of the PPMs. The algorithm can be applied again to the newly generated subsets to decouple the second most important shifting motif if there are enough samples. This process is shown in Fig. 2d and 2e. See Supplementary Information for details and the pseudo code of this algorithm.

Algorithm implementation

NeuronMotif was implemented in Python. It depends on the TensorFlow and Keras packages. The current version of NeuronMotif can only be applied to DCNNs implemented in TensorFlow or Keras. We implemented only a CPU version of NeuronMotif, so it does not depend on a GPU. The scripts are parallelized and can be run across the nodes of a computing cluster. The memory consumption depends on the DCNN structure and the AN receptive field size. We ran the program on 4 servers. Each server contains 2 CPUs with 28 cores (Intel E5-2680) and 128 GB of memory. For each DCNN model mentioned in this work, the program can finish the decoupling of all convolutional ANs in about 3 days.
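One level of the decoupling step described above (cluster the convolutional feature maps of the samples into K groups, with K equal to the max-pooling size) can be sketched as follows. This is an illustration under stated assumptions, not NeuronMotif's implementation: a plain Lloyd's K-means with deterministic farthest-point initialization is swapped in for whatever K-means variant the authors used, feature maps are assumed to be flattened into lists of floats, and the names `kmeans_labels`, `split_by_shift` and `n_subsets` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: Sequence[Vector], k: int,
                  max_iter: int = 100) -> List[int]:
    """Cluster points into k groups and return one label per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic start: first point, then repeatedly the farthest point.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(_dist2(p, c) for c in centers))
        centers.append(list(far))
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: _dist2(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_shift(samples: Sequence[str], feature_maps: Sequence[Vector],
                   pool_size: int) -> List[List[str]]:
    """Split samples into pool_size subsets according to their feature maps."""
    labels = kmeans_labels(feature_maps, pool_size)
    groups: List[List[str]] = [[] for _ in range(pool_size)]
    for sample, lab in zip(samples, labels):
        groups[lab].append(sample)
    return groups


def n_subsets(pool_sizes: Sequence[int]) -> int:
    """Number of subsets after splitting once per upstream pooling layer."""
    return math.prod(pool_sizes)
```

Calling `split_by_shift` again on each returned subset with the feature maps of the next shallower convolutional layer gives the recursive splitting, and `n_subsets` gives the resulting subset count, for example 16 for two upstream pooling layers of size 4.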
Rebuilding and decoupling the DeepSEA and Basset models

DeepSEA and Basset are both 3-convolutional-layer models implemented in Torch, which is not compatible with NeuronMotif. We rewrote these two models with TensorFlow and Keras. We tried to keep the architecture, regularization, optimizer and other settings consistent with the previous studies. We trained the models on the datasets that were used for training the original models. Each dataset was split into a training set, a validation set and a test set according to the original papers. We also followed the training strategy described in the papers. We trained these two models with a single Nvidia P100 GPU card.

We applied NeuronMotif to DeepSEA and Basset. DeepSEA had (320, 480, 960) kernels in its three convolutional layers. The max-pooling size was 4 for every convolutional layer. Theoretically, we would obtain 320, 480 × 4 = 1920 and 960 × 4 × 4 = 15360 AN motifs for L1, L2 and L3, respectively. However, some of them were absent because dead kernels and low-information-content motifs were excluded. Similarly, Basset had (300, 200, 200) kernels and its max-pooling sizes were (3, 4, 4). Theoretically, we would obtain 300, 200 × 3 = 600 and 200 × 3 × 4 = 2400 motifs from L1, L2 and L3 of the Basset model, respectively.

Synonymous motif mixture detection and diagnosis

For the motif estimated from each sample subset, we calculated the maximum activation value and the activation value of the consensus sequence. The maximum activation value is obtained by feeding all sample sequences to the sub-structure of the AN. The consensus sequence is obtained from the PPM of the motif.
For each position, the nucleotide base with the largest probability among the 4 bases in the PPM was selected as the nucleotide of the consensus sequence. We fed the consensus sequence to the sub-structure of the AN and obtained its activation value. For all motifs in the same layer of the DCNN model, we can draw a scatter plot to check whether serious synonymous motif mixture exists. It can be diagnosed by observing whether the activation values of the consensus sequences deviate from the corresponding maximum activation values. The more consensus sequences with low activation values, the more synonymous motif mixture there is in the DCNN model.

Problematic neuron analysis

We investigated some problematic ANs in the Basset model to find which part of the discovered motif prevents the consensus sequence from activating the AN. The cause is that the sub-sequences that play a key role in activating the AN are inconsistent at a certain position across the sampled sequences. So, the consensus sequence of the motif cannot represent these sampled sequences. We call such motifs and their consensus sequences inconsistent; otherwise, we call them consistent. Among the motifs of an AN, the consensus sequences of the consistent motifs can activate the AN, but those of the inconsistent motifs cannot. It is difficult to distinguish them by eye because the information contents at different positions are almost the same.

Here, we took one inconsistent and one consistent motif consensus sequence as examples. We aligned the two sequences and slid a 51 bp window along them.
For each position, we replaced the sub-sequence (51 bp) of the inconsistent consensus sequence with the corresponding sub-sequence of the consensus sequence that can activate the AN, and tested whether the result can activate the AN. We found the valid position and tried to find the latent variables that make it a mixture by clustering the sub-sequence-related feature maps or the one-hot code of the sub-sequence. Finally, we found that they were a mixture of different synonymous motifs rather than the same shifted motif. See Supplementary Information for details.

DCNN architecture optimization and deeper DCNN model construction

We tried to optimize the architecture alone, without using regularization methods. Following the strategy of small kernels and max-pooling sizes, the kernel size and max-pooling size were set to 3 and 2, respectively. We used ReLU as the activation function for each layer except the last fully-connected layer, which uses the sigmoid function. For Basset, we built a 5-convolutional-layer model BD-5 (kernel number and pooling operation: 32, pooling, 64, pooling, 128, pooling, 256, pooling, 512) and a 10-convolutional-layer model BD-10 (kernel number and pooling operation: 64, 64, pooling, 128, 128, pooling, 256, 256, pooling, 384, 384, pooling, 512, 512). After the last convolutional layer, two fully-connected layers with 1024 and 164 ANs were appended. The number of kernels was doubled relative to the previous layer because the receptive field size is doubled for the deeper AN. In a longer receptive field, more combinations of the motifs need to be represented. For DeepSEA, we built a 10-convolutional-layer model DD-10 (kernel number and pooling operation: 128, 128, pooling, 160, 160, pooling, 256, 320, pooling, 512, 640, pooling, 1024, 1280). After the last convolutional layer, two fully-connected layers with 925 and 919 ANs were appended. However, the prediction performance of DD-10 was similar to DeepSEA.
We found that the overlap of the first-layer receptive fields is very small for an AN of the second layer. If we set the kernel size to 3 in the first layer (receptive field size of 3 bp), then the overlap proportion of 3 adjacent ANs is 1/5 (receptive field size of 5 bp). We need a longer overlap, which can be obtained by extending the kernel size in the first layer. We tried to train DD-10 with the first-layer kernel size equal to 5 (overlap proportion: 3/7 ≈ 43%), 7 (overlap proportion: 5/9 ≈ 56%) and 9 (overlap proportion: 7/11 ≈ 64%). The best one is the model with a kernel size of 7 in the first layer. This result also matches the top convolution-pooling model in the ImageNet competition [25]. There seems to be a trade-off for the first kernel size. If it is too small, the structure is not good for training the second layer. Conversely, if it is too large, the structure is not good for training the first layer. Hence, we finally set the first-layer kernel size to 7 for the DD-10 and BD-10 models.

Prediction performance comparison

Five models were involved in this work: Basset, BD-5, BD-10, DeepSEA and DD-10. After they had been trained on the training set and the validation set, they were tested on the test set. For each prediction target, we calculated the Area Under the Precision-Recall Curve (AUPRC). We used AUPRC rather than the Area Under the Receiver Operating Characteristic curve (AUROC) because AUPRC is more sensitive to imbalanced data. In the DeepSEA and Basset datasets, negative samples far outnumber positive samples, so AUPRC is a better indicator.
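As an illustration of the metric, the sketch below computes average precision, one common step-wise estimator of the area under the precision-recall curve, for a single prediction target. It is an assumption that this estimator is comparable to the one used in this work; the text does not specify the exact AUPRC computation. The function name `average_precision` is hypothetical, and tied scores are ranked in input order.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for binary labels (1 = positive, 0 = negative)."""
    n_pos = sum(1 for y in y_true if y)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos
```

For instance, with labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the two positives are found at ranks 1 and 3, so the value is (1/1 + 2/3) / 2 = 5/6.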
To test the performance difference between models, we assume that if two models perform equally, the difference between their AUPRC values for the same prediction target follows $\Delta\mathrm{AUPRC} \sim N(0, \sigma^2)$. We performed a one-sided t-test for each pair of models.

Motif discovery

For each decoupled motif represented by the AN, we needed to filter and slice the motifs for regulatory elements. The decoupled motifs were generated from a sequence set. When the number of sequences is very small, the motif is not reliable. We first applied Laplace smoothing to the PPMs of the decoupled motifs. The smoothed PPM ($\mathbf{PPM}'$) is obtained by

$$\mathbf{PPM}' = \frac{\mathbf{PPM} \times N + [0.25]_{4 \times L} \times M}{N + M}$$

where $N$ is the number of sequences that generate the PPM, $[0.25]_{4 \times L}$ is the $4 \times L$ matrix with all elements equal to 0.25, and $M$ is the smoothing parameter. A larger $M$ means stronger smoothing. We set $M = 80$ in our work. We regarded a nucleotide position as part of a motif region if its information content is greater than 1. We extended these motif regions by 3 bp both upstream and downstream, and merged regions that overlapped. Regions longer than 8 bp were regarded as motifs. We sliced these regions out of the PPM as the final discovered motifs. A large portion of these motifs can be matched with motifs in the JASPAR database. We show a small portion of the motifs in Fig. 5d, drawn with the motifStack [31] package.

Motif syntax discovery and validation

We used the ANs of layer 10 in BD-10 and DD-10 for motif syntax discovery. We applied the decoupling algorithm twice for each AN and obtained 256 decoupled motifs. The decoupled motifs of the same AN usually shared similar shifted motifs. For convenience, we summarized the motifs by using Tomtom to match them to motifs in JASPAR. From the summarized TF motif set, we obtained the arrangement of these TF motifs. For instance, in Fig.
6a, the TF motif set includes CTCF and DDIT3:CEBPA, and the arrangement of these two motifs is CTCF-6N-DDIT3:CEBPA-[18-28N]-DDIT3:CEBPA-6N-CTCF, which is the motif syntax of the AN. Besides literature validation, such as the JASPAR database or published papers, we also used ATAC-seq data to validate the motif syntax. If the motif syntax is real in the genome, the regions matched by the motif syntax should interact with important molecules such as TFs. Thus, the Tn5 transposase cutting frequency in the aligned regions may show footprints. We collected five ATAC-seq datasets from five cell types or tissues: GM12878 [2], H1 [32], K562 [33], LNCaP [34] and prostate [35] (GSM1155957, GSM2264819, GSM2902637, GSM3632983, GSM3320984). We used the esATAC [36] package, developed by us, to preprocess the datasets. For an AN of interest, we used it to scan the test data of Basset or DeepSEA. We collected the top 3000 activated regions, extended them to 1000 bp and stacked their Tn5 cutting frequencies. We also counted the hard motif matching frequency at each position of these 1000-bp regions with motifmatchr [37].

Code and more related results

NeuronMotif code will be available at: https://github.com/wzthu/NeuronMotif
Related results will be exhibited at: https://wzthu.github.io/NeuronMotif

References

1 Searls, D. B. The language of genes. Nature 420, 211-217, doi:10.1038/nature01255 (2002).
2 Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213-1218, doi:10.1038/Nmeth.2688 (2013).
3 Eraslan, G., Avsec, Z., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 20, 389-403, doi:10.1038/s41576-019-0122-6 (2019).
4 Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods 12, 931-934, doi:10.1038/Nmeth.3547 (2015).
5 Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26, 990-999, doi:10.1101/gr.200535.115 (2016).
6 Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13, 613-626 (2012).
7 Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685 (2017).
8 Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831-838, doi:10.1038/nbt.3300 (2015).
9 Searls, D. B. The Linguistics of DNA. Am Sci 80, 579-591 (1992).
10 Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
11 Zou, J. et al. A primer on deep learning in genomics. Nature Genetics 51, 12-18 (2019).
12 He, Y., Shen, Z., Zhang, Q., Wang, S. & Huang, D.-S. A survey on deep learning in DNA/RNA motif mining. Briefings in Bioinformatics, doi:10.1093/bib/bbaa229 (2020).
13 Nguyen, A., Yosinski, J. & Clune, J. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks.
arXiv preprint arXiv:1602.03616 (2016).
14 Koo, P. K. & Eddy, S. R. Representation learning of genomic sequence motifs with convolutional neural networks. Plos Computational Biology 15, doi:10.1371/journal.pcbi.1007560 (2019).
15 Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548 e524, doi:10.1016/j.cell.2018.12.015 (2019).
16 Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 178, 91-106 e123, doi:10.1016/j.cell.2019.04.046 (2019).
17 Stormo, G. D. Introduction to protein-DNA interactions: structure, thermodynamics, and bioinformatics. (Cold Spring Harbor Laboratory Press, 2013).
18 Jolma, A. et al. DNA-Binding Specificities of Human Transcription Factors. Cell 152, 327-339, doi:10.1016/j.cell.2012.12.009 (2013).
19 Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
20 Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
21 Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 48, D87-D92 (2020).
22 Klemm, S. L., Shipony, Z. & Greenleaf, W. J. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet 20, 207-220, doi:10.1038/s41576-018-0089-8 (2019).
23 Schwalie, P. C. et al. Co-binding by YY1 identifies the transcriptionally active, highly conserved set of CTCF-bound regions in primate genomes.
Genome biology 14, doi:10.1186/gb-2013-14-12-r148 (2013).
24 Molla, M., Delcher, A., Sunyaev, S., Cantor, C. & Kasif, S. Triplet repeat length bias and variation in the human transcriptome. Proceedings of the National Academy of Sciences of the United States of America 106, 17095-17100, doi:10.1073/pnas.0907112106 (2009).
25 Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vision 115, 211-252, doi:10.1007/s11263-015-0816-y (2015).
26 Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome biology 8, doi:10.1186/gb-2007-8-2-r24 (2007).
27 Jolma, A. et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384-388, doi:10.1038/nature15518 (2015).
28 Pugacheva, E. M. et al. Comparative analyses of CTCF and BORIS occupancies uncover two distinct classes of CTCF binding genomic regions. Genome biology 16, doi:10.1186/s13059-015-0736-8 (2015).
29 Grabowska, M. M. et al. NFI Transcription Factors Interact with FOXA1 to Regulate Prostate-Specific Gene Expression. Mol Endocrinol 28, 949-964, doi:10.1210/me.2013-1213 (2014).
30 Stormo, G. D. & Zhao, Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet 11, 751-760, doi:10.1038/nrg2845 (2010).
31 Ou, J. H., Wolfe, S. A., Brodsky, M. H. & Zhu, L. H. J. motifStack for the analysis of transcription factor binding site evolution. Nature Methods 15, 8-9, doi:10.1038/nmeth.4555 (2018).
32 Liu, Q. et al. Genome-Wide Temporal Profiling of Transcriptome and Open Chromatin of Early Cardiomyocyte Differentiation Derived From hiPSCs and hESCs. Circ Res 121, 376-391, doi:10.1161/Circresaha.116.310456 (2017).
33 Calviello, A. K., Hirsekorn, A., Wurmus, R., Yusuf, D. & Ohler, U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome biology 20, doi:10.1186/s13059-019-1654-y (2019).
34 Zhang, Z. D. et al. Loss of CHD1 Promotes Heterogeneous Mechanisms of Resistance to AR-Targeted Therapy via Chromatin Dysregulation. Cancer Cell 37, 584-598 e511, doi:10.1016/j.ccell.2020.03.001 (2020).
35 Park, J. W. et al. Reprogramming normal human epithelial tissues to a common, lethal neuroendocrine cancer lineage. Science 362, 91-95, doi:10.1126/science.aat5749 (2018).
36 Wei, Z., Zhang, W., Fang, H., Li, Y. D. & Wang, X. W. esATAC: an easy-to-use systematic pipeline for ATAC-seq data analysis. Bioinformatics 34, 2664-2665, doi:10.1093/bioinformatics/bty141 (2018).
37 Schep, A. N., Wu, B. J., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nature Methods 14, 975-978, doi:10.1038/Nmeth.4401 (2017).

10_1101-2021_02_10_430619 ---- Cutevariant: a GUI-based desktop application to explore genetics variations

Sacha Schutz,1,2 Tristan Montier1 and Emmanuelle Genin2
1Univ Brest, CHRU Brest, Inserm, EFS, UMR 1078, GGB, 29200, Brest, France and 2Inserm, Univ Brest, EFS, UMR 1078, GGB, 29200, Brest, France
∗Corresponding author.
sacha@labsquare.org

Abstract

Cutevariant is a user-friendly GUI-based desktop application for genomic research designed to search for variations in DNA samples collected in annotated files and encoded in the Variant Calling Format. The application imports data into a local relational database from which complex filter queries can be built either from the intuitive GUI or using a Domain Specific Language (DSL). Cutevariant provides more features than any existing application without compromising on performance. The plugin-based architecture provides highly customizable features. Cutevariant is distributed as a multiplatform client-side software under an open source licence and is available at https://github.com/labsquare/Cutevariant. It has been designed from the beginning to be easily adopted by IT-agnostic end-users.

Key words: genomics, DNA variant, desktop application, Domain Specific Language, Graphic User Interface

Introduction

Next-Generation Sequencing (NGS) has opened new opportunities in genomic research such as the identification of DNA variations from Genome, Exome or Panel experiments. These data are delivered as files encoded in the standard Variant Calling Format (VCF version 4.0) [1], where the variations are listed together with the genotype information of different samples. Tools such as VEP [2] or SnpSift [3] can be used to add annotations such as genes or functional impact. Biologists can then filter out variants by applying customized criteria to these annotations. In medicine, the identification of mutations in rare diseases would be a typical use case. This filtering procedure calls for sophisticated software tools that can nevertheless be easily adopted by end-users who are not necessarily IT-aware. Several management systems have been developed to ease the filtering step.
GEMINI [4] and VariantTools [5] are command line applications where data from the VCF files are loaded into a relational database managed by SQLite [6]. Filtering can thus be made very efficient using the SQL query syntax. Other tools such as SnpSift [3] or BCFtools [7] apply filters directly while reading the VCF files line by line, thus avoiding the need to create an intermediate data structure. This comes at the cost of poor timing efficiency, especially when it is necessary to sort or group variants. While these tools are quite flexible and allow any kind of filtering, the command line interface is not very intuitive, which discourages non-IT-specialists from using them.

This called for the development of applications steered by user-friendly Graphical User Interfaces (GUI). Some, specializing in diagnostics, offer online solutions with a complete set of patient management features but require uploading the VCF files. The most popular of the kind are either proprietary software such as SeqOne [8] or software distributed under an open source licence such as the recently published VarFish [9]. A major drawback of this scheme comes from the transit of a large amount of genetic data through public networks, which raises confidentiality and performance issues on the one hand and requires, on the other hand, a dedicated server that might not be available to every end-user. Moreover, these solutions are tailored for human data and therefore cannot be adopted by all end-users.

GUI applications that do not require a server and offer an out-of-the-box solution are therefore preferable. The web-based applications VCFMiner [10], BrowseVCF [11] and VCF.Filter [12] implement such a solution. VCFMiner is distributed as a container running with Docker [13] and thus requires a customized desktop configuration. BrowseVCF provides its own launcher, making it quite user friendly, but the application is not supported anymore.
Both applications import the data from VCF files into an indexed database and provide different GUI forms to create filters. Their main drawback resides in the limited filter settings available through the GUI, complex filters requiring a domain specific language. In addition, web applications offer poor timing performance compared to native desktop applications. Despite the availability of these tools, many biologists still use Microsoft Excel to filter their variants and face severe problems [14].

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430619; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Fig. 1: Cutevariant database schema. Only mandatory fields are displayed. fields n are dynamically created during the import step based on the content of the VCF file.

To address the shortcomings of the existing applications, we have developed Cutevariant, a user-friendly and ergonomic desktop application implemented in Python within the Qt5 framework. It combines a GUI and a command line interface with a Domain Specific Language called VQL, which allows the user to build complex filter expressions. It is distributed as a multi-platform client-side software under an open source licence. Thanks to an architecture based on plugins, Cutevariant is fully customizable and can easily be extended with additional features.

Materials and methods

VCF file importation and preprocessing

Cutevariant imports data from VCF files into a normalized SQLite database (Figure 1) stored as a *.db file, optionally together with a PED file that describes affected samples and their relationships.
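To make the idea of a normalized variant database more concrete, the following minimal sketch builds a toy schema with Python's standard sqlite3 module. It is not Cutevariant's actual schema: the table and column names are assumptions loosely based on the SQL example given later in the paper (variants, annotations, sample_has_variant), the samples table and all inserted values are made up, and an in-memory database stands in for the *.db file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # Cutevariant itself writes a *.db file
conn.executescript("""
CREATE TABLE variants (
    id INTEGER PRIMARY KEY, chr TEXT, pos INTEGER, ref TEXT, alt TEXT
);
CREATE TABLE annotations (
    variant_id INTEGER REFERENCES variants(id),
    gene TEXT, impact TEXT, consequence TEXT
);
CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sample_has_variant (
    sample_id INTEGER REFERENCES samples(id),
    variant_id INTEGER REFERENCES variants(id),
    gt INTEGER
);
""")

# One made-up variant, carried by one made-up sample.
conn.execute("INSERT INTO variants VALUES (1, 'chr7', 100, 'G', 'T')")
conn.execute("INSERT INTO annotations VALUES (1, 'CFTR', 'HIGH', 'stop_gained')")
conn.execute("INSERT INTO samples VALUES (1, 'sample_A')")
conn.execute("INSERT INTO sample_has_variant VALUES (1, 1, 1)")

# Variants, annotations and genotypes live in separate tables and are joined on demand.
rows = conn.execute("""
    SELECT v.chr, v.pos, a.consequence, shv.gt
    FROM variants v
    LEFT JOIN annotations a ON a.variant_id = v.id
    INNER JOIN sample_has_variant shv
        ON shv.variant_id = v.id AND shv.sample_id = 1
    WHERE a.gene = 'CFTR' AND a.impact = 'HIGH'
""").fetchall()
print(rows)  # [('chr7', 100, 'stop_gained', 1)]
```

Keeping genotypes in a link table avoids one column per sample in the variants table, at the price of a join whenever sample-level fields are requested.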
Fields of the variants and annotations tables are dynamically created according to the content of the VCF file. The importation step uses a VCF parser to produce JSON-like arrays tailored for populating the SQLite database. It is based on a strategy design pattern, so that any format can be supported by subclassing an abstract Reader object. The available distribution supports raw VCF files and VCF files annotated with VEP or SnpEff following the ANN specifications [15]. Before importation into the database, data are cleaned and normalized following the same procedure as the VT norm [16] application: single lines of multi-allelic variants are split into multiple lines. Computed annotations, not present in the original file, are automatically created. For example, the count_var field contains the number of samples that carry the variant. It is thus possible to filter variants present in more than N samples by filtering on this column. This feature is similar to countVar() from the SnpSift [3] filter command. From the Cutevariant main window, the new project button starts a wizard and triggers the importation process. Depending on the size of the input, the importation and indexation process might take some time, but this has only a minimal impact on performance since this step is performed only once. Alternatively, VCF file import can be triggered from the command line using Cutevariant-cli. This offers experienced users the possibility to integrate the import process at the end of a pipeline.

User interface layout

The main view (Figure 2) of the Cutevariant GUI displays the list of variants together with their annotations. Several GUI controllers allow the user to update the view and display the list in different formats.
• fields editor: to show or hide selected annotations.
• filter editor: to build a nested list of conditional rules with OR/AND binary operators.
• variant info: to display in an organised way all annotations related to the currently selected variant.
• source editor: to manage different views and perform set operations (union, intersection, difference) and bed file intersections.
• word set: to manage lists of words used to generate simple filters, e.g., filter all variants belonging to a given gene list or a dbSNP list.

Most of these actions end up building a VQL query that can be checked in the VQL editor sub-window. The variants list can then be updated either with the controllers or by editing the VQL query directly.

Fig. 2: The Cutevariant main view showing the variants list sub-window (middle), some of the controller sub-windows (left; not all are displayed) and the VQL editor sub-window (bottom).

Variant Query Language (VQL)

To facilitate the composition of complex query filters, the application integrates a Domain Specific Language (DSL) named Variant Query Language (VQL). The syntax of VQL has been designed to look like a subset of the SQL language working on a virtual database schema. It makes use of the Python module textX [17], which provides several tools to define a grammar and create parsers with an Abstract Syntax Tree. VQL queries can be composed in the VQL editor sub-window. However, to avoid forcing users to learn the VQL language, a query can also be defined from the GUI using the controller sub-windows listed above. The VQL query is translated, through an intermediate JSON object, into a well-formatted SQL query and processed by the SQLite database manager.
As an example, the following VQL query:

    SELECT chr,pos,consequence,sample['NA1223'].gt
    FROM variants
    WHERE gene = 'CFTR' AND impact = 'HIGH'

is translated into the following SQL query:

    SELECT DISTINCT `variants`.`id`, `variants`.`chr`, `variants`.`pos`,
        `annotations`.`consequence`,
        `sample_NA1223`.`gt` AS "sample('NA1223').gt"
    FROM variants
    LEFT JOIN annotations ON annotations.variant_id = variants.id
    INNER JOIN sample_has_variant `sample_NA1223`
        ON `sample_NA1223`.variant_id = variants.id
        AND `sample_NA1223`.sample_id = 1
    WHERE (`annotations`.`gene` = 'CFTR' AND `annotations`.`impact` = 'HIGH')
    LIMIT 50 OFFSET 0

Filter expressions

Filter expressions are defined in the VQL WHERE clause. In the filter editor, they are displayed as a nested set of editable condition rules. Logical (AND/OR) and comparison (=, <, >, ≤, ≥, ≠, IN, NOT IN, IS NULL) operators are supported. Regular expression matching with the tilde operator (∼) and a special WORDSET keyword are included as well. This keyword allows the user to test whether a field belongs to a set of words defined beforehand. For instance, in VQL, to select all variants from a user-defined list of genes:

    CREATE SET genes ('gene.txt')
    SELECT * FROM variants WHERE gene IN WORDSET['genes']

Fig. 3: Abstract Syntax Tree (AST) of the VQL query SELECT chr,pos,consequence FROM variants WHERE gene='CFTR' AND impact='HIGH'. The AST is parsed into a Python object.

Group variants

The GROUP BY keyword allows the user to split the view into two panels: on the left, the list of groups, and on the right, the list of all variants belonging to the selected group.
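The two-panel grouping described above can be pictured as two SQL queries: one listing the groups and one listing the members of the selected group. The sketch below only illustrates this idea with the standard sqlite3 module; it is not Cutevariant's implementation, and it uses a simplified flat table with made-up rows, whereas the real schema keeps the gene in a separate annotations table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (chr TEXT, pos INTEGER, gene TEXT)")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [("chr7", 100, "CFTR"), ("chr7", 250, "CFTR"), ("chr17", 40, "BRCA1")],
)

# Left panel: one row per group, with the number of variants it contains.
groups = conn.execute(
    "SELECT gene, COUNT(*) FROM variants GROUP BY gene ORDER BY gene"
).fetchall()

# Right panel: the variants of whichever group the user selected.
selected_gene = "CFTR"
members = conn.execute(
    "SELECT chr, pos FROM variants WHERE gene = ? ORDER BY pos",
    (selected_gene,),
).fetchall()

print(groups)   # [('BRCA1', 1), ('CFTR', 2)]
print(members)  # [('chr7', 100), ('chr7', 250)]
```

A gene that appears in the left panel with two or more variants is the kind of group a user would open to inspect the individual variants.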
This feature makes exploration easier: for instance, grouping variants by gene helps to detect compound heterozygous variants.

Set operation

Just like Variant Tools, Cutevariant supports operations between variant sets. Each query result can be stored in a view using the CREATE VQL keyword or by clicking the corresponding GUI button. For instance, the following query creates a new view called new_view:

    CREATE new_view FROM variants WHERE gene='CFTR'

It is then possible to build a query directly from this view. The following query returns the same variants as the previous one:

    SELECT chr, pos FROM new_view

Each view behaves as a set with three operations available (difference, intersection, union), which compare variants on the chr, pos, ref and alt fields. The following queries show how to create a new view from the different set operations:

    # difference
    CREATE second_view = variants - new_view
    # union
    CREATE second_view = variants + new_view
    # intersection
    CREATE second_view = variants & new_view

Plugins architecture

The Cutevariant GUI architecture relies entirely on plugins, whose source is available in the plugins directory. A plugin consists of a module containing different Python files that implement a Plugin class with several overridden virtual methods. Adding or removing GUI controllers is therefore straightforward. In addition, as in Excel, cells of the variant view can be formatted conditionally. By subclassing the Formatter class, one can change the style of a cell with different colors, text or icons according to its value. For instance, impact fields with the value HIGH can be displayed with a red background to catch the user's attention. Currently, Cutevariant supports only one formatter: cuteStyle. Cutevariant allows the user to build a custom URL from a variant and open it from an external application.
This is used, for example, to open a web link to the dbSNP database or to show the BAM alignment in the IGV software at the corresponding variant location. With plugins, experienced users can customize Cutevariant with dedicated features or create new ones and share them with the user community.

Technical details and continuous integration

Cutevariant is a cross-platform application implemented in Python 3.7 using the Qt5 framework for the user interface (PySide2 ≥ 5.11). The VCF parser uses the PyVCF ≥ 0.6.8 library. The syntax and parser of the VQL language rely on the textX ≥ 1.8.0 library. SQLite3 is the database manager, interfaced through the Python standard library. The source code and documentation are available on GitHub [18]. Continuous integration runs on GitHub-CI and unit tests are written with the Pytest framework [19]. The application is distributed as Windows 32-bit and 64-bit packages. Cutevariant is also available as a Python package from the Python Package Index PyPI [20].

Results

In Table 1 we list the features available in Cutevariant compared to other available applications. The timing performance of Cutevariant for different actions is reported in Table 2 and compared to that of VCF-Miner, the fastest application we have evaluated. Cutevariant outperforms VCF-Miner except for 1KG.chr22.anno.vcf. The reason is the large number of samples in this file, which makes the join table between samples and variants costly to compute.

Table 1. Features available in various applications available on the market.
GUI applications: Cutevariant, BrowseVCF, VCF-Miner, VCF-Explorer, VCF-Server, VCF-Filters. Command line applications: GEMINI, Variant Tools, SnpSift.

| Features | Cutevariant | BrowseVCF | VCF-Miner | VCF-Explorer | VCF-Server | VCF-Filters | GEMINI | Variant Tools | SnpSift |
|---|---|---|---|---|---|---|---|---|---|
| process annotations | no | no | no | no | yes | no | yes | no | no |
| VEP parser | yes | yes | no | no | no | no | yes | no | no |
| SnpEff parser | yes | yes | no | no | no | no | yes | yes | yes |
| SQL like query | yes | no | no | no | no | no | yes | yes | yes |
| regular expressions | yes | no | no | no | no | no | no∗ | no∗ | yes |
| bed file intersection | yes | no | yes | no | no | yes | no | no | yes |
| set operations | yes | no | no | no | no | no | no | yes | yes |
| sorting | yes | yes | yes | no | yes | no | yes | yes | yes |
| intersect with wordset | yes | yes | no | no | no | no | no | no | yes |
| plugins extension | yes | no | no | no | no | no | no | no | no |
| indexed database | SQLite | Berkeley DB | MongoDB | raw file | MongoDB | raw file | SQLite | SQLite | raw file |
| data encryption | no∗∗ | no | no | no | yes | no | no | no | no |
| language | Py3/Qt | Py2/HTML | JS/HTML | C++/Qt | Node.js | Java | Py3 | Py3 | Java |
| pedigree file | yes | no | no | no | no | no | yes | yes | yes |
| application type | desktop | web | web | web | web | desktop | console | console | console |
| multi-users support | no | no | no | no | yes | no | no | no | |
| CSV/Excel export | yes | yes | yes | yes | yes | no | yes | yes | yes |

∗Support LIKE SQL expression
∗∗Possible with SQLITE encryption extension

Table 2. Comparison of time performance between Cutevariant and VCF-Miner for importation and query execution. The query used filters the variants with QUAL ≥ 30 and DEPTH ≥ 30. Executed on an Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz with 16 GB RAM.

| input file | 1KG.chr22.anno.vcf | | corpas.quartlet.vcf | | NA12878.vcf | |
|---|---|---|---|---|---|---|
| variant count | 494'328 | | 300'035 | | 3'775'119 | |
| sample count | 1092 | | 4 | | 1 | |
| software | cutevariant | VCF-miner | cutevariant | VCF-miner | cutevariant | VCF-miner |
| importation time | 6600s | 2940s | 78s | 183s | 810s | 2220s |
| query execution time* | ≈ 1s | ≈ 1s | 0.02s | ≈ 1s | 0.02s | ≈ 1s |

Use case 1: Sars-CoV-2 analysis

In the context of the Covid-19 pandemic, we have tested Cutevariant to identify mutations along the genome of the Sars-CoV-2 virus. For this, we downloaded from the ENA database a dataset (PRJNA673096) with 245 samples stored as FASTQ files, produced on the Illumina sequencing platform using an amplicon library.
The pipeline is available on GitHub [21]. The data originate from the US Delaware Public Health Laboratory. FASTQ files were aligned to the NC_045512.2 genome of Sars-CoV-2 with the BWA software [22]. Variants were called with the FreeBayes application [23] and all 245 samples were merged into a single VCF file annotated with SnpEff [24]. This file was imported into Cutevariant for exploration. We executed a VQL statement (Fig. 4) to extract variants within the gene S and sorted the result by the count_var annotation, which gives the total number of samples carrying the variant. Sorting is easily done by clicking on the corresponding header of the view. The mutation p.Asp614Gly (highlighted in Fig. 4) is found in 239 samples out of 245. This variant has already been described [25] as a dominant one emerging at the beginning of the pandemic. In the same way, by scrutinizing all the genes, we identified two other mutations, (ORF1ab)p.Thr265Ile and (ORF3a)p.Gln57His, which are exclusive to the North American population [26].

Fig. 4: Mutation found in gene S of Sars-CoV-2 by a Cutevariant analysis of 245 samples.

Use case 2: Cohort analysis

We have repeated with Cutevariant the analysis given as an example by SnpSift [27]. It is a cohort analysis of 17 individuals, 3 of whom are affected by a nonsense mutation in the CFTR gene (G542*). This analysis cannot be performed with any of the other graphical applications listed previously (Table 1). After importing the annotated VCF file and the corresponding PED file, we used Cutevariant to select variants with HIGH or MODERATE impact that are homozygous in the case samples but not in the control samples. SnpSift uses the following query:

    cat protocols/ex1.ann.cc.vcf \
      | java -jar SnpSift.jar filter \
        "(Cases[0] = 3) & (Controls[0] = 0) & ((ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE'))" \
      > protocols/ex1.filtered.vcf

The equivalent Cutevariant VQL query, which provides the same results, reads:

    SELECT chr, pos FROM variants
    WHERE case_count_hom = 3 AND control_count_hom = 0
    AND impact IN ('HIGH', 'MODERATE')

Discussion

Performance

Cutevariant is implemented with the open-source Qt for Python [28], which provides a set of Python bindings to build modern user interfaces. Instead of using native Qt/C++, we opted for Python because it is by far the most frequently used programming language in the bioinformatics community. This choice does not cause any significant performance degradation of the Cutevariant GUI. Queries performed on a complete genome with many filters can become particularly slow. This long execution time is primarily due to the SQL COUNT statement, which browses through all the variants to calculate their total number. The table JOIN statement is also time consuming. This is the consequence of the choice made for Cutevariant, unlike GEMINI, to store samples and a few annotations in separate tables to avoid table denormalization and to minimize disk space occupation.
This time penalty has been minimized, on the one hand, by using a memory cache so that identical VQL queries do not need to recalculate the count of variants and, on the other hand, by running queries asynchronously in dedicated threads, which keeps the GUI from freezing while a progress bar shows the loading status.

Web app vs Desktop app
Cutevariant is a serverless desktop application and therefore does not provide annotation or multi-user features. The annotation step must be carried out upstream, at the end of an analysis pipeline, with dedicated tools such as SnpSift or VEP. Multi-user capabilities allow users to share custom annotations and comments. For instance, a user marks a variant as pathogenic and this information is shared among all users. Although this feature is not supported by Cutevariant, it can be delegated to other tools such as MyVariant.info [29], which provides a database of variants with which Cutevariant can communicate through a REST API. These data can then be used as a source of annotation in the annotation step of the pipeline.

A general purpose and customizable tool
Cutevariant is a general-purpose tool for filtering variants and is fully customizable thanks to its plugin-based implementation; it thus offers features and modularity that are not available in existing applications. Since Cutevariant is not specific to the analysis of the human genome, it can be used with any VCF file, as we demonstrated here with the SARS-CoV-2 example. GUI options dedicated to specific tasks are not hard coded in the application but can easily be added to Cutevariant by creating new plugins. As an example of such added GUI options, the Trio Analysis plugin, selected from the Tools menu, allows users to build from the GUI a VQL filter that includes the transmission mode and the family tree.

Conclusion
Cutevariant is a new desktop application devoted to exploring genetic variations in VCF data produced by next-generation sequencing.
It is the first GUI software of its kind that integrates both a user-friendly graphical user interface and a domain-specific language. Starting from a low learning threshold, end users can easily perform complex filtering to identify variants of interest. Cutevariant is a standalone application that runs on standard desktop computers under Linux, macOS, or Windows. The Python-based plugin architecture makes the application easily extensible, thus offering the possibility of involving the bioinformatics community at large in the development of new features.

Acknowledgments
We would like to thank Lucas Bourneuf and Pierre Vignet for their contributions to the development.

Funding
This work has been supported by UBO, Université de Bretagne Occidentale, France.

Conflict of Interest: none declared

References
1. Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, and Richard Durbin. The variant call format and VCFtools. Bioinformatics, 27:2156–2158, 8 2011.
2. William McLaren, Laurent Gil, Sarah E. Hunt, Harpreet Singh Riat, Graham R.S. Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. The Ensembl Variant Effect Predictor. Genome Biology, 17:1–14, 6 2016.
3. Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J. Land, Xiangyi Lu, and Douglas M. Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Fly, 6:80–92, 2012.
4. Umadevi Paila, Brad A. Chapman, Rory Kirchner, and Aaron R. Quinlan. GEMINI: Integrative exploration of genetic variation and genome annotations. PLoS Computational Biology, 9, 7 2013.
5. Gao T. Wang, Bo Peng, and Suzanne M. Leal. Variant association tools for quality control and analysis of large-scale sequence and genotyping array data. American Journal of Human Genetics, 94:770–783, 5 2014.
6. Richard D Hipp. SQLite. https://www.sqlite.org/index.html, 2020.
7. Heng Li. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, 09 2011.
8. Anne-Sophie Lebre and Jean-Marc Rey. SeqOne. https://seq.one/, Jan 2021.
9. Manuel Holtgrewe, Oliver Stolpe, Mikko Nieminen, Stefan Mundlos, Alexej Knaus, Uwe Kornak, Dominik Seelow, Lara Segebrecht, Malte Spielmann, Björn Fischer-Zirnsak, Felix Boschann, Ute Scholl, Nadja Ehmke, and Dieter Beule. VarFish: comprehensive DNA variant analysis for diagnostics and research. Nucleic Acids Research, 48(W1):W162–W169, 04 2020.
10. Steven N. Hart, Patrick Duffy, Daniel J. Quest, Asif Hossain, Mike A Meiners, and Jean-Pierre Kocher. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Briefings in Bioinformatics, 17(2):346–351, 07 2015.
11. W James Kent et al. The human genome browser at UCSC. Genome Res., 12(6):996–1006, 06 2002.
12. Heiko Müller, Raul Jimenez-Heredia, Ana Krolo, Tatjana Hirschmugl, Jasmin Dmytrus, Kaan Boztug, and Christoph Bock. VCF.Filter: interactive prioritization of disease-linked genetic variants from sequencing data. Nucleic Acids Research, 45(W1):W567–W572, 05 2017.
13. Empowering app development for developers. https://www.docker.com/.
14. Mark Ziemann, Yotam Eren, and Assam El-Osta. Gene name errors are widespread in the scientific literature. Genome Biology, 17, 8 2016.
15.
Pablo Cingolani, Fiona Cunningham, Will Mclaren, and Kai Wang. Variant annotations in VCF format. http://www.ensembl.org/Help/Glossary?id=492.
16. Adrian Tan, Gonçalo R. Abecasis, and Hyun Min Kang. Unified representation of genetic variants. Bioinformatics, 31(13):2202–2204, 02 2015.
17. I. Dejanović, R. Vaderna, G. Milosavljević, and Ž. Vuković. TextX: A Python tool for Domain-Specific Languages implementation. Knowledge-Based Systems, 115:1–4, 1 2017.
18. Cutevariant. https://github.com/labsquare/cutevariant.
19. Pytest. https://docs.pytest.org/en/stable.
20. Python Package Index. https://pypi.org/.
21. GitHub repository of the SARS-CoV-2 NGS pipeline. https://github.com/dridk/Sars-CoV-2-NGS-pipeline.
22. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754–1760, 7 2009.
23. Erik Garrison and Gabor Marth. Haplotype-based variant detection from short-read sequencing. http://arxiv.org/abs/1207.3907, 7 2012.
24. P. Cingolani, A. Platts, M. Coon, T. Nguyen, L. Wang, S.J. Land, X. Lu, and D.M. Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2):80–92, 2012.
25. Bette Korber, Will M. Fischer, Sandrasegaram Gnanakaran, Hyejin Yoon, James Theiler, Werner Abfalterer, Nick Hengartner, Elena E. Giorgi, Tanmoy Bhattacharya, Brian Foley, Kathryn M. Hastie, Matthew D. Parker, David G. Partridge, Cariad M. Evans, Timothy M. Freeman, Thushan I. de Silva, Adrienne Angyal, Rebecca L. Brown, Laura Carrilero, Luke R. Green, Danielle C. Groves, Katie J. Johnson, Alexander J. Keeley, Benjamin B. Lindsey, Paul J. Parsons, Mohammad Raza, Sarah Rowland-Jones, Nikki Smith, Rachel M. Tucker, Dennis Wang, Matthew D. Wyles, Charlene McDanal, Lautaro G. Perez, Haili Tang, Alex Moon-Walker, Sean P. Whelan, Celia C. LaBranche, Erica O. Saphire, and David C. Montefiori.
Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell, 182:812–827.e19, 8 2020.
26. Xumin Ou, Zhishuang Yang, Dekang Zhu, Sai Mao, Mingshu Wang, Renyong Jia, Shun Chen, Mafeng Liu, Qiao Yang, Ying Wu, Xinxin Zhao, Shaqiu Zhang, Juan Huang, Qun Gao, Yunya Liu, Ling Zhang, Maikel Peppelenbosch, Qiuwei Pan, and Anchun Cheng. Tracing two causative SNPs reveals SARS-CoV-2 transmission in North America population. bioRxiv, page 2020.05.12.092056, 5 2020.
27. SnpEff usage example. https://pcingola.github.io/SnpEff/examples/.
28. The Qt Company. Qt for Python: The official Python bindings for Qt. https://www.qt.io/qt-for-python.
29. Variant annotation as a service. https://myvariant.info/.
30. Silvia Salatino and Varun Ramraj. BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files. Briefings in Bioinformatics, 18:774–779, 9 2017.
31. Steven N. Hart, Patrick Duffy, Daniel J. Quest, Asif Hossain, Mike A. Meiners, and Jean Pierre Kocher. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Briefings in Bioinformatics, 17:346–351, 3 2016.
32. Jianping Jiang, Jianlei Gu, Tingting Zhao, and Hui Lu. VCF-Server: A web-based visualization tool for high-throughput variant data mining and management. Molecular Genetics and Genomic Medicine, 7, 7 2019.
33. F. Anthony San Lucas, Gao Wang, Paul Scheet, and Bo Peng. Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. Bioinformatics, 28:421–422, 2 2012.
34. The Qt Company. Cross-platform software development for embedded and desktop. https://www.qt.io/.
35. Manuel Holtgrewe, Oliver Stolpe, Mikko Nieminen, Stefan Mundlos, Alexej Knaus, Uwe Kornak, Dominik Seelow, Lara Segebrecht, Malte Spielmann, Björn Fischer-Zirnsak, Felix Boschann, Ute Scholl, Nadja Ehmke, and Dieter Beule.
VarFish: comprehensive DNA variant analysis for diagnostics and research. Nucleic Acids Research, 48:W162–W169, 7 2020.
36. Damian Smedley, Julius O B Jacobsen, Marten Jager, Sebastian Köhler, Manuel Holtgrewe, Max Schubach, Enrico Siragusa, Tomasz Zemojtel, Orion J Buske, Nicole L Washington, William P Bone, Melissa A Haendel, and Peter N Robinson. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nature Protocols, 10:2004, 2015.
37. DNA sequencing. https://www.integragen.com/service-solutions/dna-sequencing, Oct 2020.
38. Adrian Tan, Gonçalo R. Abecasis, and Hyun Min Kang. Unified representation of genetic variants. Bioinformatics, 31:2202–2204, 7 2015.
Introduction Materials and methods VCF file importation and preprocessing User interface layout Variant Query Language (VQL) Filter expressions Group variants Set operation Plugins architectures Technical details and continuous integration Results Use case 1: Sars-CoV-2-Analysis Use case 2: Cohort analysis Discussion Performance Web app vs Desktop app A general purpose and customizable tool Conclusion Acknowledgments Funding

10_1101-2021_02_10_430623 ---- “Single-subject studies”-derived analyses unveil altered biomechanisms between very small cohorts: implications for rare diseases

Dillon Aberasturi1-3,i, Nima Pouladi2,6,i, Samir Rachid Zaim1-3, Colleen Kenost1-3,6, Joanne Berghout1-2,4, Walter W. Piegorsch1,3,5,*, Yves A. Lussier1-6,*
1Center for Biomedical Informatics and Biostatistics (CB2), 2Dept. of Medicine, 3Graduate Interdisciplinary Prog. in Statistics & Data Science, 4Ctr for Appl. Genetics and Genomic Medic., 5Bio5 Institute; University of Arizona, Tucson, AZ, USA. 6Dept of Biomedical Informatics; University of Utah, UT, USA
* To whom correspondence should be addressed; i These authors contributed equally

Abstract
Motivation: Identifying altered transcripts between very small human cohorts is particularly challenging and is compounded by the low accrual rate of human subjects in rare diseases or sub-stratified common disorders. Yet, single-subject studies (S3) can compare paired transcriptome samples drawn from the same patient under two conditions (e.g., treated vs pre-treatment) and suggest patient-specific responsive biomechanisms based on the overrepresentation of functionally defined gene sets.
These improve statistical power by: (i) reducing the total features tested and (ii) relaxing the requirement of within-cohort uniformity at the transcript level. We propose Inter-N-of-1, a novel method, to identify meaningful biomechanism differences between very small cohorts by using the effect size of “single-subject-study”-derived responsive biomechanisms.
Results: In each subject, Inter-N-of-1 requires applying the previously published S3-type N-of-1-pathways MixEnrich to two paired samples (e.g., diseased vs unaffected tissues) to determine patient-specific enriched gene sets: Odds Ratios (S3-OR) and S3-variance using Gene Ontology Biological Processes. To evaluate small cohorts, we calculated the precision and recall of Inter-N-of-1 and that of a control method (GLM+EGS) when comparing two cohorts of decreasing sizes (from 20 vs 20 to 2 vs 2) in a comprehensive six-parameter simulation and in a proof-of-concept clinical dataset. In simulations, the Inter-N-of-1 median precision and recall are > 90% and > 75% in cohorts of 3 vs 3 distinct subjects (regardless of the parameter values), whereas conventional methods outperform Inter-N-of-1 at sample sizes 9 vs 9 and larger. Similar results were obtained in the clinical proof-of-concept dataset.
Availability: R software is available at Lussierlab.net/BSSD.
Contact: Lussier.y@gmail.com, Piegorsch@math.arizona.edu

1 Introduction
Empirical evidence unveils a methodological gap when comparing transcriptomic differences in biomechanisms within very small human cohorts due to variations in heterogeneity, uncontrolled biology (age, gender, etc.), and diversity of environmental factors (nutrition, sleep, etc.) (Griggs, et al., 2009; Liu, et al., 2014; Schurch, et al., 2016; Soneson and Delorenzi, 2013). Paradoxically, rare diseases are common: 8% prevalence in the population (Elliott and Zurynski, 2015) and 26% of children who attend a disability clinic (Guillem, et al., 2008).
As timely and sizeable patient accrual for rare or micro-stratified diseases is prohibitive, there lies an opportunity for empowering clinical researchers with feasible statistical designs that enable smaller cohorts.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430623; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).

Table 1. Abbreviations
Abbreviation | Term
DEG | Differentially Expressed Gene
Inter-N-of-1 | “Responsive Pathway Effect Size”-based cross cohort comparison
EGS | Enriched Gene Set of responsive pathways between two conditions within a single-subject-study (e.g., cancer vs control tissue)
FET | Fisher’s Exact Test
FDR | False Discovery Rate
GEO | Gene Expression Omnibus
GLM+EGS | Generalized Linear Models with Enriched Gene Sets
GO-BP | Gene Ontology Biological Processes
GS | Gene Set (calculated from GO-BP)
ITS | Information Theory-Based Similarity between GO-BPs
log2FC | Log base 2 transformation of the transcripts fold-change
MLE | Maximum Likelihood Estimator
OR, S3-OR | Odds Ratio: S3-prioritized transcripts enriched in GO-BP
PCA | Principal Component Analysis
PIK3CA | Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha gene; HGNC:8975
RSEM | RNA-Seq normalization by Expectation Maximization
S3 | Single-Subject Studies
TP53 | Tumor protein p53 gene; HGNC:11998

On the other hand, well-controlled isogenic studies (e.g., cellular models) can yield differentially expressed genes (DEGs) between two small samples. We and others have applied the power of the isogenic framework through the comparison of two sample transcriptomes from one subject in single-subject studies (S3). While the identification of transcript-level differences between two samples remains inaccurate (Vitali, et al., 2017; Zaim, et al., 2019), gene set-level (pathway/biosystem) S3 have been shown to accurately discover altered biomechanisms from paired transcriptome samples drawn from the same patient under two conditions (e.g., tumor-normal, treated-untreated) (Ozturk, et al., 2018; Vitali, et al., 2017). The results of the S3 gene set analyses have been validated in various contexts such as cellular/tissular models (Balli, et al., 2019; Gardeux, et al., 2014; Gardeux, et al., 2015), retrospectively in predicting cancer survival (Li, et al., 2017; Schissler, et al., 2015; Schissler, et al., 2018), circulating tumor cells (Schissler, et al., 2016), biomarker discovery simulations (Zaim, et al., 2018), and therapeutic response (Li, et al., 2017).
Despite the success of these models in deriving effect sizes and statistical significance in single-subject studies of transcriptomes, these samples are isogenic or quasi-isogenic, and thus do not necessarily generalize to a group of subjects (cohort-level signal). To address the latter, we reported that determining single cohort-level significance by combining gene set signal (e.g., pathways) from S3 analyses can be more accurate than conventional DEG analyses followed by gene set enrichment analysis (GSEA) (Subramanian, et al., 2005) in small cohort simulations (Zaim, et al., 2018) and in previously published datasets (Li, et al., 2017). However, these methods still used simplistic cohort-level assumptions of centrality (median) and did not explore comparing signal divergence between two cohorts.

To address the methodological gap, we therefore hypothesized that single-subject transcriptomic studies of gene sets increase the transcriptomic signal-to-noise ratio within subject and lead to an improved signal between small patient cohorts, as small as 3 vs 3 subjects per group. While technically different from the analysis of the standard two-factor interactions in conventional cohort statistics, the proposed framework is conceptually related to a statistical interaction in that a within-single-subject analysis (subject-specific transcriptome dynamics) is followed by within-group agreement for characterizing Factor 1 (e.g., cancer vs paired normal tissue) and between-group comparisons (Factor 2; e.g., responsive vs unresponsive to therapy). The strategy improves the statistical power by: (i) reducing the total features tested (gene set-level rather than transcript-level), (ii) relaxing the requirement of within-cohort uniformity at the transcript level as the coordination is conducted at the gene set-level, and (iii) reducing confounding factors through the paired sample design of S3 analyses within subject.
The novel bioinformatic method identifies meaningful biomechanism differences between very small cohorts by using single-subject-study-derived effect sizes for gene sets. Additionally, we show through both a simulation and a real data case example that within cohorts of varying sizes (3 to 7 subjects) this method outperforms traditional methods, which are based on generalized linear modeling followed by common gene set enrichment or overlap analysis. We then apply this novel method to the effect sizes of two different single-subject analyses to illustrate the flexibility and utility of the proposed method for a variety of inputs.

2 Methods

Fig. 1. Overview of the gene set analyses (Inter-N-of-1) that leverage effect sizes and variances from single-subject studies to conduct subsequent group comparisons. Single-subject studies details are provided in Figure 2.

Table 1 defines abbreviations and Figure 1 provides an overview of the proposed new method (Inter-N-of-1). To motivate the development of transcriptome analytics between very small human samples, which are heterogeneous by nature, we first demonstrate the limitation of a Generalized Linear Model in identifying DEGs between 23 TP53 and 19 PIK3CA breast cancer samples. Next, we describe two new methods, Inter-N-of-1 (MixEnrich) and Inter-N-of-1 (NOISeq), and compare them to a Generalized Linear Model (implemented in limma) (i) in simulation studies with parameters estimated from empirical analyses of real datasets and (ii) in a proof-of-concept study of breast cancer subjects. The evaluation of the proposed new methods is also conservative, as it is conducted against a reference standard built with a distinct Generalized Linear Model (edgeR) using all samples.

2.1 Datasets
We obtained 5,179 gene sets from Gene Ontology Biological Processes (GO-BP) (downloaded on 02/07/2019). To determine realistic simulation parameters, we used two datasets (I and II) that are composed of paired samples. (I) We downloaded 7 estrogen-stimulated and 7 unstimulated MCF7 breast cancer cell sample replicates provided by Liu, et al. (2014) from the Gene Expression Omnibus (GEO) (Edgar, et al., 2002) on 10/14/2020. The sequences within the Sequence Read Archive files for the 30M reads of MCF7 cells were aligned using hg19 as the reference genome, and the resulting RNA-seq counts were processed into FPKM units (Fragments Per Kilobase of transcript per Million mapped reads). (II) We obtained 224 samples of paired breast cancer tumor and tissue-matched normal RNA-seq expression profiles (Factor 1) from the same subjects (n = 112) from The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma data collection (Cancer Genome Atlas, 2012; Ciriello, et al., 2015) using the Genomic Data Commons tools (Grossman, et al., 2016) (obtained 10/22/2015). As a proof-of-concept application of the proposed methods, we sampled small groups of subjects from a subset of the TCGA breast cancer dataset comprising subjects with somatic (tumor) mutations in either TP53 (n = 23) or PIK3CA (n = 19), but not both. TP53 and PIK3CA (Factor 2) have been reported as the two most commonly mutated genes observed in breast cancer and differ in (i) expression patterns (Cancer Genome Atlas, 2012), (ii) cancer subtypes (Van Keymeulen, et al., 2015), (iii) clinical outcomes (Kim, et al., 2017), and (iv) responsiveness to specific therapies (Andre, et al., 2019). These data were downloaded using the R package TCGA2STAT (n = 42 cases; 84 files) (Wan, et al., 2016).
Data access and preparation: (A) For the single-subject studies, we applied a three-stage filtering of the transcripts: (i) within each sample pair, we removed all transcripts with mean expression less than 5 counts; (ii) we found the union of all genes remaining amongst all pairs; and (iii) we excluded all genes not present in that union (17,923 genes remaining). We added 1 to expression counts to eliminate “zeros”. (B) For the generalized linear model-based analyses, we applied a different filtering process to the raw data where we eliminated all the transcripts with 0 counts for each subject and then calculated the coefficient of variation (CV) for each transcript. We selected the transcripts with CVs within the top 70 percentile of those remaining (13,932 genes remaining).

2.2 Proposed S3-anchored Responsive Pathway Effect Size Methods for comparing very small human cohorts
The following paragraphs develop the methodology by which we conduct single-subject studies prior to cross-cohort comparison to discover the effect size of responsive pathways in each subject and increase the signal-to-noise ratio of the features. Table 2 summarizes the variables.

Identification of overrepresented gene sets for each subject: As illustrated in Panel A of Figure 2, we applied to each of the tumor-normal pairs the N-of-1-pathways MixEnrich method that we had previously developed and validated (Berghout, et al., 2018; Li, et al., 2017; Zaim, et al., 2019). Briefly, this method models the absolute value of the log2 transformed fold change (FC) for each gene across the two paired transcriptomes being studied and uses a probabilistic Gaussian mixture to assign a posterior probability that the gene is differentially expressed between tumor and normal conditions. Within the simulation, prioritized transcripts were defined as those with a posterior probability of being differentially expressed higher than 0.99.
Within the TCGA breast cancer dataset, said definition included having both a posterior probability of being differentially expressed higher than 0.99 and an absolute-valued log2FC higher than log2(1.2). Genes were assigned to gene sets using the Gene Ontology (Ashburner, et al., 2000) Biological Process (GO-BP) hierarchy, filtered to those terms with gene set size between 15 and 500 genes, with subsumption to maximize interpretability. These DEGs were used to determine the overrepresented, or enriched, gene sets of interest using a two-sided Fisher’s Exact Test (FET) (Fisher, 1935) with a False Discovery Rate (FDR) of 5%. The output of this analysis generated lists of gene sets, with each list representing a single subject’s tumor-normal pair and comprising GO-BP terms accompanied by contingency table counts, which were used to calculate an odds ratio (S3-OR) as the effect size.

Table 2. Variable Definitions
Variable | Definition
$g_{gs,k_j}$ | The number of DEGs within gene set gs for subject $k_j$ in cohort K
$g'_{gs,k_j}$ | The number of genes NOT differentially expressed in gene set gs for subject $k_j$ in cohort K
$h_{gs,k_j}$ | The number of DEGs NOT in gene set gs for subject $k_j$ in cohort K
$h'_{gs,k_j}$ | Number of genes neither differentially expressed nor in gene set gs for subject $k_j$ in cohort K
$N$ | Number of gene sets
$P(\cdot)$ | Probability of Event $(\cdot)$ occurring
$p_{gs,\Delta}$ | p-value for gene set gs produced by the Inter-N-of-1
$Q_{gs,k_j}$ | Continuity-corrected log S3-OR corresponding to gene set gs for subject $k_j$ in cohort K
$\bar{Q}_{gs,K}$ | The mean continuity-corrected log S3-OR of gene set gs across the subjects of cohort K
$S_K$ | The number of subjects in a cohort K (e.g., those with a PIK3CA or with a TP53 somatic mutation)
$\theta_K$ | Expected value of the continuity-corrected log S3-OR for the molecular-defined cohort K
$\mathrm{var}(Q_{gs,k_j})$ | Variance associated with the continuity-corrected log S3-OR corresponding to gene set gs for subject $k_j$ in cohort K
$W_{gs,\Delta}$ | The test statistic for the Inter-N-of-1 for gene set gs
$Z$ | A standard normal random variable

We also applied NOISeq to each of the tumor-normal pairs (Tarazona, et al., 2015) as shown in Panel B of Figure 2. For these applications of NOISeq with no replicates, the “pnr” and “v” parameters were set to 0.0002 and 0.00002 to prevent the method from producing any errors related to setting the size of the inherent multinomial distributions to an integer too large for R to handle. The criteria for identifying genes as differentially expressed for NOISeq were the same as those used for N-of-1-MixEnrich. As shown in Panel C of Figure 2, we subsequently used this information to construct contingency tables and calculate the natural log odds ratio for Inter-N-of-1. This process generated two different applications of Inter-N-of-1, N-of-1-MixEnrich and NOISeq, to conduct the single-subject analyses preceding the cohort comparison.

Comparing Enriched Gene Sets across Distinct Cohorts: We first combined the data within two distinct cohorts into single statistics whose null reference distributions were at least approximately normal. These within-cohort statistics were contrasted via scaled subtraction in a manner reminiscent of the two-sample t-test to establish the difference in gene set enrichment between the two cohorts.
Let $gs \in \{1,\ldots,N\}$ index the specific gene set being studied, where $N$ is the total number of gene sets; let $k_j$ index a specific subject in cohort $K$ composed of $S_K$ individuals, with subjects numbered $j \in \{1,\ldots,S_K\}$; and let $K \in \{A,B\}$ index a specific cohort. Let $\Delta$ signify quantities relating to the difference between the two cohorts. The Inter-N-of-1 analytics for combining information within a cohort considers the abstract contingency table shown as Table 3, where the cell counts are representative for the gene set indexed by $gs$ and the subject indexed by $k_j$.

We obtain DEGs from the application of a chosen single-subject analysis method (either N-of-1-MixEnrich or N-of-1-NOISeq) for a specific gene set $gs$ in individual $k_j$ of cohort $K$ to fill out the contingency table with counts in the format shown in Table 3.

Table 3: Notation for 2 x 2 Contingency Table Cross-classifying DEG Status with Gene Set Status
 | DEG | Not DEG
Gene set gs | $g_{gs,k_j}$ | $g'_{gs,k_j}$
Not Gene set gs | $h_{gs,k_j}$ | $h'_{gs,k_j}$

Fig. 2. Overview of two single-subject study methods conducted from one sample per condition without replicate generating effect sizes and variance for each gene set. We apply single-subject studies to each subject to identify either prioritized transcripts (Panel A) or DEGs (Panel B) between paired tumor-normal samples. We identify patient-specific enriched gene sets and associated effect sizes in the form of natural log odds ratios through a FET (Panel C). Each effect size is approximately normally distributed with known variance and mean, simplifying subsequent analyses between cohorts. The gene set-level variance enables the extraction of more information from each individual subject than typical variance estimators that work across subjects and thereby leads to increased statistical power. The N-of-1-MixEnrich method was previously described and validated (Berghout, et al., 2018; Li, et al., 2017; Zaim, et al., 2018). NOISeq is also considered as an alternative meriting evaluation because of its performance in prior single-subject studies evaluations (Zaim, et al., 2019).

We apply a continuity correction by adding 0.5 to each of the cells in the contingency table to provide a small-sample adjustment in the odds ratio (Agresti and Kateri, 2011). The natural log S3-OR, denoted as $Q_{gs,k_j}$, Equation (5), is approximately normally distributed with variance $\mathrm{var}(Q_{gs,k_j})$ given in Equation (6) (Woolf, 1955).

$$Q_{gs,k_j} = \ln\left[\frac{\left(g_{gs,k_j}+\tfrac{1}{2}\right)\left(h'_{gs,k_j}+\tfrac{1}{2}\right)}{\left(h_{gs,k_j}+\tfrac{1}{2}\right)\left(g'_{gs,k_j}+\tfrac{1}{2}\right)}\right] \qquad (5)$$

$$\mathrm{var}\left(Q_{gs,k_j}\right) = \frac{1}{g_{gs,k_j}+\tfrac{1}{2}} + \frac{1}{g'_{gs,k_j}+\tfrac{1}{2}} + \frac{1}{h_{gs,k_j}+\tfrac{1}{2}} + \frac{1}{h'_{gs,k_j}+\tfrac{1}{2}} \qquad (6)$$

We average the $Q_{gs,k_j}$ values within their respective cohorts to obtain the average ln ORs

$$\bar{Q}_{gs,K} = \frac{1}{S_K}\sum_{j=1}^{S_K} Q_{gs,k_j} \sim N\left(\theta_K,\ \sum_{j=1}^{S_K}\frac{\mathrm{var}\left(Q_{gs,k_j}\right)}{S_K^2}\right) \qquad (7)$$

When the null hypothesis $H_0: \theta_A = E[\ln(\mathrm{OR}_A)] = E[\ln(\mathrm{OR}_B)] = \theta_B$ is true, then

$$W_{gs,\Delta} = \frac{\bar{Q}_{gs,A} - \bar{Q}_{gs,B}}{\sqrt{\mathrm{var}\left[\bar{Q}_{gs,A}\right] + \mathrm{var}\left[\bar{Q}_{gs,B}\right]}} \sim N(0,1) \qquad (8)$$

at least approximately. The corresponding two-sided p-value for gene set $gs$ is

$$p_{gs,\Delta} = 2 \cdot P\left[Z > \left|W_{gs,\Delta}\right|\right] \qquad (9)$$

where $Z$ represents a standard normal random variable. An FDR adjustment via the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) is then applied to the $p_{gs,\Delta}$ across all the GO terms tested in the particular application. To ensure that the method positively identifies gene sets that are enriched in at least one of the cohorts, we set all FDR-adjusted p-values to 1.0 if both cohort means of the log odds ratios are negative.
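Equations (5) to (9), the Benjamini-Hochberg adjustment, and the sign rule can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' R implementation: the function names and the dictionary-based input are hypothetical, and each subject is represented by a tuple of counts (g, g', h, h') laid out as in Table 3.

```python
import math


def log_odds_ratio(g, g_not, h, h_not):
    """Continuity-corrected ln(OR) and its variance, Equations (5) and (6)."""
    a, b, c, d = g + 0.5, g_not + 0.5, h + 0.5, h_not + 0.5
    q = math.log((a * d) / (c * b))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return q, var


def cohort_mean(tables):
    """Mean ln(OR) of one cohort and the variance of that mean, Equation (7)."""
    stats = [log_odds_ratio(*t) for t in tables]
    n = len(stats)
    mean_q = sum(q for q, _ in stats) / n
    var_mean = sum(v for _, v in stats) / n ** 2
    return mean_q, var_mean


def inter_n_of_1(tables_a, tables_b):
    """Test statistic W and two-sided p-value, Equations (8) and (9)."""
    q_a, v_a = cohort_mean(tables_a)
    q_b, v_b = cohort_mean(tables_b)
    w = (q_a - q_b) / math.sqrt(v_a + v_b)
    p = math.erfc(abs(w) / math.sqrt(2))  # equals 2 * P(Z > |w|)
    return w, p, q_a, q_b


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def compare_cohorts(gene_sets):
    """gene_sets maps a name to (tables_a, tables_b); returns adjusted p-values."""
    names = list(gene_sets)
    results = [inter_n_of_1(*gene_sets[name]) for name in names]
    adjusted = benjamini_hochberg([r[1] for r in results])
    out = {}
    for name, (_, _, q_a, q_b), adj in zip(names, results, adjusted):
        # Sign rule: gene sets depleted in both cohorts get an adjusted p of 1.0.
        out[name] = 1.0 if (q_a < 0 and q_b < 0) else adj
    return out
```

Because each log odds ratio carries its own variance from Equation (6), the test needs no across-subject variance estimate, which is what allows it to run on cohorts as small as 2 vs 2.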
Setting the adjusted p-values to 1.0 in this case ensures interpretable results, since impoverished GO terms with significantly fewer-than-expected DEGs are not well understood in a biological context.

2.3 Description of the Generalized Linear Models and application of Inter-N-of-1 methods for small cohort comparison and their evaluation in the Breast Cancer Data

Table 4. Three experimental designs used for the generalized linear models. In the analysis of subsets of the TCGA Breast Cancer data, genes were declared differentially expressed if their abs(log2FC) > log2(1.2) and their FDR-adjusted p-value < 0.05. Within the simulation, genes were declared differentially expressed if their FDR-adjusted p-values < 0.05.

| Name | Level | What is compared | Results |
|---|---|---|---|
| Simple | Transcript | TP53_Tumoral - PIK3CA_Tumoral | Fig. 3 Panel A |
| Interaction | Transcript | (TP53_Tumoral - TP53_Normal) - (PIK3CA_Tumoral - PIK3CA_Normal) | Fig. 3 Panel B |
| GLM+EGS | Gene set | 1) Find DEGs using Interaction Contrast; 2) Enrichment via FET | Fig. 4 - 5 |

Generalized Linear Model (GLM) Designs: For the cohort analyses, we applied a generalized linear model as implemented in limma (Smyth, et al., 2005). Before applying the generalized linear model, we performed trimmed mean of M values (TMM) normalization (Robinson and Oshlack, 2010) on the data pre-processed for cohort analysis. We applied the voom normalization (Law, et al., 2014) via the limma function voomWithQualityWeights in R. We used the three different designs described in Table 4 for these generalized linear model-based analyses, which were called the simple design, the interaction design, and GLM+EGS, respectively. We blocked by subject for each of these GLM designs, and all FDR adjustments of p-values were done using the Benjamini-Hochberg False Discovery Rate (FDR) method (Benjamini and Hochberg, 1995).
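The Benjamini-Hochberg adjustment used throughout, together with the rule from Section 2.2 of setting adjusted p-values to 1.0 when both cohort means of the log odds ratios are negative, can be sketched as follows. This is a plain-Python illustration of the standard step-up procedure, not the R code used in the study; the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def mask_non_enriched(adjusted, mean_log_or_a, mean_log_or_b):
    """Set the adjusted p-value to 1.0 when both cohort mean ln ORs are negative."""
    return [
        1.0 if qa < 0 and qb < 0 else p
        for p, qa, qb in zip(adjusted, mean_log_or_a, mean_log_or_b)
    ]

# Toy example with four hypothetical gene sets.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
masked = mask_non_enriched(adj, [-1.0, 0.5, 0.2, 0.3], [-0.2, -0.3, 0.1, 0.4])
```

The masking step is applied after the adjustment, so it changes which gene sets are reported without altering the adjusted values of the remaining gene sets.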
Reference standard construction of enriched pathways using edgeR Generalized Linear Model followed by Gene Set enrichment: After pre-processing for cohort analyses, we applied generalized linear models as implemented in the R software package edgeR (Robinson, et al., 2010) at FDR < 5% to the entire TCGA breast cancer dataset to construct three reference standards corresponding to the three designs described in Table 4. Each reference standard was used to evaluate the analyses of the TCGA breast cancer cohorts (TP53 vs PIK3CA) and used the same filter thresholds for classifying transcripts as differentially expressed. In the GLM followed by enrichment of gene sets (GLM+EGS), the prioritized interacting transcripts are followed by a FET at FDR < 5%.

Subsampling of the TCGA Breast Cancer Cohort and application of GLM and Inter-N-of-1 methods: For each of the values $S_A = S_B = S \in \{2,3,4,5,7,8,9\}$, we ran 100 subsamples of the total cohorts in which we randomly selected without replacement $S$ subjects with TP53 and $S$ subjects with PIK3CA, without requiring non-redundancy of the random samplings. We applied the GLM+EGS method and the N-of-1-MixEnrich and NOISeq versions of the Inter-N-of-1 method to each of the selected cohorts (TP53 vs PIK3CA). For each of the three methods, the FDR adjustment of the p-values (threshold FDR < 5%) was done with respect to all 5,179 GO terms tested. For random subsamples of size $S_A = S_B = S \in \{2,3,4,\dots,19\}$ subjects, we applied the two transcript-level analyses using generalized linear models as implemented in limma.
The performance of these transcript-level applications of limma was assessed and is illustrated in Figure 3 to demonstrate the necessity and benefit of transforming from transcript-level to gene set-level analyses.

Accuracy measures within TCGA breast cancer dataset: For each method, we calculated the precision and recall as follows. When a method produced no positive predictions for the gene sets, we assigned values of zero to the precision and recall of the given method. Otherwise, we calculated the precision and recall using Powers' calculations with adjustments of adding 0.5 to numerators and 1.0 to denominators to avoid divisions by zero (Powers, 2020). In addition, we have previously published extensions to conventional accuracy scores that we termed "Similarity Venn Diagrams" and "Similarity Contingency Tables" (Gardeux, et al., 2015). In these approaches, identical as well as highly similar GO-BP terms between the prediction set and the reference standard count as true positive results. We calculated the precision and recall of the gene set-level analyses using Information Theoretic Similarity (ITS) (Tao, et al., 2007). For precision, we included in the intersection those predicted GO-BPs which had an ITS similarity of 0.70 or higher with any of the GO terms in the reference standards, while the denominator remained all predicted GO-BPs. Similarly, for recall we included in the intersection the reference standard GO-BPs which had an ITS similarity score of 0.70 or higher with any of the predicted GO terms, while the denominator remained the total positive reference standard GO-BP terms. Of note, we previously reported that this ITS > 0.70 similarity criterion is highly conservative, since only ~0.56% of pairs of GO-BP terms are similar at ITS > 0.7 (58,577 pairs among 10,458,756 non-identical combinations of GO-BPs) (Gardeux, et al., 2015).
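The similarity-based precision and recall described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that the 0.5 and 1.0 adjustments are applied to the similarity-based counts, and it takes the ITS measure as a user-supplied function `similarity(term_a, term_b)`, which is a hypothetical interface. The GO identifiers in the toy example are placeholders.

```python
def similarity_precision_recall(predicted, reference, similarity, threshold=0.70):
    """Precision and recall where identical or highly similar terms count as hits.

    predicted, reference: iterables of gene set identifiers
    similarity: function (term_a, term_b) -> score in [0, 1]
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted:
        # No positive predictions: both scores are set to zero.
        return 0.0, 0.0

    def has_match(term, others):
        return any(term == o or similarity(term, o) >= threshold for o in others)

    hits_predicted = sum(has_match(t, reference) for t in predicted)
    hits_reference = sum(has_match(t, predicted) for t in reference)
    # Adjustment of +0.5 to numerators and +1.0 to denominators.
    precision = (hits_predicted + 0.5) / (len(predicted) + 1.0)
    recall = (hits_reference + 0.5) / (len(reference) + 1.0)
    return precision, recall

# Toy similarity table (placeholder identifiers and scores).
toy_scores = {frozenset(("GO:A", "GO:B")): 0.8}

def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)

precision, recall = similarity_precision_recall(
    ["GO:A", "GO:C"], ["GO:B", "GO:C", "GO:D"], toy_similarity
)
```

Here GO:A is counted as a hit because it is similar to GO:B, and GO:C because it is identical to a reference term, which gives a precision of 2.5/3 and a recall of 2.5/4 in this toy case.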
2.4 Simulation of small cohort comparisons to compare GLMs to Inter-N-of-1 methods

Data generation for Simulation: The overall scheme for the simulation began by constructing two cohorts of paired tumor-normal RNA-seq expression profiles. We calculated simulation parameters to create these expression values as realistically as possible, as described below (Table 5). To calculate statistical interactions between two factors, we had to design two cohorts of subjects, with two sample conditions for each subject. We sought to recreate the TCGA Breast cancer conditions with these parameters, using the observed median values in the TCGA dataset as the medians of the simulation parameters and varying the parameters around said medians. The TCGA dataset did not comprise repeated samples in the same condition, and thus we utilized the unstimulated MCF7 cell lines with seven replicates to estimate the variation expected between two paired normal tissues. In our previous pathway expression studies ((Yang, et al., 2012) and data not shown) where we compared two cohorts, about two-thirds of the observed responsive gene set patterns (as shown in Figure 2) consisted of a gene set responsive in one subject cohort and unresponsive in the other cohort.

Table 5. Simulation Parameter Values. Only the balanced cohort size and the proportion of subjects with coordinated DEGs were varied. All other parameters were held constant. 30 datasets were generated for each parameter configuration, leading to a total of 540 datasets.

| Parameters | How Estimated | Values |
|---|---|---|
| Control Samples | Randomly sample without replacement from TCGA breast cancer normal samples | NA |
| log2FC distribution of non-differentially expressed genes | 1) Calculate log2FCs of randomly paired MCF7 unstimulated breast cancer samples; 2) Split log2FCs into deciles by baseline expression (a: all deciles containing 0 are combined into one category); 3) Sample with replacement from decile containing transcript name in first random pair | NA |
| Gamma parameters of log2FCs of DEGs | 1) Run N-of-1-MixEnrich (Fig. 2) on within-subject tumor-normal pairs in TP53 and PIK3CA cohorts to identify DEGs; 2) MLEs for gamma parameters fit to absolute log2FCs of DEGs (a: used egamma function in EnvStats R package (Millard, et al., 2020)) | Scale parameter = 6.06; Shape parameter = 0.55 |
| Proportion of DEGs in enriched GO-BPs | 1) Split enriched GO terms from edgeR reference standard into deciles based on size; 2) Calculated DEGs median proportion for deciles containing GO-BPs (size: 47, 200) | (GO size 200): 0.10; (GO size 40): 0.19 |
| Proportion of Subjects with coordinated DEGs | 1) Split log2FCs of DEGs within edgeR reference standard into categories a) > 1.3, b) < -1.3, or c) neither; 2) Assign the maximum proportion of subjects per categories (a) or (b) for each transcript; 3) Find the median proportion of subjects across all transcripts | 0.25, 0.48, 0.75 |
| Balanced Cohort Size | NA | 2, 3, 7, 10, 20, 30 |
| GO-BP terms | 1) Enriched: GO:0002221 (200 genes); 2) Enriched: GO:0000096 (47 genes); 3) Control: GO:0006733 (196 genes); 4) Control: GO:0090184 (41 genes) | NA |

These paired tumor-normal samples, which represented within-subject samples, were constructed to have a proportion of the transcripts with altered expression between the tumor and normal states. By random sampling without replacement, we generated the normal tissue samples for these pairs after filtering out all genes in the 112 TCGA breast cancer normal tissues that were not present within the MCF7 breast cancer dataset (leaving 17,414 genes). For each sampled normal breast tissue sample, we generated transcript
expression for a paired breast cancer sample of that subject rather than sampling the corresponding breast cancer sample from the TCGA data. To produce a paired tumor expression value for a non-differentially expressed gene, we first followed the steps outlined in Table 5 to randomly generate empirical log2 Fold Changes (log2FC) and then set the gene's expression as the product of the gene's paired normal expression and 2 raised to the power of the log2FC value. To generate the expression value for an altered transcript in a tumor sample, we randomly sampled a log2FC from a gamma distribution with the parameters described in Table 5 and set said gene's expression to the product of the gene's normal expression and 2 raised to the power of the log2FC value. We generated only positive log2FCs for the DEGs to improve the GLM's ability to detect them as differentially expressed across subjects. We specified a gamma distribution for these positive log2FCs since all the absolute-valued log2FC distributions we examined possessed significant right-skew.

We chose to evaluate the methods using the 4 GO terms described in Table 5. In simulation cohort A, 2 of these GO-BPs were seeded with altered transcripts, and thus enriched, and 2 served as controls. In cohort B, none of the 4 GO terms were enriched, thereby setting up an interaction effect between the within-subject and between-subject factors. Within the two enriched GO terms in cohort A, we randomly selected the proportions of genes specified in Table 5 to have altered expression. We used Bernoulli random variables with the probabilities of success outlined in Table 5 to designate the subjects within cohort A that would share all their randomly selected DEGs. The remaining subjects within cohort A had all their DEGs randomly vary across subjects.
It was hypothesized that the percentage of subjects with shared altered transcripts would strongly influence the performance of the GLM+EGS method, since limma assumes the presence of coordination of gene expression across subjects. Thus, we varied the expected proportion of subjects with shared DEGs within cohort A (0.25, 0.48, 0.75) along with the sizes of the two cohorts (2, 3, 7, 10, 20, 30) while holding all other parameters constant. We consequently generated 30 datasets for each parameter combination, leading to a total of 540 datasets for our downstream simulations.

Data preprocessing within Simulation: (A) For the generalized linear model analyses, we preprocessed the simulated data by removing all genes with mean expression values less than 30 across all the simulated transcripts and subsequently added 1 to each of the expression counts. (B) For the single-subject analyses, we applied a two-stage pre-processing method in which we (i) removed all the transcripts with mean expression less than 30 within each sample-pair and (ii) found the union across all pairs of genes remaining and eliminated any genes not contained within. The remaining genes for the single-subject analyses then had 1 added to their expression counts to eliminate any remaining zeroes.

Application of Methods to Simulated Data: The GLM+EGS and the two versions of the Inter-N-of-1 method were applied to each of the generated datasets as described previously. The Benjamini-Hochberg False Discovery Rate (FDR) (Benjamini and Hochberg, 1995) adjustments of the p-values generated for each technique were performed with respect to only the 4 selected GO terms that were tested for each combination of dataset and method. GO-BPs were declared positive for a method if their associated FDR-adjusted p-values for said method were below 0.05.
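The data generation described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the simulation code used in the study: the decile-based empirical sampling of null log2FCs is replaced by drawing from a small user-supplied list, all function and variable names are hypothetical, and only the gamma parameters are taken from Table 5.

```python
import random

GAMMA_SHAPE = 0.55  # Table 5
GAMMA_SCALE = 6.06  # Table 5

def assign_degs(gene_set, deg_fraction, n_subjects, p_shared, rng):
    """Choose DEGs per cohort-A subject within one enriched gene set.

    Each subject is flagged by a Bernoulli(p_shared) draw; flagged subjects
    share one common DEG list, the others draw their own list.
    """
    genes = sorted(gene_set)
    n_deg = round(deg_fraction * len(genes))
    shared = set(rng.sample(genes, n_deg))
    per_subject = []
    for _ in range(n_subjects):
        if rng.random() < p_shared:
            per_subject.append(set(shared))
        else:
            per_subject.append(set(rng.sample(genes, n_deg)))
    return per_subject

def simulate_tumor(normal, deg_genes, null_log2fcs, rng):
    """Tumor expression = normal expression * 2 ** log2FC for every gene."""
    tumor = {}
    for gene, expr in normal.items():
        if gene in deg_genes:
            # DEGs: positive log2FC from a gamma distribution.
            log2fc = rng.gammavariate(GAMMA_SHAPE, GAMMA_SCALE)
        else:
            # Non-DEGs: empirical log2FC (simplified stand-in for decile sampling).
            log2fc = rng.choice(null_log2fcs)
        tumor[gene] = expr * 2 ** log2fc
    return tumor

rng = random.Random(0)
normal = {f"gene{i}": 100.0 for i in range(20)}       # toy normal profile
gene_set = [f"gene{i}" for i in range(10)]            # toy enriched gene set
degs = assign_degs(gene_set, 0.2, 5, 0.48, rng)       # 5 subjects in cohort A
tumor = simulate_tumor(normal, degs[0], [-0.1, 0.0, 0.1], rng)
```

Cohort B would be generated with an empty DEG set for every subject, which yields the responsive versus unresponsive interaction pattern described above.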
Accuracy measures within the Simulation: To estimate the overall performance of each method within the simulation, we calculated the number of true positives, true negatives, false positives, and false negatives occurring within the 2 enriched and 2 control GO terms across all 30 resamplings of each combination of parameters. When any of the methods made no positive predictions for the gene sets, we assigned values of 0 to the precision and recall of the given method. Otherwise, we calculated the precision and recall using their traditional formulae (Powers, 2020). 30 accuracy scores are thus available for each combination of parameters for each GO term size (40 and 200).

3 Results

Fig. 3. At the transcript level, limited accuracies of Generalized Linear Models for calculating conventional simple contrast or interactions in small heterogenic breast cancer cohorts. While GLMs can deliver DEGs in small cohorts for isogenic cellular and animal models, we recapitulate in the TCGA datasets that small human cohorts are underpowered statistically. We calculated the precision and recall scores associated with each of the 100 random sub-samplings of cohort sizes 2vs2, 3vs3, ..., 19vs19 for TP53 vs PIK3CA and report median accuracies. The left panel used a simple linear contrast of the tumor levels on the molecular subtypes. The right panel used a linear contrast corresponding to the interaction between the molecular subtypes (TP53 vs PIK3CA) and tumor status (breast cancer vs normal breast). Discoveries were performed with limma while the reference standard was constructed with edgeR.

We showed that a two-step process, in which we first enrich the signal-to-noise ratio by applying S3-analyses to paired data in single subjects before combining across subjects, can capture stable signal and yield results comparable to those in the reference standard, even as cohort size decreases.
By contrast, traditional techniques for identification of gene set-level biomechanisms that differentiate between two cohorts rapidly lose power and yield unreliable results as the sample size decreases below 5 subjects per cohort.

The transcriptomic analyses of TCGA data in Figure 3 recapitulate that small human cohorts are particularly difficult to analyze using GLMs due to their heterogenic conditions and lack of a controlled environment. Thus, small human cohorts present a stark contrast to isogenic, controlled experiments in cell lines or animal models, where the high signal-to-noise ratio makes transcriptomic analyses possible for very small sample sizes. These unsurprising results provide the justification for the development of the proposed GLM+EGS and Inter-N-of-1 methods conducted at the gene set level. They also attest to the intrinsic lack of signal within the TCGA breast cancer data for such transcriptomic analyses.

Fig. 4. At the gene set level, two Inter-N-of-1 methods outperform a GLM followed by enrichment in small heterogenic human cohorts. While Inter-N-of-1 methods (Inter-N-of-1 (NOISeq) and Inter-N-of-1 (MixEnrich)) outperform the GLM followed by enrichment in gene sets for sample sizes of 7vs7 and smaller, the GLM+EGS shows better accuracy at sample sizes 9vs9 and above. Of note, GLM+EGS shows large variations in performance measures within the samples of size 8vs8, suggesting that despite its improved median accuracy it remains unreliable at that level. In all cases, the discovery of differentially responsive gene sets (Inter-N-of-1 methods) or enriched gene sets (GLM+EGS) substantially outperforms the accuracies of transcript-level analyses shown in Fig. 3. While the Inter-N-of-1 and GLM+EGS methods identify related signals, the reference standard designed by a distinct GLM+EGS approach favors the accuracies of the latter. In addition, Inter-N-of-1 methods can assess the effect size of responsive gene sets in each subject, which can be illustrated as box plots of gene set response. In contrast, GLM+EGS methods are limited to a single description of over-representation calculated on interacting transcripts of the entire study. We calculated the precision and recall scores associated with each of the 100 random sub-samplings of cohort sizes 2vs2, 3vs3, 4vs4, 5vs5, 7vs7, 8vs8, 9vs9 for TP53 and PIK3CA subjects with the GLM+EGS and Inter-N-of-1 methods: (i) Inter-N-of-1 (NOISeq), and (ii) Inter-N-of-1 (MixEnrich). The arms extend from the lower quartile to the upper quartile of the respective performance measure, and the two arms cross at the median for the precision and recall for that technique at the indicated cohort size.

The performance results for subsets of the TCGA breast cancer data shown in Figure 4 establish that the two versions of the proposed Inter-N-of-1 method degrade more gracefully in performance with decreasing cohort size than traditional generalized linear model-based methods, thereby allowing them to outperform for smaller cohort sizes. Figure 4 shows that the niche where the Inter-N-of-1 methods outperform in terms of median precision and recall extends to all cohort sizes below 7vs7, with the GLM+EGS method achieving higher median performance scores for 9vs9 and above. The sizes of the crosses suggest a further advantage of the developed methods beyond this better 'on average' performance. The Inter-N-of-1 methods tend to have very small, tight crosses, suggesting low variation in performance and greater consistency.
The GLM+EGS method, on the other hand, possesses very large crosses until cohort size 9vs9, suggesting wild swings in performance across the different subsets evaluated. In addition, even the gene set-level GLM+EGS method outperforms transcript-level GLM analyses (Fig. 3 vs Fig. 4). Figure 4 also establishes that the N-of-1-MixEnrich version of the Inter-N-of-1 method outperforms the NOISeq version in terms of consistency and median precision and recall. Although these differences remain small for larger cohort sizes of 7vs7 and above, they increase gradually with decreasing cohort sizes.

The simulations indicate that the proposed Inter-N-of-1 methods outperform GLM+EGS for small sample sizes within parameters derived from cancer datasets and extended to investigate other conditions. Fig. 5 shows that the two Inter-N-of-1 methods are unaffected by changes in the expected proportion of subjects within cohorts with shared DEGs, since their performance scores typically oscillate randomly around a fixed point given a fixed cohort size. These fixed points come closer to the perfect score of 1.0 precision and 1.0 recall with increasing cohort size, suggesting that mainly the cohort size affects the Inter-N-of-1 method. The N-of-1-MixEnrich version of the Inter-N-of-1 method generally performs the best out of all three methods, with its precision always staying at 90% or higher and its recall staying at 75% or above for all parameter configurations. The NOISeq version of the Inter-N-of-1 method suffers from a higher rate of false negatives for the two smallest tested cohort sizes of 2 and 3 and so displays significantly less recall than the N-of-1-MixEnrich version of the Inter-N-of-1 method, although it does display similar levels of precision. Thus, this simulation also unveils the reason why Inter-N-of-1 (NOISeq) did not perform as well.
Both cohort size and the expected proportion of subjects within groups with coordinated DEGs affect the performance of the GLM+EGS method. Increasing either of these parameters significantly improves the performance of the GLM+EGS method, with the single exception of the 2vs2 cohort size, where GLM+EGS produces 0 precision and recall for all specifications of the proportion of subjects within group with coordinated DEGs. At the anti-conservative levels for these parameters, the GLM+EGS method matches the performance of the two versions of the Inter-N-of-1 method. However, decreasing either parameter quickly leads the GLM+EGS method to underperform. For cohort sizes of 10vs10 and lower, the GLM+EGS method fails to match the performance of the two versions of the Inter-N-of-1 method, which supports the superiority of Inter-N-of-1 in such small sample sizes for breast cancer-like data.

4 Discussion

As stated in the introduction, empirical evidence suggests the existence of a methodological gap when comparing transcriptomic differences in biomechanisms within very small human cohorts due to variations of heterogenicity, uncontrolled biology (age, gender, etc.), and diversity of environmental factors (nutrition, sleep, etc.). As expected, state-of-the-art generalized linear models decline in performance with sample sizes less than 5 (Soneson and Delorenzi, 2013). Smaller datasets require variances to be as low as those observed between technical replicates or with the isogenic conditions of cellular and animal models. Yet, even in such isogenic conditions, two studies have recommended at least 6 biological replicates for applying generalized linear models (Liu, et al., 2014; Schurch, et al., 2016). Examining two-factor interactions in transcriptomes (cohorts × tumor status) further inflates the required sample size by a factor of 4 (Brookes, et al., 2004; Fleiss, 2004; Leon and Heo, 2009).
Traditional cohort-based methods impose sample size requirements which simply cannot be met within the framework imposed by rare diseases, prompting the need to develop new methods.

On the other hand, we and others have shown it is possible to obtain statistical significance of gene set-level effect size measures from single samples without replicates taken in two conditions, namely single-subject studies (S3) (Li, et al., 2017; Li, et al., 2017; Schissler, et al., 2015; Vitali, et al., 2017). We have shown evidence from breast cancer studies and simulations that the S3-anchored Inter-N-of-1 addresses this methodological gap. The slow decay in performance of the Inter-N-of-1 methods, when contrasted with the abrupt decay of GLM+EGS, establishes the superiority of these methods for sample sizes of $S_A = S_B \in \{2,3,4,5,6\}$ when applied to our TCGA breast cancer dataset. Comparison of the median precision and recall of the three considered techniques shows that on average our methods exhibit greater power and, importantly, less variable performance than GLM+EGS at these low cohort sizes. Furthermore, our simulation study confirmed that both versions of the Inter-N-of-1 provide substantially improved recall over the GLM+EGS method at small cohort sizes while still maintaining equivalent levels of precision. The simulation results also establish that the expected proportion of subjects with coordinated DEGs within cohorts plays a critical role in determining the range of cohort sizes in which the developed methods outperform traditional generalized linear model-based techniques. In datasets where the proportion of subjects within cohorts sharing their DEGs is lower than 48%, the Inter-N-of-1 methods continue to outperform the GLM+EGS method for cohort sizes larger than 20.

Figure 5. Comparison of accuracy of GLM+EGS and Inter-N-of-1 methods within the simulation. We generated subject tumor-normal pairs for a variety of cohort sizes (2vs2, 3vs3, 7vs7, 10vs10, 20vs20, 30vs30) and expected proportions of subjects with shared DEGs in cohort A (0.25, 0.48, 0.75). We simulated 30 datasets for each parameter configuration and applied the proposed Inter-N-of-1 methods and the GLM+EGS method to each. We calculated the total number of true positives, false positives, false negatives, and true negatives across all iterations and used them to calculate the precision and recall for each combination of method, parameter configuration, and GO term size. Separate graphs are made for each parameter configuration and plot the resulting precision and recall measures for each method for the gene sets of size 40. The results for gene sets of size 200 were very similar to the above results and so were excluded. The N-of-1-MixEnrich version of the Inter-N-of-1 method performs excellently and achieves near-perfect scores for cohort sizes above 2. The NOISeq version of the Inter-N-of-1 method often fails to identify positive signal for cohort sizes of 3 or smaller, but otherwise achieves performance scores near those of the N-of-1-MixEnrich version of the Inter-N-of-1 method. The two versions of Inter-N-of-1 appear to be unaffected by changes in the expected proportion of subjects with shared DEGs, since their performance scores within each graph oscillate around the same general area and show no overall trend. The GLM+EGS method often struggles to identify positive signal for smaller cohort sizes, although increasing the expected proportion of subjects within cohorts with coordinated DEGs improves the recall of the method and decreases the minimum sample size needed for it to perform near perfectly. The GLM+EGS method always shows excellent precision and control of the overall FDR for all except the cohort sizes of 2.

Several limitations were observed. (1) This study focuses on parameters related to cancers, where there are substantial differences between paired normal tissue and cancer tissue. While single-subject studies have been shown to be effective in viral response (Gardeux, et al., 2017; Gardeux, et al., 2015) or response to therapy (Li, et al., 2017), it remains to be demonstrated that the downstream Inter-N-of-1 methods can outperform transcript-level methods in those biological conditions. (2) The simulation does present some inconsistencies with observations made within the TCGA breast cancer subsets. This can probably be explained by the fact that the breast cancer analyses used a reference standard that favored GLM+EGS over Inter-N-of-1 methods by design. (3) We explored only one type of difference within gene set response between cohorts in the simulations: a cohort responsive vs unresponsive. We are thus undertaking the complementary analysis to compare the more general paradigm of gene sets more responsive in one cohort than in the other. (4) Finally, although the developed methods allow for a more accurate testing of interactions in datasets with small sample sizes, the importance of balancing confounders between the two cohorts should not be understated. The small samples used within these analyses prevent randomization from balancing key covariates and confounders between cohorts. Future studies could model unbalanced covariates through data or knowledge fusion with external datasets.
(5) Transcript independence assumptions in the calculation of the single-subject odds ratio and its variance (Inter-N-of-1 methods) may be violated. However, many such assumptions are routinely overlooked in related analyses, such as BH-FDR (Benjamini and Hochberg, 1995), whose similar limitations were later rectified by the BY-FDR (Benjamini and Yekutieli, 2001). Viewed from that perspective, computational biology may progress by first proving new models and then addressing their biases in subsequent studies. (6) Other unbiased approaches to generating gene sets could have been utilized (e.g., co-expression networks from independent datasets, protein interaction networks, etc.). (7) Of note, few datasets are available with two measures in different conditions per subject and more than one clinical cohort of subjects. Similar to physics, where experiment and theory influence one another, our work presents improvements on solving an experimental design that is infrequently used and merits more consideration for increasing the signal-to-noise ratio in the study of rare and infrequent diseases. (8) Prospective biologic validation of results is also required in future studies, as we have done with single-subject studies in the past (Gardeux, et al., 2014).

Another consideration is that GLM+EGS and Inter-N-of-1 evaluate different phenomena. The GLM+EGS method primarily discovers GO terms enriched for transcripts; it requires the coordination of signals at the transcript level, across subjects belonging to the same class, before the enrichment. The Inter-N-of-1, on the other hand, assesses whether the proportion of responsive transcripts within a given GO term measured in each subject significantly differs across cohorts at the gene set level.
In other words, in the Inter-N-of-1, the transcripts' contribution to the gene set signal may differ between subjects, while in the GLM+EGS methods a transcript-level coordination is required. The Inter-N-of-1 favors clinical applications where gene set mechanisms are causal to the disease. Cancer is one such condition, where numerous genetic and transcriptomic root causes may differ between subjects and yet converge to comparable cellular and clinical phenotypes.

In conclusion, the proposed S3-anchored effect size methods demonstrate the utility of within-subject paired sample designs for better controlling within-patient background genetic variation and thereby identifying clearer signal with small numbers of subjects. These approaches first simplify the heterogenicity between subjects with better controlled single-subject studies reminiscent of experimental isogenic models (e.g., cell lines or mouse models). These results motivate further studies of new experimental designs, where paired within-subject samples allow analyses of datasets previously considered too small. The new design presents opportunities not only in terms of performance within small subject cohorts, but also in terms of utility. The use of single-subject methods within the Inter-N-of-1 creates an avenue for examining subject variability within cohorts. By examining the single-subject results, one can directly see the degree of concordance and discordance amongst subjects and answer questions pertaining to whether specific subjects possess the overall observed signal. Thus, the Inter-N-of-1 presented here represents not just a new method that performs better within small sample sizes, but also an example of how to borrow knowledge from gene sets for more powerful measures of dispersion in a single subject to conduct studies of rare or infrequent diseases and analyses of patient variability within and across cohorts.
In addition, precision therapies designed for increasingly sub-stratified common disorders can benefit from the proposed methods. The strategies and methods presented here open a new frontier that may greatly enrich our understanding of the genetic foundations of rare diseases.

Acknowledgements
We acknowledge Branden Lau for performing alignment of the MCF7 SRA files.

Funding
This work was supported in part by The University of Arizona Health Sciences Center for Biomedical Informatics and Biostatistics, the BIO5 Institute, and the NIH (U01AI122275 and 5R21AI152394). This article did not receive sponsorship for publication.

Conflict of Interest: none declared.

References
Agresti, A. and Kateri, M. Categorical data analysis. Springer Berlin Heidelberg; 2011.
Andre, F., et al. Alpelisib for PIK3CA-Mutated, Hormone Receptor-Positive Advanced Breast Cancer. N Engl J Med 2019;380(20):1929-1940.
Ashburner, M., et al. Gene ontology: tool for the unification of biology. Nature Genetics 2000;25(1):25.
Balli, M., et al. Autologous micrograft accelerates endogenous wound healing response through ERK-induced cell migration. Cell Death & Differentiation 2019:1-19.
Benjamini, Y. and Hochberg, Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J R Stat Soc B 1995;57(1):289-300.
Benjamini, Y. and Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Annals of statistics 2001:1165-1188.
Berghout, J., et al.
Single subject transcriptome analysis to identify functionally signed gene set or pathway activity. In, PSB. World Scientific; 2018. p. 400-411.
Berghout, J., et al. Single subject transcriptome analysis to identify functionally signed gene set or pathway activity. Pac Symp Biocomput 2018;23:400-411.
Brookes, S.T., et al. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology 2004;57(3):229-236.
Cancer Genome Atlas, N. Comprehensive molecular portraits of human breast tumours. Nature 2012;490(7418):61-70.
Ciriello, G., et al. Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer. Cell 2015;163(2):506-519.
Edgar, R., Domrachev, M. and Lash, A.E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research 2002;30(1):207-210.
Elliott, E.J. and Zurynski, Y.A. Rare diseases are a 'common' problem for clinicians. Australian family physician 2015;44(9):630.
Fisher, R.A. The logic of inductive inference. Journal of the Royal Statistical Society 1935;98(1):39-82.
Fleiss, J. The design and analysis of clinical experiments. 1986. New York, John Wiley & Sons 2004.
Gardeux, V., et al. Concordance of deregulated mechanisms unveiled in underpowered experiments: PTBP1 knockdown case study. BMC medical genomics 2014;7(1):1-13.
Gardeux, V., et al. Concordance of deregulated mechanisms unveiled in underpowered experiments: PTBP1 knockdown case study. BMC Med Genomics 2014;7 Suppl 1(S1):S1.
Gardeux, V., et al. A genome-by-environment interaction classifier for precision medicine: personal transcriptome response to rhinovirus identifies children prone to asthma exacerbations. Journal of the American Medical Informatics Association 2017;24(6):1116-1126.
Gardeux, V., et al. Towards a PBMC “virogram assay” for precision medicine: Concordance between ex vivo and in vivo viral infection transcriptomes.
Journal of biomedical informatics 2015;55:94-103.
Griggs, R.C., et al. Clinical research for rare disease: opportunities, challenges, and solutions. Mol Genet Metab 2009;96(1):20-26.
Grossman, R.L., et al. Toward a Shared Vision for Cancer Genomic Data. N Engl J Med 2016;375(12):1109-1112.
Guillem, P., et al. Rare diseases in disabled children: an epidemiological survey. Arch Dis Child 2008;93(2):115-118.
Kim, J.Y., et al. Clinical implications of genomic profiles in metastatic breast cancer with a focus on TP53 and PIK3CA, the most frequently mutated genes. Oncotarget 2017;8(17):27997-28007.
Law, C.W., et al. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome biology 2014;15(2):R29.
Leon, A.C. and Heo, M. Sample sizes required to detect interactions between two binary fixed-effects in a mixed-effects linear regression model. Computational statistics & data analysis 2009;53(3):603-608.
Li, Q., et al. N-of-1-pathways MixEnrich: advancing precision medicine via single-subject analysis in discovering dynamic changes of transcriptomes. BMC Med Genomics 2017;10(Suppl 1):27.
Li, Q., et al. kMEn: Analyzing noisy and bidirectional transcriptional pathway responses in single subjects. J Biomed Inform 2017;66:32-41.
Liu, Y., Zhou, J. and White, K.P. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 2014;30(3):301-304.
Millard, S.P., Kowarik, A. and Kowarik, M.A. Package ‘EnvStats’. 2020.
Ozturk, K., et al. The Emerging Potential for Network Analysis to Inform Precision Cancer Medicine. J Mol Biol 2018;430(18 Pt A):2875-2899.
Powers, D.M. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061 2020.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26(1):139-140.
Robinson, M.D. and Oshlack, A.
A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 2010;11(3):R25.
Schissler, A.G., et al. Dynamic changes of RNA-sequencing expression for precision medicine: N-of-1-pathways Mahalanobis distance within pathways of single subjects predicts breast cancer survival. Bioinformatics 2015;31(12):293-302.
Schissler, A.G., et al. Analysis of aggregated cell–cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells. Bioinformatics 2016;32(12):i80-i89.
Schissler, A.G., Piegorsch, W.W. and Lussier, Y.A. Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat Methods Med Res 2018;27(12):3797-3813.
Schurch, N.J., et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 2016;22(6):839-851.
Smyth, G.K., et al. LIMMA: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. 2005.
Soneson, C. and Delorenzi, M. A comparison of methods for differential expression analysis of RNA-seq data. BMC bioinformatics 2013;14(1):91.
Subramanian, A., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005;102(43):15545-15550.
Tao, Y., et al. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics 2007;23(13):i529-538.
Tarazona, S., et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic acids research 2015;43(21):e140-e140.
Van Keymeulen, A., et al. Reactivation of multipotency by oncogenic PIK3CA induces breast tumour heterogeneity. Nature 2015;525(7567):119-123.
Vitali, F., et al.
Developing a ‘personalome’ for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes. Briefings in Bioinformatics 2017;20(3):789-805.
Wan, Y.W., Allen, G.I. and Liu, Z. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 2016;32(6):952-954.
Woolf, B. On estimating the relation between blood group and disease. Ann Hum Genet 1955;19(4):251-253.
Yang, X., et al. Single sample expression-anchored mechanisms predict survival in head and neck cancer. PLoS Comput Biol 2012;8(1):e1002350.
Zaim, S.R., et al. Evaluating single-subject study methods for personal transcriptomic interpretations to advance precision medicine. BMC Medical Genomics 2019;12(5):96.
Zaim, S.R., et al. Emergence of pathway-level composite biomarkers from converging gene set signals of heterogeneous transcriptomic responses. Pac Symp Biocomput 2018;23:484-495.

10_1101-2021_02_10_430649 ---- Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data
Zi-Hang Wen,1 Jeremy L.
Langsam,2 Lu Zhang,3 Wenjun Shen4,∗ and Xin Zhou2,5,6,∗
1School of Optical and Electronic Information, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, Hubei, China, 2Department of Biomedical Engineering, Vanderbilt University, 2301 Vanderbilt Place, 37235, Nashville, USA, 3Department of Computer Science, Hong Kong Baptist University, Room R708, Sir Run Run Shaw Building, Kowloon Tong, Hong Kong, 4Department of Bioinformatics, Shantou University Medical College, No. 22 Xinling Road, Shantou, 515041, Guangdong, China, 5Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, 37235, Nashville, USA and 6Data Science Institute, Vanderbilt University, Sony Building, 1400 18th Ave S Building, Suite 2000, 37212, Nashville, USA
∗Corresponding authors: maizie.zhou@vanderbilt.edu; wjshen@stu.edu.cn

Abstract
Single-cell RNA-seq (scRNA-seq) offers opportunities to study gene expression of tens of thousands of single cells simultaneously, to investigate cell-to-cell variation, and to reconstruct cell-type-specific gene regulatory networks. Recovering dropout events in a sparse gene expression matrix for scRNA-seq data is a long-standing matrix completion problem. We introduce Bfimpute, a Bayesian factorization imputation algorithm that reconstructs two latent gene and cell matrices to impute the final gene expression matrix within each cell group, with or without the aid of cell type labels or bulk data. Bfimpute achieves better accuracy than six other notable published scRNA-seq imputation methods on simulated and real scRNA-seq data, as measured by several different evaluation metrics. Bfimpute can also flexibly integrate any gene- or cell-related information that users provide to increase performance.
Availability: Bfimpute is implemented in R and is freely available at https://github.com/maiziezhoulab/Bfimpute.
Key words: single cell; RNA-seq; imputation; Bayesian factorization

Introduction
Single-cell RNA-seq (scRNA-seq) has been widely used to study genome-wide transcriptomes at single-cell resolution. The cellular resolution made possible by scRNA-seq data distinguishes it from bulk RNA-seq and makes it advantageous in investigating cell-to-cell variation [1]. Today, different commercial platforms are available to perform scRNA-seq, including Fluidigm C1, Wafergen ICELL8 and 10X Genomics Chromium. Droplet-based methods via 10X Genomics Chromium can process tens of thousands of cells; microwell-based and microfluidic-based methods via Fluidigm C1 and Wafergen ICELL8 process fewer cells but with a higher sequencing depth. For all these platforms, missing values make up a large proportion of scRNA-seq data, ranging from 40% to 90% of the gene expression count matrix [2, 3, 4, 5, 6]. In scRNA-seq data, this large percentage of missing events is defined as the so-called ‘dropout’ phenomenon [7]. Gene ‘dropout’ means a gene is observed at a moderate expression level in one cell but is not detected in another cell of the same type. Analyses of scRNA-seq data, including dimensionality reduction, clustering, and Differential Expression (DE) analysis, have shown that effective imputation of dropout events improves downstream analyses and assists biological interpretation [8, 9, 10, 11].

To date, several notable imputation methods have been proposed: scImpute [12], DrImpute [13], MAGIC [14], SAVER [15], VIPER [16] and SCRABBLE [17]. scImpute first performs clustering to identify cell subpopulations, then identifies dropout events through a Gamma-Normal mixture model, and finally imputes dropout events by a non-negative least squares regression [12]. DrImpute optimizes the step of identifying cell subpopulations to impute dropout events by averaging the imputation from multiple clustering results [13].
MAGIC builds a Markov affinity-based graph for imputation relying on cell-to-cell interactions [14]. SAVER uses a Bayesian-based model with various prior probabilities, and alters all gene expression values [15]. VIPER imputes dropout events relying on local neighborhood cells via non-negative sparse regression models [16]. SCRABBLE has been recently introduced to impute dropout events by adopting the bulk RNA-seq data [17]. Even though considerable effort has gone into analyzing and imputing real dropout events, imputation of dropout events is still a difficult problem because of the high dropout rate and complex cellular heterogeneities of different scRNA-seq datasets.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430649; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

[Figure 1: schematic of the expression matrix $E$ (Gene1 to GeneN by Cell1 to CellM, containing true values and dropouts) factorized into the gene latent matrix $G^{T}$ and the cell latent matrix $C$. Training: $p(E \mid G, C, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{M} [\mathcal{N}(E_{ij} \mid G_i^{T} C_j, \alpha^{-1})]^{I_{ij}}$. Imputing: $\hat{E} = G^{T} C$.]
Fig. 1. A brief illustration of the architecture of the Bfimpute method. In each group, Bfimpute borrows information from true values and factorizes the expression matrix into two latent matrices using MCMC. After training, Bfimpute imputes dropouts by taking the product of the latent matrices. Details are given in the Methods section.

Relying on matrix completion to
impute missing values is a long-standing question and has been investigated in the biological sciences, including gene expression prediction, miRNA–disease association and protein-protein interaction [18]. Even though similar mathematical models could be applied to different biological problems, to solve the matrix completion problem in scRNA-seq (recovering the dropout events), it is crucial to take the features of scRNA-seq into consideration. Most existing scRNA-seq imputation methods have shown that it is advantageous for imputation to borrow and leverage information from similar cells. In recent years, researchers have also started to integrate additional gene- or cell-related information (e.g. bulk data for SCRABBLE) to assist imputation, which is important in the matrix completion problem.

In this study, we present Bfimpute, a powerful imputation tool for scRNA-seq data that recovers dropout events by factorizing the count matrix into the product of gene-specific and cell-specific feature matrices [19, 20]. Bfimpute uses full Bayesian inference to describe the latent information for genes and cells and carries out a Markov chain Monte Carlo scheme that can easily incorporate any gene- or cell-related information to train the model and perform the imputation [18] (Figure 1). We demonstrate that Bfimpute performs better than the six other notable published imputation methods mentioned above (scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE) on both simulated and real scRNA-seq datasets in improving clustering and differential gene expression analyses and in recovering gene expression temporal dynamics (pseudotime analysis) [21].

Methods

Cell clustering and dropout detection
Bfimpute first provides an optional normalization step to smooth the gene expression values (counts per million, followed by logarithm base 10 with bias 1.01). Bfimpute then performs a local imputation within each cell group.
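The optional normalization step can be illustrated with a minimal sketch. It assumes that "logarithm base 10 with bias 1.01" means log10(CPM + 1.01) and that the matrix is stored genes by cells; the function name `normalize_log_cpm` is hypothetical and is not part of the Bfimpute package.

```python
import math

def normalize_log_cpm(counts, bias=1.01):
    """Return log10(CPM + bias) for a genes x cells matrix of raw counts.

    Assumption: the bias is added to the CPM value before taking log10.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    # Library size of each cell = sum of its column.
    lib = [sum(counts[g][c] for g in range(n_genes)) for c in range(n_cells)]
    out = []
    for g in range(n_genes):
        row = []
        for c in range(n_cells):
            # Counts per million; a cell with no reads gets CPM 0.
            cpm = counts[g][c] * 1e6 / lib[c] if lib[c] > 0 else 0.0
            row.append(math.log10(cpm + bias))
        out.append(row)
    return out

# Toy matrix: 3 genes x 2 cells.
counts = [[10, 0], [30, 5], [60, 15]]
normalized = normalize_log_cpm(counts)
```

With this convention a zero count maps to log10(1.01), a small positive constant, so zeros remain distinguishable from expressed values after the transform.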
We adopt the same approach as scImpute [12] to detect cell clusters, which applies spectral clustering to the result of Principal Component Analysis (PCA) to reduce the impact of dropout events. We perform spectral clustering using the 'Spectrum' function of the Spectrum R package [22] or the 'specc' function of the kernlab R package [23]. Bfimpute also adopts the Gamma-Normal mixture distribution model from scImpute to determine dropout events [12].

Probabilistic model for scRNA-seq expression matrix imputation
After the above-mentioned steps, we adapt a multivariate prior model from Bayesian Probabilistic Matrix Factorization (BPMF) [20] to recover dropouts in scRNA-seq datasets. Since every cell group is mathematically equivalent, we arbitrarily choose one to demonstrate local imputation in Bfimpute. Suppose we have $N$ genes and $M$ cells in one cell group, and the expression matrix is $E \in \mathbb{R}^{N \times M}$. Each entry $E_{ij}$ represents the expression level of gene $i$ in cell $j$. Bfimpute factorizes $E$ into $G \in \mathbb{R}^{D \times N}$ and $C \in \mathbb{R}^{D \times M}$, which are defined as the gene and cell latent matrices, respectively, where $D$ is the dimension of the latent factor. The column vectors $G_i$ and $C_j$ represent the gene-specific and cell-specific latent vectors, respectively. The imputed matrix to recover $E$ is given by $\hat{E} = G^{T} C$.
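As a small illustration of the shapes involved, the sketch below computes $\hat{E} = G^{T} C$ for toy latent matrices and then fills only the entries flagged as dropouts. The helper names `impute_from_factors` and `fill_dropouts`, and the choice to keep observed entries unchanged, are assumptions made for this example rather than a description of the package's code.

```python
def impute_from_factors(G, C):
    """G: D x N gene latent matrix, C: D x M cell latent matrix.

    Returns E_hat = G^T C as an N x M list of lists.
    """
    D, N, M = len(G), len(G[0]), len(C[0])
    return [[sum(G[d][i] * C[d][j] for d in range(D)) for j in range(M)]
            for i in range(N)]

def fill_dropouts(E, I, E_hat):
    """Keep observed entries (I == 1); replace dropouts (I == 0) by E_hat."""
    return [[E[i][j] if I[i][j] == 1 else E_hat[i][j]
             for j in range(len(E[0]))]
            for i in range(len(E))]

# Toy example with D = 2 latent dimensions, N = 3 genes, M = 2 cells.
G = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 0.0]]
C = [[2.0, 1.0],
     [4.0, 0.0]]
E_hat = impute_from_factors(G, C)

E = [[4.2, 0.0], [0.0, 0.0], [3.9, 2.1]]
I = [[1, 0], [0, 1], [1, 1]]   # 0 marks a detected dropout
completed = fill_dropouts(E, I, E_hat)
```

Note that the entry for gene 2 in cell 2 is a zero with indicator 1, so it is treated as a true zero and left unchanged.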
We introduce the Gaussian noise model for the gene expression profile $E$ with precision $\alpha$, which was first proposed in Probabilistic Matrix Factorization (PMF) [19]:

\[ p(E \mid G, C, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ \mathcal{N}\left(E_{ij} \mid G_i^{T} C_j, \alpha^{-1}\right) \right]^{I_{ij}} \tag{1} \]

where $I_{ij}$ is the indicator function that is 0 if $E_{ij}$ is a dropout and 1 otherwise. To make use of gene- or cell-related information, such as bulk data or other user-provided data, we add entity features $S^{G} \in \mathbb{R}^{F_G \times N}$ and $S^{C} \in \mathbb{R}^{F_C \times M}$ as the gene and cell feature matrices, respectively, where $F_G$ and $F_C$ are the dimensionalities of these additional features. The Gaussian model for the prior distributions over the gene and cell latent vectors, adapted from Macau [18], is given by:

\[ p(G_i \mid S^{G}_i, \mu_G, \Lambda_G, \beta_G) = \mathcal{N}\left(G_i \mid \mu_G + \beta_G^{T} S^{G}_i, \Lambda_G^{-1}\right) \]
\[ p(C_j \mid S^{C}_j, \mu_C, \Lambda_C, \beta_C) = \mathcal{N}\left(C_j \mid \mu_C + \beta_C^{T} S^{C}_j, \Lambda_C^{-1}\right) \tag{2} \]

where $\{\mu_G, \mu_C\}$ and $\{\Lambda_G, \Lambda_C\}$ are the means and precisions, and $\beta_G \in \mathbb{R}^{F_G \times D}$ and $\beta_C \in \mathbb{R}^{F_C \times D}$ are the weight matrices for the entity features. The weights are initialized from a zero-mean normal distribution and are updated iteratively by the Bayesian inference steps (details described later). If no additional information is given, direct imputation of single-cell RNA-seq data can be applied by setting the feature vectors $S^{G}$ and $S^{C}$ to zeros (with $F_G = F_C = 1$).

To perform Bayesian inference, we introduce priors for $\{\mu_G, \Lambda_G\}$ and $\{\mu_C, \Lambda_C\}$, following BPMF [20]:

\[ p(\mu_G, \Lambda_G \mid \mu_0, \beta_0, \nu_0, W_0) = \mathcal{N}\left(\mu_G \mid \mu_0, (\beta_0 \Lambda_G)^{-1}\right) \times \mathcal{W}(\Lambda_G \mid W_0, \nu_0) \]
\[ p(\mu_C, \Lambda_C \mid \mu_0, \beta_0, \nu_0, W_0) = \mathcal{N}\left(\mu_C \mid \mu_0, (\beta_0 \Lambda_C)^{-1}\right) \times \mathcal{W}(\Lambda_C \mid W_0, \nu_0) \tag{3} \]

where $\mathcal{W}$ is the Wishart distribution with $\nu_0$ as the degrees of freedom and $W_0$ as the scale matrix.
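The role of the indicator $I_{ij}$ in the observation model of equation (1) can be made concrete with a short sketch that evaluates the log of that likelihood. Entries with $I_{ij} = 0$ contribute nothing, so the value stored at a dropout position has no influence on training. The function name `masked_gaussian_loglik` and the toy data are illustrative assumptions.

```python
import math

def masked_gaussian_loglik(E, G, C, I, alpha):
    """Log of equation (1): sum over observed entries of
    log N(E_ij | G_i^T C_j, 1/alpha); dropouts (I_ij == 0) are skipped."""
    D = len(G)
    total = 0.0
    for i in range(len(E)):
        for j in range(len(E[0])):
            if I[i][j] == 0:
                continue  # dropout: exponent I_ij = 0, factor equals 1
            pred = sum(G[d][i] * C[d][j] for d in range(D))
            resid = E[i][j] - pred
            total += 0.5 * math.log(alpha / (2.0 * math.pi)) - 0.5 * alpha * resid ** 2
    return total

# D = 1, N = 2 genes, M = 2 cells; predictions are [[1.0, 1.5], [2.0, 3.0]].
G = [[1.0, 2.0]]
C = [[1.0, 1.5]]
E = [[1.0, 5.0], [2.0, 3.0]]
I_masked = [[1, 0], [1, 1]]   # entry (0, 1) is treated as a dropout
I_all = [[1, 1], [1, 1]]      # same data with nothing masked

ll_masked = masked_gaussian_loglik(E, G, C, I_masked, alpha=2.0)
ll_all = masked_gaussian_loglik(E, G, C, I_all, alpha=2.0)
```

In this toy case the three observed entries are fitted exactly, so `ll_masked` contains only the normalizing constants, whereas `ll_all` is additionally penalized by the residual at the entry that was masked.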
We also set a zero-mean normal distribution as the prior for $\beta_G$ and $\beta_C$, and a gamma distribution as the hyperprior for the problem-dependent $\alpha_G$ and $\alpha_C$, adapted from Macau [18]:

\[ p(\beta_G \mid \Lambda_G, \alpha_G) = \mathcal{N}\left(\mathrm{vec}(\beta_G) \mid 0, \Lambda_G^{-1} \otimes (\alpha_G I)^{-1}\right) \]
\[ p(\beta_C \mid \Lambda_C, \alpha_C) = \mathcal{N}\left(\mathrm{vec}(\beta_C) \mid 0, \Lambda_C^{-1} \otimes (\alpha_C I)^{-1}\right) \tag{4} \]

\[ p(\alpha_G \mid k, \theta) = \mathcal{G}(\alpha_G \mid k/2, 2\theta/k) \]
\[ p(\alpha_C \mid k, \theta) = \mathcal{G}(\alpha_C \mid k/2, 2\theta/k) \tag{5} \]

where $\mathrm{vec}(\beta_X)$ is the vectorization of $\beta_X$, $\otimes$ represents the Kronecker product and $\alpha_X$ is the precision ($X \in \{G, C\}$). $k/2$ and $2\theta/k$ are the shape and scale, respectively. $k$ and $\theta$ are hyperparameters, which are set to 1.

Gibbs sampler to impute dropout events
We use a Markov Chain Monte Carlo (MCMC) algorithm to train Bfimpute, which is a sampling-based approach to tackle the Bayesian inference problem. Bfimpute constructs a Markov chain from a random initial value, and after running the chain for $\tilde{K}$ steps it is taken to have converged to its stationary distribution. Bfimpute then uses the average of the $(K - \tilde{K})$ stationary stages to approximate the real distribution of $E$ and obtain the estimated values $\hat{E}_{ij}$ for dropouts:

\[ p(\hat{E}_{ij} \mid E, G, C) \approx \frac{1}{K - \tilde{K}} \sum_{k=\tilde{K}+1}^{K} p\left(\hat{E}_{ij} \mid G_i^{(k)}, C_j^{(k)}, \alpha\right) \tag{6} \]

More specifically, Bfimpute uses a Gibbs sampler to achieve Bayesian matrix factorization. In every cycle, we sample each variable from its conditional distribution derived from the posterior via Bayes’ theorem. Since the probabilistic models of genes and cells are symmetric, the conditional distributions over genes and the conditional distributions over cells have the same form.
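The averaging in equation (6) can be sketched as follows. This example assumes that the point estimate for each entry is taken as the Monte Carlo mean of the predictive means $G_i^{(k)T} C_j^{(k)}$ over the iterations retained after burn-in; the function name `posterior_mean_prediction` and the toy samples are hypothetical.

```python
def posterior_mean_prediction(samples, burn_in):
    """Average G^T C over the Gibbs samples kept after burn-in.

    samples: list of (G, C) pairs, one per iteration k = 1..K, where
             G is D x N and C is D x M (lists of lists).
    burn_in: number of initial iterations to discard (K-tilde).
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("burn_in must be smaller than the number of samples")
    D = len(kept[0][0])
    N = len(kept[0][0][0])
    M = len(kept[0][1][0])
    E_hat = [[0.0] * M for _ in range(N)]
    for G, C in kept:
        for i in range(N):
            for j in range(M):
                E_hat[i][j] += sum(G[d][i] * C[d][j] for d in range(D))
    return [[value / len(kept) for value in row] for row in E_hat]

# Three toy iterations with D = N = M = 1; the first one is discarded.
samples = [
    ([[1.0]], [[1.0]]),   # burn-in, prediction 1.0
    ([[2.0]], [[3.0]]),   # prediction 6.0
    ([[4.0]], [[2.0]]),   # prediction 8.0
]
E_hat = posterior_mean_prediction(samples, burn_in=1)
```

Averaging the per-iteration predictions rather than the latent matrices themselves matters here, because the product of averaged factors is generally not the average of the products.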
In particular, based on (1) and (2), the conditional probability for $G_i$ is:

\[ p(G_i \mid E, C, \alpha, S^{G}_i, \mu_G, \Lambda_G, \beta_G) = \mathcal{N}\left(G_i \mid \mu_i^{(G)'}, \left[\Lambda_i^{(G)'}\right]^{-1}\right) \tag{7} \]
\[ \propto \prod_{j=1}^{M} \left[ \mathcal{N}\left(E_{ij} \mid G_i^{T} C_j, \alpha^{-1}\right) \right]^{I_{ij}} \times p(G_i \mid S^{G}_i, \mu_G, \Lambda_G, \beta_G) \]

where

\[ \Lambda_i^{(G)'} = \Lambda_G + \alpha \sum_{j} I_{ij}\, C_j C_j^{T} \]
\[ \mu_i^{(G)'} = \left[\Lambda_i^{(G)'}\right]^{-1} \left[ \Lambda_G \left( \mu_G + \beta_G^{T} S^{G}_i \right) + \alpha \sum_{j} I_{ij}\, E_{ij} C_j \right] \]

According to (2) and (3), we can derive the conditional probability for $\mu_G$ and $\Lambda_G$:

\[ p(\mu_G, \Lambda_G \mid G, S^{G}, \beta_G, \alpha_G, \mu_0, \beta_0, \nu_0, W_0) = \mathcal{N}\left(\mu_G \mid \mu_0', (\beta_0' \Lambda_G)^{-1}\right) \mathcal{W}(\Lambda_G \mid W_0', \nu_0') \tag{8} \]
\[ \propto \prod_{i=1}^{N} p(G_i \mid S^{G}_i, \mu_G, \Lambda_G, \beta_G) \times p(\mu_G, \Lambda_G \mid \mu_0, \beta_0, \nu_0, W_0) \]

where

\[ \mu_0' = \frac{\beta_0 \mu_0 + N \bar{G}}{\beta_0 + N}, \qquad \beta_0' = \beta_0 + N, \qquad \nu_0' = \nu_0 + N + F_G \]
\[ W_0' = \left[ W_0^{-1} + N \bar{H} + \beta_0 \mu_0 \mu_0^{T} - \beta_0' \mu_0' {\mu_0'}^{T} + \alpha_G \beta_G^{T} \beta_G \right]^{-1} \]
\[ \bar{G} = \frac{1}{N} \sum_{i=1}^{N} \left( G_i - \beta_G^{T} S^{G}_i \right), \qquad \bar{H} = \frac{1}{N} \sum_{i=1}^{N} \left( G_i - \beta_G^{T} S^{G}_i \right) \left( G_i - \beta_G^{T} S^{G}_i \right)^{T} \]

Considering (4) and (5), we get the conditional probability for $\alpha_G$:

\[ p(\alpha_G \mid \beta_G, \Lambda_G, k, \theta) = \mathcal{G}(\alpha_G \mid k'/2, 2\theta'/k') \tag{9} \]
\[ \propto p(\beta_G \mid \Lambda_G, \alpha_G) \times p(\alpha_G \mid k, \theta) \tag{10} \]

where

\[ k' = \frac{(F_G D + \theta)\, k}{\theta + \theta \cdot \mathrm{tr}(\beta_G^{T} \beta_G \Lambda_G)}, \qquad \theta' = F_G D + \theta \]

From (2) and (4), we obtain the conditional probability for $\beta_G$:

\[ p(\beta_G \mid \Lambda_G, \alpha_G, G, S^{G}, \mu_G) = \mathcal{N}\left(\mathrm{vec}(\beta_G) \mid \mu_{\beta_G}, \Lambda_{\beta_G}^{-1}\right) \tag{11} \]
\[ \propto p(\beta_G \mid \Lambda_G, \alpha_G) \times \prod_{i} p(G_i \mid S^{G}_i, \mu_G, \Lambda_G, \beta_G) \]

Because the precision matrix $\Lambda_{\beta_G}$ is too large to compute directly, we instead use the alternative approach first proposed in Macau [18] and calculate:

\[ \tilde{\beta}_G = \left( {S^{G}}^{T} S^{G} + \alpha_G I \right)^{-1} \left( {S^{G}}^{T} \left( \tilde{G} + E_1 \right) + \sqrt{\alpha_G}\, E_2 \right) \tag{12} \]

where $\tilde{G} = (G - \mu_G)^{T}$, and each row of $E_1 \in \mathbb{R}^{N \times D}$ and $E_2 \in \mathbb{R}^{F_G \times D}$ is sampled from $\mathcal{N}(0, \Lambda_G^{-1})$.

The Gibbs sampling steps of Bfimpute are shown in Algorithm 1.

Algorithm 1 Gibbs sampling in Bfimpute
1. Initialize $\{G^{(0)}, C^{(0)}, \beta_G^{(0)}, \beta_C^{(0)}, \alpha_G^{(0)}, \alpha_C^{(0)}\}$
2. For $k = 1, 2, \ldots, K$
   a. Sample the means $\{\mu_G, \mu_C\}$ and precisions $\{\Lambda_G, \Lambda_C\}$ of the gene and cell latent matrices:
      $\mu_G^{(k)}, \Lambda_G^{(k)} \sim p(\mu_G, \Lambda_G \mid G^{(k-1)}, S^{G}, \beta_G^{(k-1)}, \alpha_G^{(k-1)})$
      $\mu_C^{(k)}, \Lambda_C^{(k)} \sim p(\mu_C, \Lambda_C \mid C^{(k-1)}, S^{C}, \beta_C^{(k-1)}, \alpha_C^{(k-1)})$
   b. Sample the gene and cell latent matrices $\{G, C\}$:
      • For each $i = 1, \ldots, N$, sample gene latent vectors in parallel:
        $G_i^{(k)} \sim p(G_i \mid E, C^{(k-1)}, S^{G}_i, \mu_G^{(k)}, \Lambda_G^{(k)}, \beta_G^{(k-1)})$
      • For each $j = 1, \ldots, M$, sample cell latent vectors in parallel:
        $C_j^{(k)} \sim p(C_j \mid E, G^{(k)}, S^{C}_j, \mu_C^{(k)}, \Lambda_C^{(k)}, \beta_C^{(k-1)})$
   c. Sample the precisions $\{\alpha_G, \alpha_C\}$ of the weight matrices:
      $\alpha_G^{(k)} \sim p(\alpha_G \mid \beta_G^{(k-1)}, \Lambda_G^{(k)})$
      $\alpha_C^{(k)} \sim p(\alpha_C \mid \beta_C^{(k-1)}, \Lambda_C^{(k)})$
   d. Sample the weight matrices $\{\beta_G, \beta_C\}$:
      $\beta_G^{(k)} = \left( {S^{G}}^{T} S^{G} + \alpha_G^{(k)} I \right)^{-1} \left( {S^{G}}^{T} \left( \tilde{G}^{(k)} + E_1 \right) + \sqrt{\alpha_G^{(k)}}\, E_2 \right)$
      $\beta_C^{(k)} = \left( {S^{C}}^{T} S^{C} + \alpha_C^{(k)} I \right)^{-1} \left( {S^{C}}^{T} \left( \tilde{C}^{(k)} + E_1 \right) + \sqrt{\alpha_C^{(k)}}\, E_2 \right)$

Generation of simulated data
We first simulated a single-cell RNA-seq count matrix with 20000 genes and 500 cells evenly split into 5 groups using the scater (v1.16.2) [24] and Splatter (v1.12.0) [25] packages. The parameter that controls the probability that a gene will be selected as DE was set to 0.08, while the location and scale factors were set to 0.3 and 0.5, respectively. We used 'experiment' to add the global dropout for every cell. To show the general applicability of Bfimpute, we further generated datasets with 6, 7 and 8 groups of cells, with 600, 700 and 800 total cells, respectively, and 10 runs for each dataset with different seeds, using the same parameters mentioned above.

Quality Control for real datasets
We performed quality control (QC) (https://github.com/gongx030/scDatasets) on all real datasets before imputation to ensure fairness for all methods, except for the PBMCs dataset (see details on GitHub). As the PBMCs dataset is based on the 10X Genomics platform and has an extremely high dropout rate, the QC step would remove nearly 80% of its genes.
Evaluation metrics of clustering results
We used four evaluation measures, the adjusted Rand index [26], Jaccard index [27], normalized mutual information (nmi) [28], and purity score, to analyse the agreement between the true cluster labels and the spectral clustering [22] results on the first two Principal Components (PCs) of the imputed matrix. These measures range from 0 to 1, with 1 indicating a perfect match, except for the adjusted Rand index, which can yield negative values when agreement is less than expected by chance. The adjusted Rand index is an adjusted version of Rand’s statistic [29], which is the probability that a randomly selected pair is classified in agreement. The Jaccard index is similar to the Rand index, but disregards the pairs of elements that are in different clusters in both clusterings [30]. The normalized mutual information combines multiple clusterings into a single one without accessing the original features or algorithms that determine these clusterings. The purity score is the fraction of all cells that are classified correctly.

Results
We demonstrated the performance of Bfimpute in gene expression recovery, data visualization, cell subpopulation clustering, pseudotime and DE analysis on five publicly available scRNA-seq datasets (Supplementary Table 1), and we compared Bfimpute with six state-of-the-art imputation methods (scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE) in the following sections.
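Three of the clustering agreement measures described in the Methods can be sketched with the standard pair-counting definitions. This is an illustrative standard-library implementation, not the code used for the reported results; the function names are hypothetical, normalized mutual information is omitted for brevity, and degenerate inputs (for example a single cluster containing every cell) are not handled.

```python
from collections import Counter
from math import comb

def pair_counts(true, pred):
    """Count cell pairs grouped together in both, in true, and in pred."""
    both = sum(comb(v, 2) for v in Counter(zip(true, pred)).values())
    in_true = sum(comb(v, 2) for v in Counter(true).values())
    in_pred = sum(comb(v, 2) for v in Counter(pred).values())
    return both, in_true, in_pred, comb(len(true), 2)

def jaccard_index(true, pred):
    # Pairs together in both / pairs together in at least one clustering.
    both, in_true, in_pred, _ = pair_counts(true, pred)
    return both / (in_true + in_pred - both)

def adjusted_rand_index(true, pred):
    # Rand-type agreement corrected for the agreement expected by chance.
    both, in_true, in_pred, total = pair_counts(true, pred)
    expected = in_true * in_pred / total
    maximum = 0.5 * (in_true + in_pred)
    return (both - expected) / (maximum - expected)

def purity(true, pred):
    # Each predicted cluster is credited with its most frequent true label.
    best = {}
    for (p, _t), v in Counter(zip(pred, true)).items():
        best[p] = max(best.get(p, 0), v)
    return sum(best.values()) / len(true)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]   # one cell of type 0 is misassigned
scores = (jaccard_index(true_labels, pred_labels),
          adjusted_rand_index(true_labels, pred_labels),
          purity(true_labels, pred_labels))
```

All three functions depend only on how cells are grouped, so relabelling the predicted clusters (for instance swapping 0 and 1) leaves the scores unchanged.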
[Figure 2: (a) t-SNE scatter plots (TSNE 1 vs. TSNE 2) of cells from Group1 to Group5 for the Complete, Raw, Bfimpute, scImpute, SAVER, VIPER, DrImpute and MAGIC data; (b) values (0 to 1) of the adjusted Rand index, Jaccard index, nmi and purity for k = 5, 6, 7 and 8, for Raw, Bfimpute, scImpute, SAVER, VIPER, DrImpute and MAGIC.]
Fig. 2. Bfimpute recovers dropout values and improves cell type identification in the simulated data. a. The scatter plots show the first two dimensions of the t-SNE results calculated from the complete data, the raw data, and the imputed data by Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. k represents the number of cell clusters in the simulated data. The adjusted Rand index, Jaccard index, nmi, and purity scores of clustering results are based on the raw and imputed data.

Bfimpute improves both visualization and cell type identification
PCA and t-distributed stochastic neighbor embedding (t-SNE) [31, 24] are two popular dimensionality reduction techniques often used to visualize high-dimensional scRNA-seq datasets. Since the dropout values are unknown in real datasets, we first tested the accuracy of all the imputation methods using a simulated dataset where the ground truth was known.
We applied the Splatter method to generate simulated datasets, which reproduces many features observed in scRNA-seq data, including zero-inflation, gene-wise dispersion, and differing sequencing depths between cells. To test the strength and robustness of different imputation methods, we simulated a range of datasets that include 5, 6, 7 and 8 different cell types (Methods section). Bfimpute achieved the most compact and well separated clusters on the simulation, followed by scImpute and DrImpute (Figure 2). For all simulations with different numbers of cell types, we also evaluated clustering performance with the evaluation metrics, where Bfimpute achieved the best scores for adjusted Rand index, Jaccard index, normalized mutual information and purity score compared to the raw data and the other five imputation methods (Methods section).

We further used two real datasets for this analysis, and the first two principal components (PCs) from PCA were plotted to compare every dataset across seven different conditions: the raw dataset, and six imputed ones obtained with the Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC methods. We first applied all imputation methods to a real scRNA-seq dataset from a human embryonic stem (ES) cell differentiation study [2] to demonstrate the capacity of Bfimpute for improving the performance of data visualization. The dataset contains 1018 single cells from seven cell groups: neuronal progenitor cells (NPCs), definitive endoderm cells (DECs), endothelial cells (ECs) and trophoblast-like cells (TBs) are progenitors differentiated from H1 human ES cells. H9 human ES cells and human foreskin fibroblasts (HFFs) were used as control cells. The raw dataset (i.e. without imputation) clearly identified the cluster of HFF cells; however, five other cell types were clustered very closely.
After imputation by Bfimpute, the homogeneous subpopulations of H1 and H9 human ES cells substantially overlapped and were well separated from the rest of the progenitors. The DECs, ECs, HFFs, NPCs and TBs were also compactly clustered and well separated on the PCA plot (Figure 3a). Compared with the raw dataset, SAVER, VIPER and DrImpute showed no significant improvement in cell group identification. scImpute was the second best and generated compact cell groups similar to those of Bfimpute. We then compared the results of spectral clustering [22] on the first two PCs to demonstrate the capability of Bfimpute to improve clustering accuracy in cell type identification. For the true labels, we had seven cell types for this dataset, and we evaluated the clustering results with four metrics: adjusted Rand index, Jaccard index, normalized mutual information (nmi), and purity (Methods section). All four metrics suggested that Bfimpute achieved the best clustering accuracy compared with the raw data and the other five imputation methods (Figure 3b).

Fig. 3. Bfimpute improves PCA visualization and cell type identification. a. The first two PCs calculated from the raw data and from the data imputed by Bfimpute, scImpute, VIPER, DrImpute, MAGIC, and SAVER. b. The adjusted Rand index, Jaccard index, nmi, and purity scores of clustering results based on the raw and imputed data. c. Average Pearson correlations between any two cells of the same type and of different types. (Panels: Raw, Bfimpute, scImpute, SAVER, VIPER, DrImpute, MAGIC; axes: PCA 1, PCA 2; cell types: DEC, EC, H1, H9, HFF, NPC, TB.)

We also compared visualization performance using t-SNE. t-SNE on the raw dataset identified the seven cell types better than PCA did. Bfimpute, DrImpute and SAVER further separated the different cell groups and improved the visualization, whereas the remaining imputation methods produced worse t-SNE results than the raw data (Supplementary Figure 1). To illustrate the recovery of dropouts in individual cells by imputation, we calculated the Pearson correlation of log10-transformed read counts between every pair of cells of the same type and between every pair of cells of different types. The results indicated that imputation did recover the zero counts in every cell: the same-type Pearson correlation increased from 0.70 in the raw data to 0.87 for Bfimpute, 0.85 for scImpute, 0.72 for SAVER, 0.73 for VIPER, 0.78 for DrImpute, and 0.97 for MAGIC (Figure 3c, blue bars). A scatter plot for two randomly selected stem cells of the same cell type is shown in Supplementary Figure 2. As expected, imputation methods usually increased the Pearson correlation between any two cells of the same cell type. However, imputation should not increase the correlation between cells of different cell types, because doing so disregards the biological variation between them.
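The same-type versus different-type comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function names and the toy counts are hypothetical, and a pseudocount of 1 is assumed before the log10 transform so that zero counts are defined.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant expression profile
    return cov / (sx * sy)


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def mean_pairwise_correlation(counts, labels):
    """Return (mean same-type r, mean different-type r) over all cell pairs.

    counts: list of per-cell read-count vectors (one value per gene).
    labels: list of cell-type labels, one per cell.
    """
    # Assumption: log10(count + 1), so that dropouts (zeros) map to 0.
    logged = [[math.log10(c + 1) for c in cell] for cell in counts]
    same, diff = [], []
    for i, j in combinations(range(len(logged)), 2):
        r = pearson(logged[i], logged[j])
        if math.isnan(r):
            continue
        (same if labels[i] == labels[j] else diff).append(r)
    return _mean(same), _mean(diff)


# Hypothetical toy data: two cell types, four genes.
toy_counts = [
    [10, 0, 5, 100],
    [12, 1, 4, 90],
    [0, 50, 30, 2],
    [1, 40, 35, 0],
]
toy_labels = ["H1", "H1", "DEC", "DEC"]
same_r, diff_r = mean_pairwise_correlation(toy_counts, toy_labels)
```

A method that preserves biological variation should raise `same_r` without pulling `diff_r` up with it, so the gap between the two values is the quantity of interest.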
Among all imputation methods, MAGIC achieved the highest correlation within the same cell type, but its correlation between different cell types was also the highest (Figure 3c, red bars). Bfimpute demonstrated the best balance, maximizing the difference between the correlation for the same cell type and the correlation for different cell types. We further investigated Bfimpute’s performance in visualization and cell type identification on another scRNA-seq dataset, from zebrafish [3]. This dataset contains 246 single cells from six cell groups, of which hematopoietic stem and progenitor cells (HSPCs) and HSPCs/thrombocytes come from one defined cell type with expected heterogeneity. After the QC step, the zebrafish dataset was still sparse, with zeros composing over 87.5% of the total counts. The comparison of visualization performance via PCA on the raw and six imputed datasets is shown in Supplementary Figure 3. The raw dataset only roughly identified the cluster of neutrophil cells, whereas cells from the other cell types were mixed and widely spread. After imputation by Bfimpute, four distinct immune cell subpopulations could be identified (neutrophils, T, Natural Killer (NK) and B cells), and the cluster members were much more compact than in the raw dataset. Neutrophils, T, NK and B cells were distantly positioned on the PCA plot. Because HSPCs and HSPCs/thrombocytes come from one defined cell type with expected heterogeneity, they remained spatially closer to each other than to other cells after Bfimpute’s imputation (Supplementary Figure 3a).
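The clustering comparisons in this section score the agreement between true cell types and cluster assignments. As a minimal sketch of two of the four metrics named above (purity and the adjusted Rand index), assuming at least two cells and using hypothetical function names and toy labels rather than the authors' implementation:

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def purity(true_labels, cluster_ids):
    """Fraction of cells that carry the majority true type of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_ids):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(cnt.values()) for cnt in by_cluster.values())
    return majority / len(true_labels)


def adjusted_rand_index(true_labels, cluster_ids):
    """Adjusted Rand index from pair counts in the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, cluster_ids))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_clust = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    expected = sum_true * sum_clust / comb(n, 2)
    max_index = (sum_true + sum_clust) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy example: one H1 cell is placed in the DEC cluster.
truth = ["H1", "H1", "H1", "DEC", "DEC", "DEC"]
clusters = [0, 0, 1, 1, 1, 1]
toy_purity = purity(truth, clusters)
toy_ari = adjusted_rand_index(truth, clusters)
```

Purity rewards clusters dominated by a single type but does not penalize over-splitting, whereas the adjusted Rand index corrects for chance agreement, which is one reason to report several metrics side by side.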
The raw data and the data imputed by the other five imputation methods did not correctly identify the four immune cell subpopulations. Clustering accuracy under the four metrics was better for Bfimpute than for the other five imputation methods, and Bfimpute achieved a better correlation within the same cell type without losing variation between different cell types (Supplementary Figure 3b,c).

Bfimpute improves DE and pseudotime analysis

DE analysis is widely used in bulk RNA-seq data. Performing DE analysis for scRNA-seq data to reveal the stochastic nature of gene expression in single cells is challenging, since scRNA-seq data suffers from frequent dropout events. However, it has been shown that good imputation methods can lead to better agreement between scRNA-seq and bulk RNA-seq data of the same biological condition on genes known to have little cell-to-cell heterogeneity. We utilized a real dataset by Chu et al [2] with both bulk and scRNA-seq data available on human embryonic stem cells and definitive endoderm cells (DEC) [32, 33], to compare Bfimpute with the raw dataset and the other five imputation methods for DE analysis. This dataset contained six samples of bulk RNA-seq (four in H1 ES cells and two in DEC) and 350 samples of scRNA-seq (212 in H1 ES cells and 138 in DEC). The percentages of zero entries were 8.8% in the bulk data and 44.9% in the scRNA-seq data. We first performed DE analysis on the bulk data and identified the top 200 DE genes with DESeq2 [10]. We then plotted the expression profiles of these 200 genes in the scRNA-seq data for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. We found that the expression profiles of these top 200 genes after Bfimpute’s imputation showed better concordance with those in the bulk data (Figure 4a).
To further evaluate whether imputation improves DE analysis in scRNA-seq data, we first used DESeq2 to identify DE genes for the raw scRNA-seq dataset and for the scRNA-seq datasets after each of the six imputations. We then generated different lists of DE genes for the bulk data by applying different thresholds to the false discovery rates of genes. Finally, for every threshold, we compared the DE genes for the bulk data with those for the scRNA-seq data under the seven conditions and calculated the AUC value for each condition. The AUC values suggested that all imputation methods improved DE analysis. Bfimpute generated the DE genes most consistent with the bulk data (AUC values raw: 0.568, Bfimpute: 0.670, scImpute: 0.665, SAVER: 0.624, VIPER: 0.639, DrImpute: 0.657 and MAGIC: 0.668). Bulk data for the same biological condition was available and could be used as a gold standard against which to compare the average gene expression level in the scRNA-seq data, even though the scRNA-seq data presented more cell-to-cell variation. We expected the average gene expression level in the scRNA-seq data to be highly correlated with the bulk RNA-seq data. To investigate this, we plotted correlations between gene expression in single-cell and bulk data and found that all imputation methods improved the correlation between bulk and scRNA-seq data, with Bfimpute, MAGIC and scImpute showing the largest improvement (Supplementary Figure 4). We further selected several genes (e.g., ANGPT1, GDF3, BMP4, EPB41L5) of DECs from different time points to plot their average gene expression levels in both bulk and scRNA-seq data. These genes were annotated with the GO term “endoderm development”, and they were likely to be affected by dropout events [13, 34]. Read counts imputed for these genes by Bfimpute showed higher gene expression correlation and better consistency with the bulk data (Figure 4b and Supplementary Figures 5, 6).
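The text does not spell out how the AUC against bulk DE calls was computed. One plausible reading, sketched here purely as an assumption, treats the bulk DE genes at a given FDR threshold as the gold standard, ranks genes by their single-cell p-values, and computes the ROC AUC with the Mann-Whitney statistic. The function names, gene identifiers, and numbers below are hypothetical.

```python
def roc_auc(scores, is_positive):
    """ROC AUC via the Mann-Whitney U statistic; ties count as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    # Quadratic in the number of genes: fine for a sketch, slow at scale.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_against_bulk(bulk_fdr, sc_pvalue, fdr_threshold):
    """AUC of single-cell DE evidence against bulk DE calls at one cutoff.

    bulk_fdr:  dict gene -> adjusted p-value from the bulk DE analysis.
    sc_pvalue: dict gene -> p-value from the single-cell DE analysis.
    """
    genes = sorted(set(bulk_fdr) & set(sc_pvalue))
    truth = [bulk_fdr[g] <= fdr_threshold for g in genes]
    # A smaller p-value is stronger evidence, so negate it to get a score.
    scores = [-sc_pvalue[g] for g in genes]
    return roc_auc(scores, truth)


# Hypothetical values for four genes.
bulk_fdr = {"g1": 0.001, "g2": 0.01, "g3": 0.2, "g4": 0.8}
sc_pvalue = {"g1": 0.02, "g2": 0.3, "g3": 0.1, "g4": 0.9}
toy_auc = auc_against_bulk(bulk_fdr, sc_pvalue, fdr_threshold=0.05)
```

Repeating the call over a grid of `fdr_threshold` values and averaging would give one summary number per imputation method, although whether the reported values were aggregated that way is not stated in the text.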
In addition to the DE analysis, we also used the time-course scRNA-seq data [2] from the same Chu et al study to show that Bfimpute improved the recovery of temporal gene expression dynamics in pseudotime analysis. In this dataset, a total of 758 single cells were captured and profiled by scRNA-seq at 0, 12, 24, 36, 72, and 96 h of differentiation. We first applied Bfimpute, scImpute and DrImpute to the raw scRNA-seq data with true cell type labels, and then studied how the time-course expression patterns changed in the imputed data. The PCA results showed that read counts imputed by Bfimpute better distinguished cells of different time points and that the cell groups of the six time points were compact (Supplementary Figure 7a). The first principal component from PCA indicated that read counts imputed by Bfimpute reflected more accurate transcriptome dynamics along the time course (Figure 4c). Bfimpute could better differentiate the last two time points (72 h and 96 h). In the next section, we discuss imputation with the aid of cell type labels in more detail.

Bfimpute improves performance with the aid of additional experimental information

Bfimpute, scImpute and DrImpute all first identify similar cells based on clustering and then perform imputation by leveraging the expression values from similar cells. Being able to first identify the appropriate cell groups enhances the ability to impute dropout events. A substantial number of scRNA-seq studies have identified cell types from experimental design or marker genes. We applied Bfimpute, scImpute and DrImpute to the raw scRNA-seq data with true cell type labels in the three real datasets used above and in two additional real datasets. SAVER, VIPER, and MAGIC were excluded from this analysis since they cannot make use of cell labels. We then investigated again the PCA and t-SNE visualizations for cell subpopulation identification.
Our results showed that Bfimpute outperformed the other two methods and clearly differentiated almost every cell group in the different datasets (Figure 5 and Supplementary Figures 7, 8). For the human embryonic stem cell dataset, Bfimpute additionally assigned three outlier cells to their correct groups compared with the previous imputation without cell labels (see Figure 5a versus Figure 3a: one EC (orange point), one DEC (blue point), and one NPC (yellow point) cell were brought back to the corresponding EC, DEC and NPC cell groups, respectively). H9 cells were also further apart from H1 cells in the vertical dimension. For the zebrafish dataset, even the most mixed B, NK and T cells (blue, green, and yellow colors) from the raw dataset were separated from each other after Bfimpute’s imputation, and HSPCs and HSPCs/thrombocytes cells were spatially close but split into two cell groups (Figure 5b and Supplementary Figure 7b).

Fig. 4. Bfimpute improves DE and pseudotime analysis. a. The expression profiles of the top 200 DE genes detected in the bulk data by DESeq2 for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. Time-course expression patterns of the example gene ANGPT1, which is annotated with the GO term “endoderm development”. The small black triangles mark the average bulk data for each time point. c. The first principal component is plotted to show cells of different time points along the differentiation. (Panel a: Bulk, Raw, Bfimpute, scImpute, SAVER, VIPER, DrImpute, MAGIC; cell types DEC and H1; color scale Log10(Expression+1). Panel b: Raw, Bfimpute, scImpute, SAVER. Panel c: Raw, Bfimpute, scImpute, DrImpute; time points 00h, 12h, 24h, 36h, 72h, 96h.)

To test Bfimpute with another kind of cell-label information, we used a human preimplantation embryonic development dataset (t-SNE and pseudotime analyses are shown in Figure 5c). The Petropoulos dataset [4] included single cells from five stages of human preimplantation embryonic development, ranging from developmental day (E) 3 to 7. The five stages were clearly distinguished from each other after Bfimpute’s imputation. We also applied the three imputation methods to a large 10X dataset generated by the high-throughput droplet-based system. To generate this dataset, we randomly selected 500 cells from each of nine immune cell types, so it contained a total of 4500 peripheral blood mononuclear cells (PBMCs) [12, 5]. In the raw data, 98.3% of read counts are exactly zero. Our PCA and t-SNE results indicated that Bfimpute’s imputation identified the nine immune cell types from the raw data (Figure 5d). In summary, these results suggested that Bfimpute with the aid of labels consistently further improved visualization, identification of cell subpopulations, and the downstream analysis. SCRABBLE [17] is another recent approach that integrates bulk data to impute dropout events in scRNA-seq data. Since Bfimpute can easily incorporate bulk data as additional information into the gene latent matrix, we also tested whether bulk data can further improve performance.
In the scRNA-seq dataset of human embryonic stem cells with bulk data, we did not observe significant differences between Bfimpute and Bfimpute with bulk data as additional information (Supplementary Figure 8 versus Figure 5a). The reason could be that similar gene-level information has less effect than similar cell-level information on the imputation of dropout events. We also found that in these scRNA-seq datasets, SCRABBLE’s performance after integrating cell label information with bulk data was not better than that of Bfimpute (Supplementary Figure 8).

Fig. 5. Bfimpute with labels improves PCA and t-SNE visualizations and cell type identification. a. The first two PCs calculated from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human embryonic stem cell differentiation study. b. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the zebrafish data. c. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human preimplantation embryonic development. d. The first principal component is plotted to show cells of different time points along the embryonic development. (Panel titles: Human embryonic stem cell differentiation; Zebrafish; Human preimplantation embryonic development; Peripheral blood mononuclear cells (PBMCs). Columns: Raw, Bfimpute, scImpute, DrImpute. Axes: PCA 1 and PCA 2, or TSNE 1 and TSNE 2.)

Discussion and Conclusion

ScRNA-seq has become an indispensable tool in recent years, as it has made it possible to study genome-wide transcriptomes at single-cell resolution. Owing to technical limitations of sequencing, a large proportion of dropout events exist in scRNA-seq data, which limits its usefulness. Several approaches have been proposed to solve this problem, with modest results. In this study, we introduced Bfimpute to recover dropout events in scRNA-seq data. We have shown that Bfimpute can improve performance in recovering gene expression detected by bulk RNA-seq, as well as in downstream analyses, including identification of cell subpopulations, differentially expressed genes and temporal dynamics of gene expression. Bfimpute uses a fully Bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing Gibbs sampling for the approximate inference. The advantage of this Bayesian model is that it provides a predictive distribution instead of just a single number when recovering each dropout event, so the confidence in the prediction can be quantified and incorporated into the model. The use of a fully Bayesian model proved to be a considerable advantage that allowed Bfimpute to outperform other imputation methods. Bfimpute estimates two latent matrices, one for cells and one for genes, for each cell group through a Gibbs sampling process, and reaches a stationary state to generate the final cell-gene expression matrix, in which the dropout events are recovered. Another advantage of Bfimpute is its ability to integrate any gene- or cell-related information about the scRNA-seq data into these two latent gene and cell matrices to impute missing values. Information from similar cells, from bulk data, or from both can be easily integrated into our model. Even though scImpute and DrImpute have similar functionality in this respect, which allows them to impute dropout events with the aid of the number of cell types or cell labels, they fail to achieve performance as good as Bfimpute’s for most of the scRNA-seq datasets that we tested. Any resource provided by users at the cell level or gene level could be used as additional information to improve the imputation of dropout events in scRNA-seq data in the future.

Key Points
• Imputation to recover dropout events for scRNA-seq data is important for determining genome-wide transcriptomes at single-cell resolution.
• Bfimpute uses a fully Bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing Gibbs sampling for approximate inference.
• The advantage of this Bayesian model is that it provides a predictive distribution instead of just a single number when recovering each dropout event, so the confidence in the prediction can be quantified and incorporated into the model.
• Bfimpute is able to integrate any gene- or cell-related information about scRNA-seq data into these two latent gene and cell matrices to impute missing values.
• Bfimpute achieves better accuracy than six other widely used scRNA-seq imputation methods on simulated and real scRNA-seq data, as measured by several different evaluation metrics.

Competing interests
The authors declare no competing interests.

Author contributions
X.Z. conceived and led this work. Z.H.W. and X.Z. designed the model and implemented the Bfimpute software. Z.H.W., J.L.L., W.S. and X.Z. led the data analysis. Z.H.W., W.S. and X.Z. wrote the paper with feedback from J.L.L. and L.Z.

Funding
This work was supported by Vanderbilt University Development Funds (FF 300033). L.Z. is partially supported by Research Grant Council Early Career Scheme (HKBU 22201419).

References
1. Fuchou Tang, Catalin Barbacioru, Yangzhou Wang, Ellen Nordman, Clarence Lee, Nanlan Xu, Xiaohui Wang, John Bodeau, Brian B Tuch, Asim Siddiqui, et al. mrna-seq whole-transcriptome analysis of a single cell. Nature methods, 6(5):377–382, 2009. 2. Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17(1):173, 2016. 3. Qin Tang, Sowmya Iyer, Riadh Lobbardi, John C Moore, Huidong Chen, Caleb Lareau, Christine Hebert, McKenzie L Shaw, Cyril Neftel, Mario L Suva, et al. Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using rna sequencing. Journal of Experimental Medicine, 214(10):2875–2887, 2017. 4.
Sophie Petropoulos, Daniel Edsgärd, Björn Reinius, Qiaolin Deng, Sarita Pauliina Panula, Simone Codeluppi, Alvaro Plaza Reyes, Sten Linnarsson, Rickard Sandberg, and Fredrik Lanner. Single-cell rna-seq reveals lineage and x chromosome dynamics in human preimplantation embryos. Cell, 165(4):1012–1026, 2016. 5. Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):1–12, 2017. 6. Peng Qiu. Embracing the dropouts in single-cell rna-seq analysis. Nature communications, 11(1):1–9, 2020. 7. Peter V Kharchenko, Lev Silberstein, and David T Scadden. Bayesian approach to single-cell differential expression analysis. Nature methods, 11(7):740–742, 2014. 8. Ingrid Lönnstedt and Terry Speed. Replicated microarray data. Statistica sinica, pages 31–46, 2002. 9. Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Nature Precedings, pages 1–1, 2010. 10. Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014. 11. Oliver Stegle, Sarah A Teichmann, and John C Marioni. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3):133–145, 2015. 12. Wei Vivian Li and Jingyi Jessica Li. An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature communications, 9(1):1–9, 2018. 13. Wuming Gong, Il-Youp Kwak, Pruthvi Pota, Naoko Koyano-Nakagawa, and Daniel J Garry. Drimpute: imputing dropout events in single cell rna sequencing data. BMC bioinformatics, 19(1):1–10, 2018. 14. David Van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J Carr, Cassandra Burdziak, Kevin R Moon, Christine L Chaffer, Diwakar Pattabiraman, et al. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018. 15. Mo Huang, Jingshu Wang, Eduardo Torre, Hannah Dueck, Sydney Shaffer, Roberto Bonasio, John I Murray, Arjun Raj, Mingyao Li, and Nancy R Zhang. Saver: gene expression recovery for single-cell rna sequencing. Nature methods, 15(7):539–542, 2018. 16. Mengjie Chen and Xiang Zhou. Viper: variability-preserving imputation for accurate gene expression recovery in single-cell rna sequencing studies. Genome biology, 19(1):1–15, 2018. 17. Tao Peng, Qin Zhu, Penghang Yin, and Kai Tan. Scrabble: single-cell rna-seq imputation constrained by bulk rna-seq data. Genome biology, 20(1):88, 2019. 18. Jaak Simm, Adam Arany, Pooya Zakeri, T Haber, Jörg K Wegner, V Chupakhin, Hugo Ceulemans, and Yves Moreau. Macau: Scalable bayesian factorization with high-dimensional side information using mcmc. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017. 19. Andriy Mnih and Russ R Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008. 20. Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using markov chain monte carlo. In Proceedings of the 25th international conference on Machine learning, pages 880–887, 2008. 21. Robrecht Cannoodt, Wouter Saelens, and Yvan Saeys. Computational methods for trajectory inference from single-cell transcriptomics.
European journal of immunology, 46(11):2496–2506, 2016. 22. Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, and Myles J Lewis. Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics, 36(4):1159–1166, 2020. 23. Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002. 24. Davis J McCarthy, Kieran R Campbell, Aaron TL Lun, and Quin F Wills. Scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics, 33(8):1179–1186, 2017. 25. Luke Zappia, Belinda Phipson, and Alicia Oshlack. Splatter: simulation of single-cell rna sequencing data. Genome biology, 18(1):1–15, 2017. 26. Leslie C Morey and Alan Agresti. The measurement of classification agreement: An adjustment to the rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33–37, 1984. 27. Paul Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist, 11(2):37–50, 1912. 28. Alexander Strehl and Joydeep Ghosh. Cluster ensembles— a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617, 2002. 29. William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971. 30. Silke Wagner and Dorothea Wagner. Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik Karlsruhe, 2007. 31. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. 32. Pei Wang, Ryan T Rodriguez, Jing Wang, Amar Ghodasara, and Seung K Kim. Targeting sox17 in human embryonic stem cells creates unique strategies for isolating and analyzing developing endoderm. Cell stem cell, 8(3):335–346, 2011. 33. 
Pei Wang, Kristen D McKnight, David J Wong, Ryan T Rodriguez, Takuya Sugiyama, Xueying Gu, Amar Ghodasara, Kun Qu, Howard Y Chang, and Seung K Kim. A molecular signature for purified definitive endoderm guides differentiation and isolation of endoderm from mouse and human embryonic stem cells. Stem cells and development, 21(12):2273–2287, 2012. 34. Judith A Blake, Janan T Eppig, James A Kadin, Joel E Richardson, Cynthia L Smith, Carol J Bult, and Mouse Genome Database Group. Mouse genome database (mgd)-2017: community knowledge resource for the laboratory mouse. Nucleic acids research, 45(D1):D723–D729, 2017.

10_1101-2021_02_10_430656 ---- A like-for-like comparison of lightweight-mapping pipelines for single-cell RNA-seq data pre-processing
Mohsen Zakeri1, Avi Srivastava2, Hirak Sarkar3, and Rob Patro1,*
1Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA 2New York Genome Center and NYU Center for Genomics and Systems Biology, New York City, NY, USA 3Harvard Medical School, Boston, Massachusetts, USA
*Correspondence: rob@cs.umd.edu

Abstract: Recently, Booeshaghi and Pachter (1) published a benchmark comparing the kallisto-bustools pipeline (2) for single-cell data pre-processing to the alevin-fry pipeline (3). Their benchmarking adopted drastically dissimilar configurations for these two tools, and overlooked the time- and space-frugal configurations of alevin-fry previously benchmarked by Sarkar et al. (3). In this manuscript, we provide a small set of modifications to the benchmarking scripts of Booeshaghi and Pachter that are necessary to perform a like-for-like comparison between kallisto-bustools and alevin-fry. We also address some misuses of the alevin-fry commands and include important data on the exact reference transcriptomes used for processing (see footnote 1). Using the same benchmarking scripts of Booeshaghi and Pachter (1), we demonstrate that, when configured to match the computational complexity of kallisto-bustools as closely as possible, alevin-fry processes data faster (∼2.08 times as fast on average) and uses less peak memory (∼0.34 times as much on average) compared to kallisto-bustools, while producing results that are similar when assessed in the manner done by Booeshaghi and Pachter (1). This is a notable inversion of the performance characteristics presented in the previous benchmark.
Keywords: RNA-seq, single-cell RNA-seq, quantification

Introduction
Alevin-fry (3) is a new pipeline for single-cell RNA-seq pre-processing, which is currently being developed.
While there are many relevant design decisions and performance implications we hope to discuss in detail in the preprint describing alevin-fry, one crucial aspect motivating the development of the alevin-fry pipeline is to allow testing the effect of different algorithmic choices on the gene expression estimates eventually produced by the pipeline. For example, alevin-fry exposes both a selective-alignment (4, 5) mode and a pseudoalignment (6) with structural constraints mode for mapping reads. Further, after read mapping, the alevin-fry tool exposes multiple algorithms for generating a permit list (sometimes called a “whitelist”) of barcodes corresponding to what are believed to be high-confidence cells, and for resolving UMIs into counts. When applying any of the probabilistic methods it implements for UMI resolution, alevin-fry also allows assessing quantification uncertainty in the estimated counts via a bootstrapping procedure that can output either the bootstrap samples or their summary statistics. Exploring these different algorithms in a unified framework is an important task to optimize the pre-processing of single-cell sequencing data, and there may not be a single algorithm that is best suited to all different single-cell technologies. For example, while the benefits of selective-alignment and the use of an expanded index in the processing of bulk RNA-seq data have been highlighted in a growing number of scenarios (4, 7, 8), these tradeoffs have not been thoroughly explored in the context of single-cell (and particularly tagged-end) data.

[Footnote 1: The authors later updated their repository to contain a link to a deposition with the reference data they used, but that information was not available in the original repository commit d0549e0351c4875428b153c7804f21eed7fa82eb at the time the preprint was published, and our framework was in place by the time this update was made. This is further described in "Methods."]
Given that the majority of common tagged-end single-cell analyses are performed at the gene rather than transcript level, and in light of the extensive use of techniques like unique molecular identifier (UMI) tagging, it may be the case that different tradeoffs in mapping specificity versus speed are appropriate or desirable — indeed, an argument for simpler but faster methods in this space has been made by Melsted et al. (2) and in subsequent work by the same authors. Likewise, the effect of different approaches for UMI error correction and UMI resolution (and how they may interact with different read mapping strategies) has not been thoroughly evaluated across many different single-cell technologies, to understand if, and when, different approaches may lead to different results in downstream analysis. In the alevin-fry poster (3), we described the results of benchmarking STARsolo (9), kallisto-bustools (2) and alevin-fry (3), running the latter tool with a number of different configurations of read mapping algorithm and UMI resolution algorithm. We observed that the “fast” configurations of alevin-fry tested in (3), which adopt some of the major simplifications argued for by Melsted et al. (2), are faster than kallisto-bustools, and that all of the configurations tested there use less peak memory. The recent preprint of Booeshaghi and Pachter (1) omits all of the fast and memory-frugal configurations tested in Sarkar et al. (3), and instead compares the time and memory requirements of only the most computationally- and memory-intensive configuration of the alevin-fry pipeline to the kallisto-bustools pipeline. We are encouraged that others in the community are eager to try out new tools like alevin-fry for the pre-processing of single-cell data, and we recognize that fairly comparing new pipelines to existing ones can be a difficult task in the absence of sufficient documentation and tutorials. 
Admittedly, we have not yet produced sufficient tutorials or documentation for alevin-fry, given that our efforts have focused on continuing to develop the tool itself. At the same time, it is not possible to “faithfully” follow recommended practice (1) when the best practices have not yet been established for a fledgling method; in such a case, benchmarking multiple configurations (especially those that have already been tested in previous benchmarks (3)) may be a reasonable approach. Spurred by Booeshaghi and Pachter (1), we have now created a simple-to-follow tutorial for speed-optimized single-cell pre-processing using alevin-fry (https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/). Here, we benchmark this workflow for alevin-fry (using the same versions of salmon (1.4.0) and alevin-fry (0.1.0) [footnote 2], and kallisto(0.46.2)-bustools(0.40.0) adopted in (1)). [Zakeri et al. | bioRχiv | February 10, 2021 | 1–7. bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.10.430656; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).] Methods In order to assess how the results of the benchmark proposed by Booeshaghi and Pachter (1) change when a like-for-like comparison between alevin-fry and kallisto-bustools is carried out, we start with the experimental framework introduced in that preprint and describe here the modifications that we made to the benchmarking scripts.
The first difference we note is the versions of the references used for quantification. The repository provided by Booeshaghi and Pachter (1) (https://github.com/pachterlab/BP_2021/ at commit d0549e0351c4875428b153c7804f21eed7fa82eb, which was the version available when the preprint was published) lacked both the specific reference sequences used and the URLs from which these reference sequences were obtained [footnote 3]. The paper refers the reader to (2), wherein the relevant metadata is contained in an Excel file, which still lacks adequate specificity (e.g. it lists the Caenorhabditis elegans transcriptome used only as “modified ws260”). Thus, in this manuscript, we have adopted the following procedure for normalizing the reference transcriptomes. For human, mouse, and combined human/mouse data, we have used the latest reference bundles provided by 10x Genomics as of Jan 29, 2021 (named as “2020-A”), and extracted the transcriptomes from the provided genomes and respective GTF files using gffread (10). For all other organisms, we have adopted the latest Ensembl (11) reference transcriptomes for each organism. For Danio rerio, C. elegans, Drosophila melanogaster and Rattus norvegicus this is from release 102; for Arabidopsis thaliana it is from release 49. These updated reference transcriptomes lead, in some cases, to quite different memory usages from those reported in (1). For the alevin-fry pipeline, this is largely explained by the fact that we index the same reference sequences as used for kallisto (6) (that is, we do not compare indexing the transcriptome in kallisto to indexing the transcriptome and genome in salmon (12)). However, the increased memory usage of kallisto-bustools likely stems from variation in the specific reference transcriptomes used.
For example, using the current 10x reference transcriptome for GRCh38 (2020-A, from https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz), the peak memory usage of kallisto becomes ∼7GB during mapping (Fig. 1), rather than the ∼4GB reported in (1) (while the peak memory usage of salmon during mapping reaches ∼1.7GB). Our version of the repository contains a file called gather_refs.sh with the commands used to obtain these reference transcriptomes. [Footnote 2: We do not use the modified version of alevin-fry 0.1.0 that Booeshaghi and Pachter (1) altered to convert encoded barcode identifiers in the generate-permit-list step into character strings, but instead the tagged 0.1.0 release with the rand crate additionally pinned at 0.7.3 to enable compilation.] [Footnote 3: A subsequent commit included a link to a deposition of the references they used, but our framework was in place and benchmarking underway by the time that commit was made. Further, it is informative to see how even modest changes in the specific reference used can lead to large changes in the memory requirements of a tool.] Furthermore, the following additional modifications have been made to the benchmarking, which otherwise remains the same as was performed in (1). We run alevin with the --sketch flag when producing the mapping file (called a RAD file); this uses pseudoalignment (6) with structural constraints rather than selective-alignment [footnote 4]. We do not consider a configuration of either kallisto-bustools or alevin-fry that corrects to or uses the full 10x permit list. In the original benchmark, the authors used alevin-fry with the -b flag, treating the full list of 10x barcodes as a filtered permit list; however, the -b flag is meant to accept a list of barcodes corresponding to high-confidence cells that have passed external filtering.
Passing the full 10x barcode list to the -b flag is neither intended nor currently supported in alevin-fry (though we are planning to add this functionality), which we have now clarified in the documentation — as we had previously clarified this same point in the alevin (13) documentation. We have both methods generate their own permit list, and perform quantification on their corresponding filtered cells. We pass the -d fw flag to alevin-fry’s generate-permit-list step rather than -d either, as the RAD file records the orientation of each read with respect to the target transcript, and all the technologies evaluated here expect the second read to map to the transcriptome in the forward orientation. Mappings in an unexpected orientation should be filtered. We have used the cr-like resolution strategy when invoking the alevin-fry quant command; this implements a simple but fast UMI resolution algorithm that breaks ties by UMI frequency alone and discards reads for which a most frequent unique gene cannot be determined. We have also removed the step of the pipeline that converts the RAD (respectively BUS) file into a text format. The binary to text conversion may be useful for debugging purposes, but is not a standard or necessary part of these pre-processing pipelines, as the BUS and RAD files are primarily intended for the storage and processing of data rather than human inspection. Further, contrary to the supposition of Booeshaghi and Pachter (1), this conversion is likely a case where language choice, and usage of standard language idioms, leads to different performance characteristics. Unlike C++, Rust places the standard output stream behind a lock to ensure threadsafe access, a decision that imposes a cost for programs that are heavy on writing to the standard output stream in a line-oriented manner when standard idioms are used. 
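As an illustration of the cr-like resolution strategy mentioned above, the following is a minimal Python sketch of the rule as the text describes it: within one cell, each UMI is assigned to its most frequently observed gene, and a UMI is discarded when no unique most frequent gene exists. This is not alevin-fry's actual implementation; the function name, the input layout (a list of (UMI, gene) pairs for a single corrected barcode) and the example reads are all hypothetical, and real mapping records may carry more information than a single gene per read.

```python
from collections import Counter, defaultdict


def resolve_umis_cr_like(records):
    """Count molecules per gene for one cell (hypothetical helper).

    records: iterable of (umi, gene) pairs, one pair per mapped read.
    Each UMI contributes one count to its most frequent gene; UMIs whose
    top read count is shared by two or more genes are discarded.
    """
    reads_per_umi = defaultdict(Counter)
    for umi, gene in records:
        reads_per_umi[umi][gene] += 1

    gene_counts = Counter()
    for per_gene in reads_per_umi.values():
        ranked = per_gene.most_common(2)
        # A tie for first place means no unique most frequent gene.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        gene_counts[ranked[0][0]] += 1
    return dict(gene_counts)


# Hypothetical reads for a single cell.
example_records = [
    ("AAAC", "g1"), ("AAAC", "g1"), ("AAAC", "g2"),  # g1 wins 2 to 1
    ("CCGT", "g1"), ("CCGT", "g2"),                  # tie, discarded
    ("GGTA", "g3"),                                  # single gene
]
print(resolve_umis_cr_like(example_records))
```

In this toy input the tied UMI contributes nothing, which reflects the trade-off the text describes: the rule is simple and fast, at the price of dropping reads that a parsimony-based or probabilistic strategy might still assign.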
While we do not view the optimization of the command that dumps a RAD file to text as particularly high-priority, we will nonetheless explore making use of unsafe C system calls in this command until a comparable solution is exposed natively in Rust. The benchmarking scripts used to produce the results described here can be found at https://github.com/COMBINE-lab/BP_2021-lfl (these are the same as the benchmarking scripts of https://github.com/pachterlab/BP_2021 at commit d0549e0351c4875428b153c7804f21eed7fa82eb with the modifications described above). We encourage users to run these benchmarks for themselves, and welcome feedback and suggestions. [Footnote 4: This sketch mode was evaluated in detail in the poster of Sarkar et al. (3), where its scalability was assessed and its mappings were paired with a number of different UMI resolution strategies.] Despite the additions and modifications we describe here, neither our repository nor the original repository of Booeshaghi and Pachter (1) enables full reproducibility without non-trivial effort or investigation. One complication is that there existed multiple candidate scripts for performing specific steps of the data analysis within different directories of the repository, and none had complete
or adequate instructions for generating the plots. [Fig. 1. The time and memory used by the relevant steps of the alevin-fry and kallisto-bustools pipelines for pre-processing the 20 diverse tagged-end single-cell RNA-seq datasets used in (1). The plots are generated using the analysis/notebooks/memtime.ipynb notebook. Panels, given as kallisto-bustools command / alevin-fry command and purpose, each showing time [min] and memory [GB] unless noted: kallisto index / salmon index (build mapping index); kallisto bus / salmon alevin --rad --sketch (perform mapping); BUS file / RAD file (store result of mapping; size [GB]); bustools sort + bustools whitelist / alevin-fry generate-permit-list --knee-distance (error correction of barcodes); bustools correct + bustools sort / alevin-fry collate (enable streaming); bustools count / alevin-fry quant (generate count matrix); kallisto + bustools pipeline / salmon + alevin pipeline (process single cell data; time and memory against number of reads). Datasets: mouse-SRR8599150_v2, worm-SRR8611943_v2, mouse-SRR6998058_v2, human_mouse-hgmm1k_v3, human-pbmc1k_v3, human_mouse-hgmm1k_v2, mouse-heart1k_v3, mouse-SRR8206317_v2, mouse-heart1k_v2, human-SRR8524760_v2, rat-SRR7299563_v2, fly-SRR8513910_v2, zebrafish-SRR6956073_v2, arabidopsis-SRR8257100_v2, human-SRR8327928_v2, mouse-EMTAB7320_v2, mouse-neuron10k_v3, mouse-SRR8639063_v2, human-pbmc10k_v3, human_mouse-hgmm10k_v3.] For example, there exist multiple versions of the run_gsea_bar_full.R script for performing gene set enrichment analysis, which each required building certain sub-directories in the main directory of the repository in order to be executed without any errors. Eventually, we used the run_gsea_bar_full.R script located within analysis/notebooks rather than the one located in analysis/scripts/code, since the latter version had hard-coded paths and no central way to uniformly and globally change the working directory (e.g. https://github.com/pachterlab/BP_2021/blob/e87e98713bf7967d2fa22716dbbebd10609c1dd9/analysis/scripts/code/gsea_bar_full.R#L39). After providing the required data, we ran mkdata.py and mkplot.py within the analysis/scripts/code directory to prepare the plots for comparing the gene count estimates provided by both tools. Furthermore, since we benchmarked an unmodified version of alevin-fry, we had to modify the mkdata.py script to load a single column file as alevin’s permit list (which we took from the quants_mat_rows.txt file accompanying each cell by gene count matrix), and also to remove the lines which were intended for dealing with decoy aware results. For producing the
time and memory plots, we used the memtime.ipynb notebook located in analysis/notebooks after making the required modifications to compare the time and memory of the relevant steps in both tools. While we have addressed any issues as we encountered them, and have documented how we have run this pipeline, we have not undertaken the effort of fully removing all barriers to “trivial” reproducibility, as it is outside the scope of the current work. [Fig. 2. A comparison of the resulting count matrices obtained from alevin-fry and kallisto-bustools, as run in this manuscript, for the pbmc_10k_v3 dataset. Panels A-H have the same interpretation as in Fig. 2 of Booeshaghi and Pachter (1), and compare the count matrices at the gene and cell levels. The plots are generated using the analysis/scripts/mkplots.py, analysis/scripts/mkdata.py and analysis/notebooks/run_gsea_bar_full.R scripts.] Finally, we also note that the benchmark of Booeshaghi and Pachter (1) focuses only on comparing kallisto-bustools to a single configuration of alevin-fry, excluding other relevant tools like STARsolo (9), which is a fast, flexible, and popular tool for the pre-processing of tagged-end single-cell data. The benchmark also omits another recently published, lightweight-mapping based tool, Raindrop (14) (though, seemingly, this would currently have to be restricted to 10x chromium v2 data). A more extensive benchmark, including other tools, is likely to provide greater value to the broader community. However, the primary focus in this manuscript is to highlight the effect on the original benchmark that results from running the tools considered therein in a like-for-like configuration. Thus, we have not added STARsolo or Raindrop to the current benchmark, though doing so may provide a useful perspective on these tools to the broader community. All experiments were performed on a server with dual Intel Xeon CPUs (E5-2699 v4), each with 22 cores clocked at 2.20 GHz, 512 GB of 2.4GHz DDR4 memory, and an array of 8 3.6TB Toshiba MG03ACA4 HDDs configured as independent disks. Results Fig.
1 shows the overall time and peak memory taken by both alevin-fry and kallisto-bustools when pre-processing the 20 diverse tagged-end single-cell 10x chromium datasets evaluated in (1). Alevin-fry is faster than kallisto-bustools on all datasets (between ∼1.28 and ∼2.87 times as fast, and ∼2.08 times as fast on average). Also, alevin-fry uses less peak memory than kallisto-bustools on 19 of the 20 datasets tested, with the peak memory of kallisto-bustools ranging from ∼91% of that used by alevin-fry to ∼8 times that used by alevin-fry (kallisto-bustools used ∼2.92 times as much peak memory on average). In addition to the overall runtime and peak memory usage (bottom row of Fig. 1), the figure also shows the time and memory required for the main steps of the pipelines. While there is not a perfect correspondence between the specific set of commands used by the two tools, the fundamental steps include mapping the reads, generating a permit-list of valid filtered barcodes (with each method using its own algorithm to infer the filtered set of corrected barcodes), rearranging the mapping information for all records having the same corrected barcode so that they are adjacent in the resulting file, and applying a UMI resolution algorithm to obtain a gene-by-cell count matrix. Looking across the datasets, some general characteristics emerge. If one evaluates the ratio of the total runtime of kallisto-bustools to the total runtime of alevin-fry, one observes that alevin-fry is faster in the processing of every dataset, with a speedup (i.e. the ratio of the runtime of kallisto-bustools to the runtime of alevin-fry) ranging from ∼1.28 up to ∼2.87 (with an average runtime speedup of ∼2.08). If one evaluates the same ratio in terms of peak memory usage instead of total runtime, a similar trend emerges. In 19 of the 20 datasets tested here, the kallisto-bustools pipeline exhibits a higher peak memory usage than alevin-fry.
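The per-dataset ratios used throughout this section are straightforward to compute. The sketch below is illustrative only: the function name and dictionary layout are hypothetical, the input uses just the two approximate peak-memory figures quoted in this section (∼4.2GB versus ∼4.6GB for mouse-SRR8639063_v2, and ∼7.5GB versus ∼5GB for the hybrid human-mouse data) rather than the full set of 20 measurements, and treating the reported averages as means of per-dataset ratios is an assumption.

```python
from statistics import mean


def ratio_summary(kb, af):
    """Summarize per-dataset kallisto-bustools / alevin-fry ratios.

    kb and af map a dataset name to a measurement (runtime or peak
    memory) in the same units. A ratio above 1 means alevin-fry used
    less of the resource on that dataset.
    """
    ratios = {name: kb[name] / af[name] for name in kb}
    values = list(ratios.values())
    return {
        "ratios": ratios,
        "min": min(values),
        "max": max(values),
        "mean": mean(values),  # mean of per-dataset ratios (assumption)
        "n_alevin_fry_lower": sum(v > 1 for v in values),
    }


# Approximate peak memory in GB, as quoted in the text.
kb_mem = {"mouse-SRR8639063_v2": 4.2, "human_mouse (hybrid)": 7.5}
af_mem = {"mouse-SRR8639063_v2": 4.6, "human_mouse (hybrid)": 5.0}

summary = ratio_summary(kb_mem, af_mem)
print(round(summary["ratios"]["mouse-SRR8639063_v2"], 2))  # 0.91
```

Note that a mean of per-dataset ratios generally differs from the ratio of mean measurements, so the choice of summary should be stated whenever such averages are compared across benchmarks.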
In the mouse-SRR8639063_v2 dataset, kallisto-bustools’ peak memory usage reached 91% of that of alevin-fry (with alevin-fry requiring a maximum of ∼4.6GB of memory [footnote 5] and kallisto-bustools requiring ∼4.2GB of memory). In every other dataset, kallisto-bustools used more peak memory than alevin-fry, with the kallisto-bustools pipeline using at most ∼8 times as much peak memory and, on average, ∼2.92 times as much peak memory as alevin-fry. The peak memory usage of both tools reached their respective maxima on the hybrid human-mouse dataset, where the peak memory usage of kallisto-bustools is ∼7.5GB (which occurs during pseudoalignment) and the peak memory usage of alevin-fry is ∼5GB (which occurs during mapping record collation). While the step of indexing only has to be done once per reference sequence (i.e. with each new organism, or when the reference annotation is updated), we also evaluate the time and memory required to build all indices used in these experiments. This is important, since the peak memory usage during indexing may dictate whether the index can be built on the same machine used for subsequent quantification, or if it must be constructed on a machine with more memory. Fig. 1 shows that, as with the pre-processing of reads, alevin-fry is faster and uses less memory for index construction for each reference considered. The slowest index construction for both tools was for the human-mouse combined transcriptome, where the kallisto-bustools pipeline took ∼15.6 minutes and required ∼10.2GB of memory, while indexing this transcriptome with the alevin-fry pipeline took ∼3 minutes (a ∼5.2 times speedup compared to kallisto) and ∼1.3GB of memory (∼13% of the memory usage of kallisto). When evaluating the time differences, it is important to note that the alevin-fry pipeline can make use of multiple threads when indexing (here we used 10 as in (1)), while the indexing in the kallisto-bustools pipeline is currently restricted to a single thread.
The memory usage in the alevin-fry pipeline does not vary considerably with the number of threads used during indexing. The peak memory reduction during indexing and mapping in alevin-fry arises primarily from alevin-fry’s use of the pufferfish (15) index, while a number of different factors at both the implementation and design level contribute to the runtime improvements. [Footnote 5: The alevin-fry peak memory usage in this dataset happens during the collate step, which can easily be made to operate within a strict desired RAM budget; a feature on which we are currently working.] When assessing the same summary statistics and count comparisons considered by Booeshaghi and Pachter (1) to evaluate the similarity of the resulting quantifications, we find that the cell by gene count matrices produced by both tools are similar under these metrics (Fig. 2). As is expected, these evaluations show that the data summaries are more similar than in the configuration tested in (1). In that comparison, Booeshaghi and Pachter (1) claim that differences in resulting gene expressions between the configurations of the tools tested therein are “irrelevant for downstream analysis” (presumably implying all possible downstream analyses). It is not clear how these comparisons justify such a sweeping claim. Yet, while these comparisons do not necessarily imply that no differences will manifest in downstream processing of the alevin-fry quantified data compared to the kallisto-bustools quantified data, they do suggest that the differences that may arise under this configuration of alevin-fry are likely to be less extreme than differences that may arise in the configuration tested in (1). We also note that, while Booeshaghi and Pachter (1) observe no significant gene sets found when comparing the quantifications of kallisto-bustools and the configuration of alevin-fry that they tested on the pbmc_10k_v3 data, we do observe some genes as detected.
This appears to stem from our use of the most recent release of Seurat (16) (currently version 4.0.0), which modified the default behavior of the FindMarkers() function to “prefilter genes and report fold change using base 2, as is commonly done in other differential expression packages, instead of natural log” (https://satijalab.org/seurat/articles/v4_changes.html). For the purposes of keeping this benchmark like-for-like, we have run both tools in configurations where they must generate their own permit list without the input of an external set of valid but unfiltered (i.e. possible) barcodes. We choose this configuration for two reasons. First, unlike 10x chromium v2 and v3 experiments, many single-cell technologies supported by both pipelines do not provide an external list of barcodes, and so, this is indicative of the general case where pipelines must provide their own method for generating a permit list of barcodes. Second, since alevin-fry does not yet support unfiltered external permit lists, there is not a way to fairly compare it against a method that can take advantage of this information. We consider it a development priority to support this feature in alevin-fry for single-cell technologies where this information is available.
Nonetheless, we tested the effect that requiring a second sort of a (sorted and filtered permit list corrected) BUS file had on the overall runtime, compared to correcting the initial unsorted file with an unfiltered permit list, and processing the unfiltered file in the remainder of the pipeline. To do this, we also ran a configuration of kallisto-bustools where the raw BUS file was first corrected with the external, unfiltered 10x permit list, then the file was sorted, then a permit list was extracted from this sorted file (to allow subsequent filtering of an unfiltered count matrix), and finally, the count step was performed. This process results in an unfiltered matrix which may then be filtered using the generated permit list. Alevin-fry was, on average, ∼2.15 times as fast as kallisto-bustools under this configuration, rather than ∼2.08 times as fast; in other words, the runtime costs of the different kallisto-bustools configurations were very similar for these data. Finally, though we have retained the parallelism settings used in the original benchmark for the purposes of the main results reported in this manuscript, we also evaluated, on one of the larger datasets (pbmc_10k_v3), how both tools scale to a higher thread count of 16. In this case, we found that the total runtime for alevin-fry dropped from 19.7 minutes with 10 threads to 15.6 minutes with 16 threads, and the total runtime for kallisto-bustools went from 38.8 minutes with 10 threads to 37.8 minutes with 16 threads. So, in this case, increasing the thread count by 6 led to a ∼1.26 times increase in the speed of alevin-fry and a ∼1.03 times increase in the speed of kallisto-bustools. Conclusions We find that when alevin-fry is benchmarked in a like-for-like comparison with kallisto-bustools, it is both faster and uses less memory while producing similar results.
Of course, in this configuration, alevin-fry (3), unlike the original alevin (13) or other configurations of alevin-fry, is adopting some of the computational simplifications for which Melsted et al. (2) argue, and the similarity of these results is fully expected. In their manuscript, Booeshaghi and Pachter (1) repeatedly refer to alevin-fry as a “reimplementation” of bustools. This characterization is untrue both in detail and in spirit. The alevin-fry tool has not been designed to reimplement the bustools commands or interface, or specifically to match the implementation of bustools. It has been designed as a way to allow the exploration and configuration, in a unified framework, of a variety of different algorithms for single-cell data pre-processing, many of which don’t currently exist in kallisto-bustools. For example, it implements multiple different methods for generating permit lists, and multiple different algorithms for UMI resolution, including some that correct for UMI sequencing errors, resolve multi-gene UMIs by parsimony, probabilistically (which kallisto-bustools subsequently implemented after it was introduced in alevin (13)), or both, as well as the functionality to quantify the uncertainty of probabilistic resolution through bootstrapping. Of course it is the case that, in designing such a tool after the work of Melsted et al. (2) was published and in widespread use, one should learn from the design decisions of that work that proved to be effective and useful. The main such design decisions in this case are first, the separation of the read mapping from the subsequent processing of barcodes and UMIs via intermediate files (as is also done internally by STARsolo), and second, the arrangement of mapping records relevant to a given corrected barcode subsequently so that cells can be processed in an effectively independent manner. The alevin-fry tool adopts these choices described by Melsted et al.
(2), as we see no reason to avoid relevant design decisions demonstrated by prior tools, that seem to work, when building new tools. We look forward to discussing these design decisions, as well as some novel design choices we have made, when we publish the alevin-fry preprint. We have not completed, to our satisfaction, a thorough investigation of the effect of different mapping approaches, permit list generating methods and UMI correction and resolution strategies provided by alevin-fry across a wide range of tagged-end single-cell RNA-seq data and technologies (which have, in general, distinct characteristics compared to both bulk RNA-seq data and full-length single-cell RNA-seq data). Once we have adequately explored this algorithmic parameter space, we plan to publish a full manuscript describing the design and implementation of the alevin-fry pipeline, highlighting where it derives design decisions from kallisto-bustools and other tools, and where it differs, as well as the effect that different configurations have on runtime and memory performance, the raw count matrices and common downstream analyses, and how those effects may vary in different single-cell technologies. We have described in this manuscript, and demonstrated in the associated code repository and tutorial, how alevin-fry can optionally be configured so as to match the computational complexity of kallisto-bustools as closely as possible. In this like-for-like comparison of these two pipelines, we have shown that, while the estimated gene expressions are similar (at least when assessed in the manner done by Booeshaghi and Pachter (1)), the runtime and memory characteristics are not.
Rather, while using the same benchmarking framework as Booeshaghi and Pachter (1), instead of alevin-fry taking ∼3 times as long to pre-process data (on average) as kallisto-bustools and using many times as much memory in the worst case (1), we find that alevin-fry is both faster and uses less memory than kallisto-bustools. Specifically, alevin-fry is on average ∼2.08 times as fast as kallisto-bustools and consumes, on average, only ∼0.34 times as much peak memory. According to the formulae used in the jupyter (17) notebooks of Booeshaghi and Pachter (1) to estimate costs for performing processing on Amazon Web Services compute instances, pre-processing the pbmc_10k_v3 dataset using the configuration of the alevin-fry pipeline we have tested in this manuscript costs $0.05, which is half of the cost of running the kallisto-bustools pipeline ($0.10). Furthermore, if one needs to first build the reference index, the peak runtime memory for kallisto-bustools exceeds 8 GB and so a more expensive instance would be necessary. In that case, building the reference index and processing the pbmc_10k_v3 would cost $0.23 using kallisto-bustools (while the cost would remain at $0.05 using alevin-fry, even if index construction is included).
The cost of the alevin-fry pipeline we have benchmarked in this manuscript is 39 times smaller than what was reported in (1) for this dataset, while the cost of the kallisto-bustools pipeline is twice as large due to the increased memory requirements when using the newer human transcriptome annotation.

If one is comfortable with the simplifying assumptions being made, the performance profiles observed in this manuscript provide a compelling case for the use of this configuration of alevin-fry for the rapid and lightweight pre-processing of single-cell RNA-seq data.

Finally, it is important to note that alevin-fry is still undergoing active development and improvement, which is, in part, why no full preprint has yet been published describing the tool and underlying methods and implementation in detail. Of course, one can use the tool today to obtain gene expression counts for single-cell data, but we expect that alevin-fry will continue to advance and expand to offer more capabilities and to be further optimized.

Disclosure. RP is a co-founder of Ocean Genomics Inc.

Funding. This work is supported by the US National Institutes of Health [R01HG009937], and the National Science Foundation [CCF-1750472, CNS-1763680]. The funders had no role in this research, or the decision to publish.

References
1. A. Sina Booeshaghi and Lior Pachter. Benchmarking of lightweight-mapping based single-cell RNA-seq pre-processing. bioRxiv, 2021. doi: 10.1101/2021.01.25.428188.
2. Páll Melsted, A. Sina Booeshaghi, Fan Gao, Eduardo Beltrame, Lambda Lu, Kristján Eldjárn Hjorleifsson, Jase Gehring, and Lior Pachter. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv, 2019. doi: 10.1101/673285.
3. Hirak Sarkar, Avi Srivastava, Mohsen Zakeri, Scott Van Buren, Naim U Rashid, Michael Love, and Rob Patro. Accurate, efficient, and uncertainty-aware expression quantification of single-cell RNA-seq data. November 2020. doi: 10.6084/m9.figshare.13198100.v1.
4. Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. Alignment and mapping methodology influence transcript abundance estimation. Genome Biology, 21(1):1–29, 2020.
5. Hirak Sarkar, Mohsen Zakeri, Laraib Malik, and Rob Patro. Towards selective-alignment: Bridging the accuracy gap between alignment-based and alignment-free transcript quantification. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 27–36, 2018.
6. Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5):525–527, 2016.
7. Matthew D. Shirley, Viveksagar K. Radhakrishna, Javad Golji, and Joshua M. Korn. PISCES: a package for rapid quantitation and quality control of large scale mRNA-seq datasets. bioRxiv, 2020. doi: 10.1101/2020.12.01.390575.
8. Avi Srivastava, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Carl Kingsford, and Rob Patro. Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism. bioRxiv, 2021. doi: 10.1101/2021.01.17.426996.
9. Ash Blibaum, Jonathan Werner, and Alexander Dobin. STARsolo: single-cell RNA-seq analyses beyond gene expression. 2019. doi: 10.7490/F1000RESEARCH.1117634.1.
10. Geo Pertea and Mihaela Pertea. GFF utilities: GffRead and GffCompare. F1000Research, 9:304, September 2020. doi: 10.12688/f1000research.23297.2.
11. Andrew D Yates, Premanand Achuthan, Wasiu Akanni, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Andrey G Azov, Ruth Bennett, et al. Ensembl 2020. Nucleic Acids Research, 48(D1):D682–D688, 2020.
12. Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4):417–419, 2017.
13. Avi Srivastava, Laraib Malik, Tom Smith, Ian Sudbery, and Rob Patro. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology, 20(1):1–16, 2019.
14. Stefan Niebler, André Müller, Thomas Hankeln, and Bertil Schmidt. Raindrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads. BMC Bioinformatics, 21(1):1–14, 2020.
15. Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, and Rob Patro. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34(13):i169–i177, 2018.
16. Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck, Shiwei Zheng, Andrew Butler, Maddie J. Lee, Aaron J. Wilk, Charlotte Darby, Michael Zagar, Paul Hoffman, Marlon Stoeckius, Efthymia Papalexi, Eleni P. Mimitou, Jaison Jain, Avi Srivastava, Tim Stuart, Lamar B. Fleming, Bertrand Yeung, Angela J. Rogers, Juliana M. McElrath, Catherine A. Blish, Raphael Gottardo, Peter Smibert, and Rahul Satija. Integrated analysis of multimodal single-cell data. bioRxiv, 2020. doi: 10.1101/2020.10.12.335331.
17. Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. Jupyter notebooks - a publishing format for reproducible computational workflows. In Fernando Loizides and Birgit Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90, Netherlands, 2016. IOS Press.
10_1101-2021_02_10_430705 ---- VIA: Generalized and scalable trajectory inference in single-cell omics data

Shobana V. Stassen 1, Gwinky G. K. Yip 1, Kenneth K. Y. Wong 1,3, Joshua W. K. Ho 2,4 and Kevin K. Tsia 1,3

1 Department of Electrical & Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
2 School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
3 Advanced Biomedical Instrumentation Centre, Hong Kong Science Park, Shatin, New Territories, Hong Kong
4 Laboratory of Data Discovery for Health, Hong Kong Science Park, Shatin, New Territories, Hong Kong

Abstract
Inferring cellular trajectories using a variety of omic data is a critical task in single-cell data science. However, prediction and thus biologically meaningful discovery of cell fates are challenged by the sheer size of single-cell data, diverse omic data types, and their complex data topologies. We present VIA, a scalable trajectory inference algorithm that uses lazy-teleporting random walks to accurately reconstruct complex cellular trajectories beyond tree-like pathways (e.g. cyclic or disconnected structures), and to discover less populous lineages or those otherwise obscured in other methods. VIA outperforms existing algorithms in recapitulating cell fates/lineages, and also mitigates loss of global connectivity information in large datasets beyond a million cells. Furthermore, VIA demonstrates versatility by distilling cellular trajectories in single-cell transcriptomic, epigenomic, proteomic and morphological data, showing new promise in scalable, multifaceted single-cell analysis to explore novel biological processes.
Introduction
Single-cell omics data capture snapshots of cells that catalog cell types and molecular states with high precision. These high-content single-cell readouts can be harnessed to model evolving cellular heterogeneity and track dynamical changes of cell fates in tissue, tumour, and cell population. However, current computational methods face four critical challenges. First, it remains difficult to accurately reconstruct high-resolution cell trajectories and detect cell fates embedded within them. Even the few algorithms which automate cell fate detection (e.g., Slingshot [1] and Palantir [2]) exhibit low sensitivity and are highly susceptible to changes in input parameters. Second, current trajectory inference (TI) methods predominantly work well on tree-like trajectories (e.g. Slingshot, Monocle2 [3]), but lack the generalisability to infer disconnected, cyclic or hybrid topologies without imposing restrictions on transitions and causality [4]. Third, the growing scale of single-cell data, notably cell atlases of whole organisms [6,7], embryos [8,9] and human organs [10], exceeds the existing TI capacity, not just in runtime and memory, but in preserving global connectivity, which is often lost after extensive dimension reduction or subsampling. Fourth, fueling the advance in single-cell technologies is the ongoing pursuit to understand cellular heterogeneity from a broader perspective beyond transcriptomics. However, the applicability of TI to a broader spectrum of single-cell data has yet to be fully exploited.

To overcome these recurring challenges, we present VIA, a graph-based TI algorithm that uses a new strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks integrated with Markov chain Monte Carlo (MCMC) refinement.
VIA relaxes common constraints on traversing the graph by allowing cyclic and temporally reversed movements, and thus robustly detects cell fates involving complex transitions that are otherwise obscured in other methods. VIA outperforms popular TI algorithms in terms of capturing cellular trajectories not limited to multi-furcations and trees, but also disconnected and cyclic topologies (Supplementary Fig. S1). Compared to existing TI methods, VIA is highly scalable with respect to number of cells (10^2 to >10^6 cells) and features, without requiring extensive dimensionality reduction or subsampling which compromise global information.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430705; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
We demonstrate VIA’s accuracy, scalability, topological generalizability and multi-omic versatility across multiple modalities by investigating 10 simulated and experimental datasets (Supplementary Table S1), ranging from single-cell RNA-sequencing (scRNA-seq), single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq), multi-omics integration, to mass and imaging cytometry.

Figure 1. General workflow of VIA algorithm. Step 1: Single-cell level graph is clustered such that each node represents a cluster of single cells (computed by our clustering algorithm PARC [11]). The resulting cluster graph forms the basis for subsequent random walks. Step 2: 2-stage pseudotime computation: (i) The pseudotime (relative to a user-defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk along an undirected graph. At each step, the walk (with small probability) can remain (orange arrows) or teleport (red arrows) to any other state. (ii) Edges are then forward biased based on the expected hitting time (see forward biased edges illustrated as the imbalance of double-arrowhead size). The pseudotime is further refined on the directed graph by running Markov chain Monte Carlo (MCMC) simulations (see 3 highlighted paths starting at root). Step 3: Consensus vote on terminal states based on vertex connectivity properties of the directed graph. Step 4: Lineage likelihoods computed as the visitation frequency under lazy-teleporting MCMC simulations. Step 5: Visualization that combines network topology and single-cell level pseudotime/lineage probability properties onto an embedding using GAMs, as well as unsupervised downstream analysis (e.g. gene expression trend along pseudotime for each lineage).
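Stage (i) of Step 2 in the Figure 1 caption can be illustrated with a small numerical sketch. The code below is not the VIA implementation: the way laziness and teleportation are mixed into the transition matrix, the default probabilities, and the function names `lazy_teleporting_transition` and `expected_hitting_times` are all assumptions made for illustration. It also obtains hitting times by solving the standard first-step linear system rather than through a spectral (eigenvector-based) formulation.

```python
import numpy as np


def lazy_teleporting_transition(weights, lazy=0.1, teleport=0.01):
    """Build a row-stochastic matrix for a lazy-teleporting walk.

    Assumed mixing (illustrative only): with probability `teleport` the walk
    jumps to a uniformly random node; otherwise it stays put with probability
    `lazy` or follows an edge in proportion to its weight. The graph is
    assumed to have no isolated nodes.
    """
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]
    walk = weights / weights.sum(axis=1, keepdims=True)
    move = (1.0 - lazy) * walk + lazy * np.eye(n)
    return (1.0 - teleport) * move + teleport * np.full((n, n), 1.0 / n)


def expected_hitting_times(transition, root):
    """Expected number of steps to first reach each node when starting at `root`.

    For a target j, h(i) = 1 + sum_k P[i, k] * h(k) for i != j and h(j) = 0,
    which is solved as a linear system on the nodes other than j.
    """
    n = transition.shape[0]
    times = np.zeros(n)
    for target in range(n):
        if target == root:
            continue
        keep = [i for i in range(n) if i != target]
        sub = transition[np.ix_(keep, keep)]
        h = np.linalg.solve(np.eye(n - 1) - sub, np.ones(n - 1))
        times[target] = h[keep.index(root)]
    return times


# Toy cluster graph: a chain of four clusters, 0 - 1 - 2 - 3, rooted at cluster 0.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
pseudotime = expected_hitting_times(lazy_teleporting_transition(chain), root=0)
```

On this chain the hitting times grow with graph distance from the root, which is the property that lets them serve as an initial pseudotime. Because the system is solved on the cluster graph, its size is the number of clusters rather than the number of cells.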
Results
Algorithm
VIA first represents the single-cell data as a cluster graph (i.e. each node is a cluster of single cells), computed by our recently developed data-driven community-detection algorithm, PARC, which allows scalable clustering whilst preserving global properties of the topology needed for accurate TI [11] (Step 1 in Fig. 1). The cell fates and their lineage pathways are then computed by a two-stage probabilistic method, which is the key algorithmic contribution of this work (Step 2 in Fig. 1, see Methods). In the first stage, VIA models the cellular process as a modified random walk that allows degrees of laziness (remaining at a node/state) and teleportation (jumping to any other node/state) with pre-defined probabilities. The pseudotime, and thus the graph directionality, can be computed based on the theoretical hitting times of nodes (see the theory and derivation in Methods and Supplementary Note 2). The lazy-teleporting behavior prevents the expected hitting time from converging to a local distribution in the graph as otherwise occurs in regular random walks, especially when the sample size grows [12]. More specifically, the laziness and teleportation factors regulate the weights given to each eigenvector-value pair in the expected hitting time formulation such that the stationary distribution (given by the local-node degree-properties in regular walks) does not overwhelm the global information provided by other ‘eigen-pairs’.
Moreover, the computation does not require subsetting the first k eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of number of cells, but equal to the number of clusters. Hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime. Consequently in VIA, the modified walk on a cluster-graph not only enables scalable pseudotime computation for large datasets in terms of runtime, but also preserves information about the global neighborhood relationships within the graph. In the second stage of Step 2, VIA infers the directionality of the graph by biasing the edge-weights with the initial pseudotime computations, and refines the pseudotime through MCMC simulations. Next (Step 3 in Fig. 1), the MCMC-refined graph-edges of the lazy-teleporting random walk enable accurate predictions of terminal cell fates through a consensus vote of various vertex connectivity properties derived from the directed graph. The cell fate predictions obtained using this approach are more robust to changes in input data and parameters compared to other TI methods (Supplementary Fig. S1 and Fig. S16). Trajectories towards identified terminal states are resolved using lazy-teleporting MCMC simulations (Step 4 in Fig. 1). The probabilistic approach and relaxation of edge constraints allowed by VIA in computing differentiation pathways and pseudotime enables greater sensitivity to cell fates and complex trajectories, and makes allowances for asynchrony in differentiation processes by avoiding prematurely imposing constraints on node-to-node mobility.
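The second stage of Step 2 (forward-biasing the edges with an initial pseudotime, then refining the pseudotime by simulation) can be sketched as follows. This is an illustration under stated assumptions, not VIA's actual procedure: the logistic bias, the `sharpness` parameter, and the function names `forward_biased_transition` and `simulate_first_passage` are hypothetical, and the refinement shown is a plain Monte Carlo estimate of mean first-passage steps on the biased chain, with the laziness and teleportation terms left out for brevity.

```python
import numpy as np


def forward_biased_transition(weights, pseudotime, sharpness=1.0):
    """Directed, row-stochastic matrix that favours edges towards later pseudotime.

    Assumption: each undirected edge weight is multiplied by a logistic
    function of the pseudotime difference, so backward moves are made
    unlikely but never forbidden.
    """
    weights = np.asarray(weights, dtype=float)
    pseudotime = np.asarray(pseudotime, dtype=float)
    # delta[i, j] > 0 when node j lies later in pseudotime than node i.
    delta = pseudotime[None, :] - pseudotime[:, None]
    bias = 1.0 / (1.0 + np.exp(-sharpness * delta))
    biased = weights * bias
    return biased / biased.sum(axis=1, keepdims=True)


def simulate_first_passage(transition, root, n_walks=200, max_steps=1000, seed=0):
    """Mean step at which each node is first visited, over simulated walks."""
    rng = np.random.default_rng(seed)
    n = transition.shape[0]
    first_visit = np.full((n_walks, n), np.nan)
    for w in range(n_walks):
        state = root
        first_visit[w, root] = 0
        for step in range(1, max_steps + 1):
            state = int(rng.choice(n, p=transition[state]))
            if np.isnan(first_visit[w, state]):
                first_visit[w, state] = step
                if not np.isnan(first_visit[w]).any():
                    break  # every node has been reached in this walk
    # Nodes never reached within max_steps in a walk are ignored for that walk.
    return np.nanmean(first_visit, axis=0)


# Toy chain 0 - 1 - 2 - 3 with an initial pseudotime that increases along it.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
directed = forward_biased_transition(chain, pseudotime=[0.0, 1.0, 4.0, 9.0])
refined = simulate_first_passage(directed, root=0)
```

Keeping backward transitions possible, rather than thresholding them away, mirrors the relaxation of edge constraints described in the paragraph above.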
Other methods resort to constraints such as reducing the graph to a tree, imposing unidirectionality by thresholding edges based on pseudotime directionality, removing outgoing edges from terminal states [13,2] and computing shortest paths for pseudotime [2,1]. VIA’s probabilistic approach to graph-traversal allows it to infer cell fates when the underlying data spans combinations of multifurcating trees and cyclic/disconnected topologies, i.e. topologies, and hence lineages, that are often obscured in existing TI methods (Supplementary Fig. S1). Together, these four steps facilitate holistic topological visualization of TI on the single-cell level (e.g. using UMAP or PHATE embeddings [14,15]) and other data-driven downstream analyses such as recovering gene expression trends (Methods; Step 5 in Fig. 1).
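Step 4 describes lineage likelihoods as visitation frequencies under simulated walks. A minimal sketch of that idea is given below, assuming a row-stochastic transition matrix and a set of terminal states are already available. The function name `lineage_visit_frequency`, the stopping rule (a walk ends at the first terminal state it reaches) and the toy graph are assumptions for illustration; laziness and teleportation could be mixed into the transition matrix beforehand.

```python
import numpy as np


def lineage_visit_frequency(transition, root, terminals,
                            n_walks=500, max_steps=1000, seed=0):
    """For each terminal state, the fraction of walks ending there that visited each node."""
    rng = np.random.default_rng(seed)
    n = len(transition)
    terminals = list(terminals)
    visits = {t: np.zeros(n) for t in terminals}
    arrivals = {t: 0 for t in terminals}
    for _ in range(n_walks):
        state = root
        seen = {root}
        for _ in range(max_steps):
            if state in arrivals:
                break  # reached a terminal state
            state = int(rng.choice(n, p=transition[state]))
            seen.add(state)
        if state in arrivals:
            arrivals[state] += 1
            visits[state][list(seen)] += 1
    # Walks that never reach a terminal state within max_steps are discarded.
    return {t: visits[t] / arrivals[t] if arrivals[t] else np.zeros(n)
            for t in terminals}


# Toy bifurcation: root 0 -> intermediate 1 -> terminal 2 or terminal 3.
bifurcation = np.array([[0.0, 1.0, 0.0, 0.0],
                        [0.0, 0.0, 0.5, 0.5],
                        [0.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0]])
likelihoods = lineage_visit_frequency(bifurcation, root=0, terminals=[2, 3])
```

In this toy case the root and the intermediate node are visited on every walk towards either terminal state, while each terminal state appears only in its own lineage. On a real cluster graph the frequencies fall between 0 and 1 and can be read as soft lineage memberships.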
VIA accurately infers trajectories in diverse scRNA-seq data
VIA recapitulates differentiation topologies and identifies elusive cell fates across a wide range of transcriptomic data. We first showcase the ability of VIA to explore large single-cell transcriptomic datasets by employing the 1.3-million-cell mouse organogenesis cell atlas (MOCA) [8]. While this dataset is inaccessible to most TI methods from a runtime and memory perspective, VIA can efficiently resolve the underlying developmental heterogeneity, including 9 major trajectories (Fig. 2a,b) with a runtime of ~40 min, compared to the next fastest method which has a runtime of at least 4 hours [2] (Supplementary Table S3). VIA preserves wider neighborhood information and reveals a globally connected topology of MOCA which is otherwise lost in the previous method. Broadly speaking, the overall cluster graph of VIA consists of three main branches that concur with the known developmental process at early organogenesis [16] (Fig. 2a).
It starts from the root stem which has a high concentration of E9.5 early epithelial cells made of multiple sub-trajectories (e.g. epidermis, nose and foregut/hindgut epithelial cells derived from the ectoderm and endoderm). The stem is connected to two distinct lineages: 1) mesenchymal cells originated from the mesoderm which arises from interactions between the ectoderm and endoderm [17] and 2) neural tube/crest cells derived from neurulation when the ectoderm folds inwards [1]. The sparsity of early cells (only ~8% are E9.5) and the absence of earlier ancestral cells make it particularly challenging to capture the simultaneous development of trajectories. However, the overall pseudotime structure presented by VIA is reasonable. For instance, at the junction of the epithelial-mesenchymal branch, we find early mesenchymal cells from E9.5-E10.5. Cells from later mesenchymal developmental stages (e.g. myocytes from E12.5-E13.5) reside at the leaves of branches. Similarly, at the junction of epithelial-neural tube, we find dorsal tube neural cells and notochord plate cells which are predominantly from E9.5-E10.5 and more developed neural cells at the tips (e.g. excitatory and inhibitory neurons from E12.5-E13.5). VIA also places the other dispersed groups of trajectories (e.g. endothelial, hematopoietic) in biologically relevant neighborhoods (Supplementary Note 3, Supplementary Fig. S11). While VIA’s connected topology offers a coarse-grained holistic view, it does not compromise the ability to delineate individual lineage pathways (consistent with those found by Cao et al. [8]) as shown in Fig. 2c and Supplementary Fig. S11. TI using VIA uniquely preserves both the global and local structures of the data and is thus particularly favorable for biological exploration involving large datasets, especially for comparative studies involving cell atlases [19].
Whilst manifold-learning methods are often used to extensively reduce dimensionality to mitigate the computational burden of large single-cell datasets, they tend to incur loss of global information and be sensitive to input parameters. VIA is sufficiently scalable to bypass such a step, and therefore retains a higher degree of neighborhood information when mapping large datasets. This is in contrast to Monocle3’s [8] UMAP-reduced inputs that reveal different disconnected super-groups and fluctuating connectivity depending on input parameters (see Supplementary Fig. S12-15 for the biologically consistent structures proposed by VIA across a range of parameters compared to the contradicting cell super-groups and connectivity suggested by a UMAP-based TI interpretation).

We next demonstrated the applicability of VIA in single-cell multi-omics analysis by inferring murine Isl1+ cardiac progenitor cell (CPC) transition states using both single-cell transcriptomic and chromatin accessibility information [20] (Fig. 2d-i). VIA consistently uncovers the bifurcating lineages towards the
endothelial and cardiomyocyte fates based on the scRNA-seq, scATAC-seq datasets and their data integration (see Methods for data integration). Other methods such as Palantir and Slingshot, that are also applicable to non-transcriptomic data, fail to uncover the two main lineages in the individual as well as the more challenging integrated multi-omic data. They typically only detect one of the two lineages and instead falsely detect several intermediate and early stages as final cell fates (see Fig. 2i for prediction accuracy). PAGA does not offer automated cell fate prediction and is therefore not benchmarked for this dataset. VIA detects lineage pathways in both the scRNA-seq and scATAC-seq that can be used to interpret relationships between transcription factor dynamics and gene expression in an unsupervised manner.
VIA automatically generates a pseudotemporal ordering of cells (without requiring manual selection of relevant cells as done in Jia et al. [20]) along respective lineages and their marker-TF pairs (see Fig. 2f and Supplementary Fig. S8e). The highlighted gene and TF pairs in the cardiac lineage show a strong correlation between expression and accessibility of Gata and Homeobox Hox genes which are known to be related to the regulation of cardiomyocyte proliferation [23,24,25]. VIA’s reliable performance against user-reconfiguration (choice of components, individual or integrated omic data) suggests it can be used for transferable interpretation between scRNA-seq and chromatin accessibility data.

We further tested VIA on a wider scope of (small- to mid-sized) scRNA-seq datasets, including B-cell differentiation [26], hematopoiesis [2,27], embryonic stem (ES) cell differentiation in embryoid bodies [15], and endocrine differentiation (~10^2 to 10^4 cells). By comparing VIA with top-performing and popular TI algorithms, e.g. PAGA [28], Palantir, Slingshot and CellRank [13] (see Methods, and Supplementary Fig. S1-7 for full analysis), we showed that VIA consistently outperforms other methods in terms of both runtime (in some cases by several orders of magnitude; see Supplementary Table S3 for runtime comparison) and more robust and accurate lineage prediction across a wide range of pre-processing and algorithmic parameters. VIA’s relaxation of graph traversal to permit cyclic sub-paths (see Supplementary Fig. S1) and movements that are temporally reversed augments its sensitivity to lineages. Notably, across a wide range of input parameter choices, VIA more consistently identified less populous lineages that were at best detected by other methods for a narrow sweet spot of parameters.
For example, VIA reliably delineates the megakaryocyte, conventional and plasmacytoid dendritic cell (cDC and pDC) lineages in human hematopoiesis (Fig. 2m-o, Supplementary Fig. S3-4 for pseudotime and graph-topological gene trends for all lineages); and Delta cells (3%) during endocrine progenitor cell differentiation (Fig. 2j-l, Supplementary Fig. S6 for pseudotime and topological gene trends for all lineages), as evidenced by the corresponding gene-expression trend analysis and parameter stress tests. Interestingly, we find that VIA often detects 2 Beta cell subpopulations (Supplementary Fig. S6b,d,f) that express typical Beta markers like Dlk1, Pdx1, but differ in their expression of Ins1 and Ins2 (Supplementary Fig. S6d). Such a Beta cell heterogeneity [29,30], whereby the immature Beta-2 population strongly expresses Ins2 and weakly expresses Ins1, and the mature Beta-1 population expresses both types of Ins [30], can also be reconciled based on the position of the Beta-2 cluster on the VIA graph (Supplementary Fig. S6f).
Figure 2. VIA accurately infers trajectories in diverse scRNA-seq datasets. (a) VIA cluster-graph trajectory where nodes are colored by pseudotime, and branches are shaded according to major lineages of 1.3-million-cell mouse organogenesis cell atlas (MOCA).
The VIA analysis (which is independent of the choice of visualization) produces a connected structure with linkages between some of the major cell types that have a tendency to become segregated in a UMAP-based TI analysis (see Supplementary Fig. S12-15). The stem (root) branch consists of epithelial cells derived from ectoderm and endoderm, leading to two main branches: 1) the mesenchymal and 2) the neural tube and neural crest. Other major groups are placed in the biologically relevant neighborhoods, such as the adjacencies between hepatocyte and epithelial trajectories; the neural crest (comprising glial cells and PNS neurons) and the neural tube; as well as the links between early mesenchyme with both the hematopoietic cells and the endothelial cells (see Supplementary Note 3). (b) (Left) Single-cell PHATE embedding colored by major cell groups. (Right) Single-cell PHATE embedding colored by VIA pseudotime. (c) Lineage pathways and probabilities of neuronal, myocyte and WBC lineages (see Supplementary Fig. S6 for other lineages). (d) scRNA-seq and scATAC-seq data of Isl1+ cardiac progenitors (CPs) integrated using Seurat3 before VIA TI analysis and PHATE visualization. Cells are colored by annotated cell-type and experimental modality. (e) Cells are colored by VIA pseudotime with the VIA-inferred trajectory towards endothelial and myocyte lineages projected on top. (f) Marker gene expression and chromatin accessibility for gene-TF pairs along pseudotime axis for Cardiomyocyte lineage. (g) VIA-graph trajectory with nodes colored by pseudotime shows bifurcation to endothelial and myocyte cells in scRNA-seq cells. (h) scATAC-seq of Isl1+ CPs: VIA-graph again shows bifurcation after intermediate CP stage. (i) Lineage prediction accuracy (F1-score) for methods that offer automated lineage detection and are not limited to transcriptomic data.
(k) Pancreatic Islets: Colored by VIA pseudotime with detected terminal states shown in red and annotated based on known cell type as Alpha, Beta-1, Beta-2, Delta and Epsilon lineages, where Beta-2 is the Ins1-low, Ins2+ Beta subtype (Supplementary Fig. S7). (l) VIA-inferred cluster-level pathway shows gene regulation along the endocrine progenitor (EP) to Delta lineage specification (top) and the Sst gene-expression trend shows the rise of Sst in the Delta lineage (see Supplementary Fig. S7 for the remaining lineages). (m) Prediction accuracy of the 4 major endocrine cell types when varying the number of HVGs selected in pre-processing, and the number of PCs. (n) Human CD34+ hematopoiesis with 6 detected cell fates annotated. (o) Lineage pathway and gene-pseudotime trend shown for the CD41 megakaryocytic cells (see Supplementary Fig. S3 for other lineages). (p) Prediction accuracy of 6 cell fates when varying the number of K (nearest neighbors) and PCs. Note that Slingshot in default mode ("V2") uses GMM clustering and "V1" uses K-means clustering (allowing for over-clustering, K=15, to increase sensitivity). Runtime of each method is also highlighted below the chart.

VIA enables multi-omic analysis beyond transcriptomic data

Broad applicability of TI beyond transcriptomic analysis is increasingly critical, but existing methods have limitations contending with the disparity in the data structure (e.g. sparsity and dimensionality) across a variety of single-cell data types, and oftentimes are designed with a view to handling only transcriptomic data (e.g. methods using RNA velocity to infer directionality).

First, we employ VIA to analyze human scATAC-seq profiles (from CD34+ human bone marrow) (Fig. 3a), and find that the continuous landscape of hematopoiesis generally mirrors the scRNA-seq human hematopoietic data (Fig. 2c).
The intrinsic sparsity of scATAC-seq data poses a challenge that can be alleviated by the choice of pre-processing pipeline, and we see that VIA consistently predicts the expected hierarchy of furcations that leads to the lymphoid, myeloid and erythroid lineages for two commonly accepted pre-processing protocols [31, 27] (Methods). This again holds across a wide range of input parameters, as shown by the changes in the accessibility of TF motifs associated with known regulators, e.g. Gata1 (erythroid), Cebpd (myeloid) (Fig. 3b-d, Supplementary Fig. S7).

We next investigated whether VIA can cope with a significant drop in data dimensionality (10-100), as often presented in flow/mass cytometry data, and still delineate continuous biological processes. We ran VIA on a time-series mass cytometry dataset (28 antibodies, 90K cells) capturing the differentiation of murine embryonic stem cells (ESCs) toward mesoderm cells (Day 0 - Day 11) [32]. Unlike the previous analysis [32] of the same data, which required chronological labels to visualize the developmental hierarchy, we ran VIA without such supervised adjustments and accurately captured the sequential development.
VIA computed the trajectories with a faster runtime (2 minutes, versus 6 hours for Slingshot; see Table S3), detecting 3 terminal states corresponding to cells in the final developmental stages: 2 corresponding to the main region of Day 10-11 (marked by Pdgfra, Cd44 and Gata4 expression), and a small population of Day 10-11 cells expressing EpCAM (~0.5% of cells), which is obscured in other methods (e.g. Palantir, Slingshot) (Fig. 3e-h, Fig. S9e,f).

Finally, we tested the adaptability of VIA to infer cell-cycle stages based on label-free single-cell biophysical morphology (38 features, see Supplementary Table S4 and Table S5) profiled by our recently developed high-throughput imaging flow cytometer, called FACED [33]. VIA reliably reconstructed the continuous cell-cycle progression through the G1, S and G2/M phases of two different types of live breast cancer cells, as validated by the single-cell fluorescent (DNA dye) images captured by the same system (Methods) (Fig. 3i-k for MCF7, Supplementary Fig. S10 for MDA-MB231). Intriguingly, the pseudotime ordering given by VIA reveals not only the known cell growth in size and mass [34] and the general conservation of cell mass density [35] (as derived from the FACED images (Methods)) throughout the G1/S/G2 phases, but also a slow-down trend during the G1/S transition, consistent with the lower protein-accumulation rate during S phase [36] (Fig. 3l, Supplementary Fig. S10f,g). The variation in biophysical textures (e.g. phase entropy) along the VIA pseudotime likely relates to known architectural changes of chromosomes and cytoskeletons during the cell cycle (Fig. 3l, Fig. S10f,g).
These results further substantiate the growing body of work [37–40] on imaging biophysical cytometry for gaining a mechanistic understanding of biological systems, especially when combined with omics analysis [41].

Concluding Remarks

Overall, VIA offers an advancement to TI methods to study a diverse range of single-cell omic data, including those targeted by many cell-atlas initiatives. By combining lazy-teleporting random walks and MCMC simulations, VIA relaxes common constraints on graph traversal and causality. This enables accurate lineage prediction that is robust to parameter configuration for a variety of complex topologies and rarer lineages obscured in other methods. Our stress tests showed that the modeled developmental landscape in other methods is vulnerable to user parameter choice, which can incur fragmentation or spurious linkages, so that these methods only yield biologically sensible lineages for a narrow sweet spot of parameters (see the summary in Supplementary Fig. S16). For example, due to algorithmic measures taken to restrict permissible graph-edge transitions and progressively reduce the inherent dimensionality (e.g. PCA followed by subsetting the number of diffusion components), other algorithms struggle to delineate obscure lineages and maintain neighborhood relationships. VIA's wider bandwidth of accuracy, superior runtime and preservation of global graph properties for very large datasets offer a unique and well-suited approach for multifaceted exploratory analysis to uncover novel biological processes, potentially those deviated from healthy trajectories.
Figure 3. VIA infers trajectories in single-cell multi-omic and image datasets. (a) Major lineages of human hematopoiesis (profiled by scATAC-seq) projected onto the UMAP embedding. Lineages are colored by FACS-sorted labels [27]. (b) VIA cluster-graph topology colored by VIA pseudotime. (c) Trajectory, pseudotime and detected terminal states (red) projected onto the UMAP embedding. (d) F1-scores (on the k-mer Z-score input) for terminal state prediction by different TI methods (for a fixed KNN = 20). Terminal states include megakaryocyte–erythroid progenitor (MEP), common lymphoid progenitor (CLP), plasmacytoid dendritic cell (pDC) and monocyte (Mono) lineages. The comparisons show that VIA's accuracy remains high across a wide range of PCs. (e) Differentiation of mESC to mesoderm cells measured by single-cell mass cytometry.
UMAP embedding is colored by different measurement time points (Day 0-11). (f) VIA cluster graph with 3 detected terminal nodes (red) and colored by pseudotime. (g) VIA results projected onto the single-cell UMAP embedding show that the 3 terminal states correspond to Day 10/11 regions. (h) Correlation of inferred pseudotime and day-labels achieved by different TI methods. The benchmark was done across different numbers of KNN (using all 28 antibodies). (i) Label-free cell cycle progression tracking based on FACED imaging cytometry. The PHATE embedding is constructed using 38 biophysical/morphological features computed from images of human breast cancer cells (MCF7) (see Supplementary Fig. S10 for additional results using another breast cancer cell type (MDA-MB231)). The embedding is colored by the known cell cycle stages given by the DNA fluorescence images (obtained from the same system). (j) VIA graph topology colored by pseudotime. (k) VIA trajectory and pseudotime projected on the embedding. (l) "Biophysical" feature expressions (Z-score normalized) over pseudotime (see Supplementary Table S4-5 for detailed definitions of the features).
Methods

VIA Algorithm

VIA applies a scalable probabilistic method to infer cell state dynamics and differentiation hierarchies by organizing cells into trajectories along a pseudotime axis in a nearest-neighbor graph, which is the basis for subsequent random walks. Single cells are represented by graph nodes that are connected based on their feature similarity, e.g. gene expression, transcription factor accessibility motif, protein expression or morphological features of cell images. A typical routine in VIA mainly consists of four steps:

1. Accelerated and scalable cluster-graph construction. VIA first represents the single-cell data in a k-nearest-neighbor (KNN) graph where each node is a cluster of single cells. The clusters are computed by our recently developed clustering algorithm, PARC [11]. In brief, PARC is built on hierarchical navigable small world (HNSW [58]) accelerated KNN graph construction and a fast community-detection algorithm (Leiden method [42]), which is further refined by data-driven pruning. The combination of these steps enables PARC to outperform other clustering algorithms in computational run-time, scalability in data size and dimension (without relying on subsampling of large-scale, high-dimensional single-cell data (>1 million cells)), and sensitivity of rare-cell detection. We employ the cluster-level topology, instead of a single-cell-level graph, for TI as it provides a coarser but clearer view of the key linkages and pathways of the underlying cell dynamics without imposing constraints on the graph edges.
Together with the strength of PARC in clustering scalability and sensitivity, this step critically allows VIA to faithfully reveal complex topologies, namely cyclic, disconnected and multifurcating trajectories (Supplementary Fig. S1).

2. Probabilistic pseudotime computation. The trajectories are then modeled in VIA as (i) lazy-teleporting random walk paths along which the pseudotime is computed and further refined by (ii) MCMC simulations. The root is a single cell chosen by the user. These two sub-steps are detailed as follows:

(i) Lazy-teleporting random walk: We first compute the pseudotime as the expected hitting time of a lazy-teleporting random walk on the undirected cluster-graph generated in Step 1. The lazy-teleporting nature of this random walk ensures that as the sample size grows, the expected hitting time of each node does not converge to the stationary probability given by local node properties, but instead continues to incorporate the wider global neighborhood information [12]. Here we outline the closed-form expression of the hitting time of this modified random walk; the detailed derivation is given in Supplementary Note 2.

The cluster graph constructed in VIA is mathematically defined as a weighted connected graph $G(V, E, W)$ with a vertex set $V$ of $n$ vertices (or nodes), i.e. $V = \{v_1, \cdots, v_n\}$, and an edge set $E$, i.e. a set of ordered pairs of distinct nodes. $W$ is an $n \times n$ weight matrix whose entries $w_{ij} \geq 0$ are the edge weights assigned to the edges $(v_i, v_j)$ between nodes $i$ and $j$. For an undirected graph, $w_{ij} = w_{ji}$. The $n \times n$ probability transition matrix, $P$, of a standard random walk on this graph $G$ can be given by
$$P = D^{-1} W \tag{1}$$

where $D$ is the $n \times n$ degree matrix, which is a diagonal matrix of the weighted sum of the degree of each node, i.e. the matrix elements are expressed as

$$d_{ij} = \begin{cases} \sum_{k} w_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $k$ are the neighbouring nodes connected to node $i$. Hence, $d_{ii}$ (which can be reduced as $d_i$) is the degree of node $i$. We next consider a lazy random walk, defined as $Z$, with probability $(1 - x)$ of being lazy (where $0 < x < 1$), i.e. staying at the same node, then

$$Z = xP + (1 - x)I \tag{3}$$

where $I$ is the identity matrix. When teleportation occurs with a probability $(1 - \alpha)$, the modified lazy-teleporting random walk $Z'$ can be written as follows, where $J$ is an $n \times n$ matrix of ones.

$$Z' = \alpha Z + (1 - \alpha)\frac{1}{n}J \tag{4}$$

Here we adapt the concept of the personalized PageRank vector, originally used for recording (or ranking) personal preferences of a web-surfer toward particular website pages [43], to rank the importance of other nodes (clusters of cells) to a given node, depending on the similarities among nodes (related to $P$ in the graph), and the lazy-teleporting random walk characteristics in the graph (set by probabilities of teleporting and being lazy).
Based on this concept, one could model the likelihood to transit from one node (cluster of cells) to another, and thus construct the pseudotime based on the hitting time, which is the expected number of steps it takes for a random walk that starts at node $i$ to visit node $j$ for the first time. Consider the teleporting probability of $(1 - \alpha)$ and a seed vector $s$ specifying the initial probability distribution across the $n$ nodes (such that $\sum_m s_m = 1$, where $s_m$ is the probability of starting at node $m$). The personalized PageRank vector $\mathrm{pr}_\alpha(s)$ (which is defined as a column vector) is the unique solution to [56]

$$\mathrm{pr}_\alpha(s)^T = \alpha\, \mathrm{pr}_\alpha(s)^T Z + (1 - \alpha)\, s^T. \tag{5}$$

Substituting $Z$ (Eq. (3)) into Eq. (5), we can express the personalized PageRank vector $\mathrm{pr}_\alpha(s)$ in terms of the inverse of the $\beta$-normalized Laplacian, $R_{\beta,NL}$, of the modified random walk (Supplementary Note 2), i.e.

$$\mathrm{pr}_\alpha(s)^T = \beta\, s^T D^{-0.5} R_{\beta,NL} D^{0.5}, \tag{6}$$

where $\beta = \frac{2(1-\alpha)}{(2-\alpha)}$ and $R_{\beta,NL} = \sum_{m=1}^{n} \frac{\Phi_m \Phi_m^T}{\beta + 2x(1-\beta)\eta_m}$. $\Phi_m$ and $\eta_m$ are the $m$-th eigenvector and eigenvalue of the normalized Laplacian. In the expression of $R_{\beta,NL}$, $\beta$ and $x$ regulate the weight of contribution of each eigenvalue-eigenvector pair of the summation such that the first eigenvalue-eigenvector pair (corresponding to the stationary distribution and given by the local-node degree-properties) remains included in the overall expression, but does not overwhelm the global information provided by subsequent 'eigen-pairs'. Moreover, computation of $R_{\beta,NL}$ is not limited to a subset of the first $k$ eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of
the number of cells, but equal to the number of clusters, and hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime.

The expected hitting time from node $q$ to node $r$ is given by [44],

$$h_\alpha(q, r) = \frac{\left[\mathrm{pr}_\alpha(e_r)^T\right](r)}{d_r} - \frac{\left[\mathrm{pr}_\alpha(e_r)^T\right](q)}{d_q} \tag{7}$$

where $e_i$ is an indicator vector with 1 in the $i$-th entry and 0 elsewhere (i.e. $s_m = 1$ if $m = i$ and $s_m = 0$ if $m \neq i$). We can substitute Eq. (6) into Eq. (7), making use of the fact that $\frac{1}{d_r} = \left[D^{-1} e_r\right](r)$, and that $D^{-0.5} R_{\beta,NL} D^{-0.5}$ is symmetric, to obtain a closed form expression of the hitting time in terms of $R_{\beta,NL}$:

$$h_\alpha(q, r) = \beta\, (e_r - e_q)^T D^{-0.5} R_{\beta,NL} D^{-0.5} e_r \tag{8}$$

(ii) MCMC simulation: The hitting time metric computed in sub-step (i) is used to infer graph-directionality. Instead of pruning edges in the 'reverse' direction, edge-weights are biased based on the time difference between nodes using the logistic function with growth factor $b = 1$.

$$f(t) = \frac{1}{1 + e^{-b(t_1 - t_0)}}$$

We then recompute the pseudotimes on the forward-biased graph. Since there is no closed form solution of hitting times on a directed graph, we perform MCMC simulations (processed in parallel to enable fast simulations of 1000s of teleporting, lazy random walks starting at the root node of the cluster graph) and use the first quartile of the simulated pseudotime values for a respective node as the refined pseudotime for that node relative to the root.
This refinement step ensures that the pseudotime is robust to the spurious links (or conversely, links that are too weakly weighted) that can distort calculations based purely on the closed form solution of hitting times (Supplementary Fig. S9d). By using this 2-step pseudotime computation, VIA mitigates the issues of convergence and spurious edge-weights, both of which are common in random-walk pseudotime computation on large and complex datasets [12].

3. Automated terminal-state detection. The algorithm then uses the refined directed and weighted graph (the edges are re-weighted using the refined pseudotimes) to predict which nodes represent the terminal states based on a consensus vote of pseudotime and multiple vertex connectivity properties, including out-degree (i.e. the number of edges directing out from the node), closeness $C(q)$, and betweenness $B(q)$.

$$C(q) = \frac{1}{\sum_{r \neq q} l(q, r)}$$

$$B(q) = \sum_{r \neq q \neq t} \frac{\sigma_{rt}(q)}{\sigma_{rt}}$$

$l(q, r)$ is the distance between node $q$ and node $r$ (i.e. the sum of edges in a shortest path connecting them). $\sigma_{rt}$ is the total number of shortest paths from node $r$ to node $t$. $\sigma_{rt}(q)$ is the number of these paths passing through node $q$.
The consensus vote is performed on nodes that score above (or below for out-degree) the median in terms of connectivity properties. We show on multiple simulated and real biological datasets that VIA more accurately predicts the terminal states across a range of input data dimensions and key algorithm parameters than other methods attempting the same (Supplementary Fig. S16).

4. Automated trajectory reconstruction. VIA then identifies the most likely path of each lineage by computing the likelihood of a node traversing towards a particular terminal state (e.g. differentiation). These lineage likelihoods are computed as the visitation frequency under lazy-teleporting MCMC simulations from the root to a particular terminal state, i.e. the probability of node i reaching terminal-state j is estimated as the number of times cell i is visited along a successful path (i.e. terminal-state j is reached) divided by the number of times cell i is visited along all of the simulations. In contrast to other trajectory reconstruction methods which compute the shortest paths between root and terminal node [1, 2], the lazy-teleporting MCMC simulations in VIA offer a probabilistic view of pathways under relaxed conditions that are not restricted to a random walk along a tree-like graph, but also generalize to other types of topologies, such as cyclic or connected/disconnected paths. In the same vein, we avoid confining the graph to an absorbing Markov chain [13, 3] (AMC), as this places prematurely strict and potentially inaccurate constraints on node-to-node mobility and can impede sensitivity to cell fates (as demonstrated by VIA's superior cell fate detection across numerous datasets; Supplementary Fig. S16).
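As a minimal sketch of Eqs. (1), (3) and (4) and of the visitation-frequency estimate described in Step 4, the Python snippet below builds a lazy-teleporting transition matrix for a small toy cluster graph and estimates lineage likelihoods by simulation. The function names, the toy weight matrix, the values of x and alpha, the number of walks, and the choice to stop each walk at the first terminal state it reaches are all illustrative assumptions rather than VIA's implementation. In particular, the walks here run on the undirected toy graph, not on the forward-biased directed graph of Step 2(ii).

```python
import numpy as np

def lazy_teleporting_matrix(W, x=0.9, alpha=0.9):
    """Return Z' = alpha * Z + (1 - alpha) * J / n, with Z = x * P + (1 - x) * I."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)   # Eq. (1): P = D^{-1} W
    Z = x * P + (1.0 - x) * np.eye(n)      # Eq. (3): lazy with probability 1 - x
    return alpha * Z + (1.0 - alpha) * np.ones((n, n)) / n  # Eq. (4)

def lineage_likelihoods(Zp, root, terminals, n_walks=1000, max_steps=1000, seed=0):
    """Visits of node i on walks that reach terminal j, divided by all visits of i.

    Assumption of this sketch: a walk ends at the first terminal state it hits.
    """
    rng = np.random.default_rng(seed)
    n = Zp.shape[0]
    column = {t: k for k, t in enumerate(terminals)}
    visits_all = np.zeros(n)
    visits_success = np.zeros((n, len(terminals)))
    for _ in range(n_walks):
        node = root
        counts = np.zeros(n)
        counts[node] += 1
        reached = None
        for _step in range(max_steps):
            node = int(rng.choice(n, p=Zp[node]))  # one lazy-teleporting step
            counts[node] += 1
            if node in column:
                reached = column[node]
                break
        visits_all += counts
        if reached is not None:
            visits_success[:, reached] += counts
    # Guard against division by zero for nodes that were never visited.
    return visits_success / np.maximum(visits_all, 1.0)[:, None]

# Hypothetical toy graph: 0-1-2, then 2-3-4 (terminal A) and 2-5-6 (terminal B).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6)]
W = np.zeros((7, 7))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

Z_prime = lazy_teleporting_matrix(W)
likelihood = lineage_likelihoods(Z_prime, root=0, terminals=[4, 6])
print(np.round(likelihood, 2))  # rows: nodes, columns: terminals A and B
```

In this toy setting, nodes on the branch leading to a terminal state receive a higher likelihood for that state than for the other one, while nodes upstream of the branch point are split roughly evenly. Because each walk stops at its first terminal state, each row of `likelihood` sums to at most one.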
Downstream visualization and analysis

VIA generates a visualization that combines the network topology and single-cell level pseudotime/lineage probability properties onto an embedding based on UMAP or PHATE. Generalized additive models (GAMs) are used to draw edges found in the high-dimensional graph onto the lower dimensional visualization (Fig. 1). An unsupervised downstream analysis of cell features (e.g. marker gene expression, protein expression or image phenotype) along pseudotime for each lineage is performed (Fig. 1). Specifically, VIA plots the expression of features across pseudotime for each lineage by using the lineage likelihood properties to weight the GAMs. A cluster-level lineage pathway is automatically produced by VIA to visualize feature heat maps at the cluster-level along a lineage-path to see the regulation of genes. VIA provides the option of gene imputation before plotting the lineage specific gene trends. The imputation is fast as it relies on the single-cell KNN (scKNN) graph computed in Step 1. Using an affinity-based imputation method [45], this step computes a "diffused" transition matrix on the scKNN graph used to impute and denoise the original gene expressions.

Benchmarked Methods

The methods were mainly chosen based on their superior performance in a recent large-scale benchmarking study [4], including a select few recent methods claiming to supersede those in the study. Specifically, recent and popular methods exhibiting reasonable scalability, and automated cell fate prediction in multi-lineage trajectories were favoured as candidates for benchmarking (see Supplementary Table S1 for the key characteristics of methods).
Performance stress-tests in terms of lineage detection of each biological dataset, and pseudotime correlation for time-series data were conducted over a range of key input parameters (e.g. numbers of k-nearest neighbors, highly variable genes (HVGs), and principal components (PCs)) and pre-processing protocols (see Fig. 2m,p, Supplementary Fig. S16). All comparisons were run on a computer with an Intel(R) Xeon(R) W-2123 central processing unit (3.60GHz, 8 cores) and 126 GB RAM.

Quantifying terminal state prediction accuracy for parameter tests was done using the F1-score, defined as the harmonic mean of recall and precision and calculated as:

$$F_1 = \frac{tp}{tp + 0.5\,(fp + fn)}$$

where tp is a true positive: the identification of a terminal cluster that is in fact a final differentiated cell fate; fp is a false positive: the identification of a cluster as terminal when in fact it represents an intermediate state; and fn is a false negative: a known cell fate that fails to be identified.

PAGA [28].
It uses a cluster-graph representation to capture the underlying topology. PAGA computes a unified pseudotime by averaging the single-cell level diffusion pseudotime computed by DPT, but requires manual specification of terminal cell fates and clusters that contribute to lineages of interest in order to compare gene expression trends across lineages.

Palantir [2]. It uses diffusion-map [46] components to represent the underlying trajectory. Pseudotimes are computed as the shortest path along a KNN-graph constructed in a low-dimensional diffusion component space, with edges weighted such that the distance between nodes corresponds to the diffusion pseudotime [47] (DPT). Terminal states are identified as extrema of the diffusion maps that are also outliers of the stationary distribution. The lineage-likelihood probabilities are computed using Absorbing Markov Chains (constructed by removing outgoing edges of terminal states, and thresholding reverse edges).

Slingshot [1]. It is designed to process low-dimensional embeddings of the single-cell data. By default Slingshot runs clustering based on Gaussian mixture modeling and recommends using the first few PCs as input. Slingshot connects the clusters using a minimum spanning tree and then fits principal curves for each detected branch. It uses the orthogonal projection against each principal curve to fit a separate pseudotime for each lineage, and hence the gene expressions cannot be compared across lineages. Also, the runtimes are prohibitively long for large datasets or high input dimensions.

CellRank [13]. This method combines the information of RNA velocity (computed using scVelo [48]) and gene-expression to infer trajectories.
Given that it is mainly suited for scRNA-seq data, with the RNA-velocity computation limiting the overall runtime for larger datasets, we limit our comparison to the pancreatic dataset which the authors of CellRank used to highlight its performance.

Simulated Data

We employed the DynToy [4] (https://github.com/dynverse/dyntoy) package, which generates synthetic single-cell gene expression data (~1000 cells x 1000 'genes'), to simulate different complex trajectory models. Using these datasets, we tested that VIA consistently and more accurately captures both tree and non-tree like structures (multifurcating, cyclic, and disconnected) compared to other methods
(Supplementary Fig. S1). All methods are subject to the same data pre-processing steps, PCA dimension reduction and root-cell to initialize the path.

Multifurcating structure. This dataset consists of 1000 'cells' multifurcating into 4 terminal states. VIA robustly captures all four terminal cell fates across a range of input PCs and the pseudotimes are well inferred relative to the root node (Supplementary Fig. S1a). Note that two terminal states (M2 and M8), which are very close to each other, are easily merged by the other methods (Slingshot, Palantir and PAGA).

Cyclic structure. We ran VIA and other methods for different values of K nearest neighbors. VIA unambiguously shows a cyclic network for a range of K (in KNN). Slingshot does not use a KNN parameter and shows 3 fragmented different lineages (top to bottom). PAGA fails to capture the connected cyclic structure at K = 10 and 5, while Palantir visually shows a linear (K = 10, 30) or disconnected structure (K = 5). Van den Berge et al. [57] note that the challenge of cyclic trajectory reconstruction is also common in other popular methods, such as Monocle3, which consistently fragments or fits branching structures onto cyclic simulated datasets.

Disconnected structure. This dataset comprises two disconnected trajectories (T1 and T2). T1 is cyclic with an extra branch (M5 to M6), and T2 has a bifurcation at M3 (Supplementary Fig. S1c). VIA captures the two disconnected structures as well as the M6 branch in the cyclic structure, and the bifurcation in the smaller structure. PAGA captures the underlying structure at PC = 20 but becomes fragmented for other numbers of PCs.
Palantir also yields multiple fragments and is not able to capture the overall structure, while Slingshot (using the default clustering based on Gaussian mixture modeling) connects T1 and T2, and only captures one of the bifurcations in T1.

Biological Data

The pre-processing steps described below for each dataset are not included in the reported runtimes, as these steps are typically very fast (typically under 1-10% of the total runtime depending on the method, e.g. only a few minutes for pre-processing hundreds of thousands of cells) and only need to be performed once, since they remain the same for all subsequent analyses. It should also be noted that visualizations (e.g. UMAP, t-SNE) are not included in the runtimes. VIA provides a subsampling option at the visualization stage to accelerate this process for large datasets without impacting the previous computational steps. However, to ensure fair comparisons between TI methods (e.g. other methods do not have an option to compute the embedding on a subsampled input and transfer the results between the full trajectory and the sampled visualization, or rely on a slow version of tSNE), we simply provide each TI method with a pre-computed visualization embedding on which the computed results are projected.

ScRNA-seq of mouse pre-B cells. This dataset 26 models the pre-BI cell (Hardy fraction C’) process during which cells progress to the pre-BII stage and B cell progenitors undergo growth arrest and differentiation. Measurements were obtained at 0, 2, 6, 12, 18 and 24 hours (h) for a total of 313 cells x 9,075 genes. We follow a standard Scanpy preprocessing recipe 49 that filters cells with low counts, and genes that occur in fewer than 3 cells. The filtered cells are normalized by library size and log transformed. The top 5000 highly variable genes (HVG) are retained. Cells are renormalized by library count and scaled to unit variance and zero mean.
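The normalization steps just described (library-size normalization, log transform, and scaling each gene to zero mean and unit variance) can be sketched in plain Python as follows. This is only an illustrative sketch and not the Scanpy recipe used for the analysis: the function name `normalize_log_scale` and the `target_sum` value are assumptions made here, and the cell/gene filtering and HVG selection steps are omitted.

```python
import math


def normalize_log_scale(counts, target_sum=1e4):
    """Normalize a cells x genes count matrix (list of lists).

    target_sum is an assumed per-cell total, not a value from the paper.
    """
    # Library-size normalization followed by a log(1 + x) transform.
    logged = []
    for cell in counts:
        total = sum(cell)
        factor = target_sum / total if total > 0 else 0.0
        logged.append([math.log1p(c * factor) for c in cell])

    n_cells = len(logged)
    n_genes = len(logged[0])
    scaled = [[0.0] * n_genes for _ in range(n_cells)]

    # Scale every gene (column) to zero mean and unit variance.
    for g in range(n_genes):
        column = [logged[i][g] for i in range(n_cells)]
        mean = sum(column) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_cells)
        for i in range(n_cells):
            # Constant genes carry no information and are set to 0.
            scaled[i][g] = (column[i] - mean) / std if std > 0 else 0.0
    return scaled


example_counts = [[10, 0, 5], [3, 7, 0], [0, 2, 8], [4, 4, 4]]
scaled_example = normalize_log_scale(example_counts)
```

In practice a dedicated single-cell library would be used for these steps; the sketch only makes the order of operations explicit.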
VIA identifies the terminal state at 18-24 h and accurately recapitulates the gene expression trends 26 along inferred pseudotime of Igll1, Slc7a5, Foxo1, Myc, Ldha and Lig4 (Supplementary Fig. S2a). We show that the results generalize across a range of PCs for two values of K of the graph, with higher accuracy in locating the later cell fates than Slingshot and Palantir (Supplementary Fig. S2b).

ScRNA-seq of human CD34+ bone marrow cells. This is a scRNA-seq dataset of 5800 cells representing human hematopoiesis 2. We used the filtered, normalized and log-transformed count matrix provided by Setty et al. 2, with PCA performed on all the remaining genes. The cells were annotated using SingleR 50, which automatically labeled cells based on the hematopoietic reference dataset Novershtern Hematopoietic Cell Data - GSE24759 51. The annotations are in agreement with the labels inferred by Setty et al. for the 7 clusters, including the root HSCs cluster that differentiates into 6 different lineages: monocytes, erythrocytes, and B cells, as well as the less populous megakaryocytes, cDCs and pDCs.
VIA consistently identifies these lineages across a wider range of input parameters and data dimensions (e.g. the number of K and PCs provided as input to the algorithms; see Fig. 2p and Supplementary Fig. S3c). Notably, the upregulated gene expression trends of the small populations can be recovered in VIA, i.e. pDC and cDC show elevated CD123 and CSF1R levels relative to other lineages, and megakaryocytes show upregulated CD41 expression (Supplementary Fig. S3-S4).

ScRNA-seq of human embryoid body. This is a midsized scRNA-seq dataset of 16,825 human cells in embryoid bodies (EBs) 15. We followed the same pre-processing steps as Moon et al. to filter out dead cells and those with too high or too low a library count. Cells are normalized by library count followed by a square root transform. Finally, the transformed counts are scaled to unit variance and zero mean. The filtered data contained 16,825 cells × 17,580 genes. PCA is performed on the processed data before running each TI method. VIA identifies 6 cell fates, which, based on the upregulation of marker genes as cells proceed towards respective lineages, are in accord with the annotations given by Moon et al. (see the gene heatmap and changes in gene expression along respective lineage trajectories in Supplementary Fig. S5). Note that Palantir and Slingshot do not capture the cardiac cell fate, and Slingshot also misses the neural crest (see the F1-scores summary for terminal state detection, Supplementary Fig. S5).

ScRNA-seq of mouse organogenesis cell atlas. This is a large and complex scRNA-seq dataset of the mouse organogenesis cell atlas (MOCA) consisting of 1.3 million cells 6. The dataset contains cells from 61 embryos spanning 5 developmental stages from early organogenesis (E9.5-E10.5) to organogenesis (E13.5). Of the 2 million cells profiled, 1.3 million are ‘high-quality’ cells that are analysed by VIA.
The runtime is approximately 40 minutes, in stark contrast to the next fastest tool, Palantir, which takes 4 hours (excluding visualization). The authors of MOCA manually annotated 38 cell types based on the differentially expressed genes of the clusters. In general, each cell type falls exclusively under one of 10 major and disjoint trajectories inferred by applying Monocle3 to the UMAP of MOCA. The authors attributed the disconnected nature of the 10 trajectories to the paucity of earlier stage common predecessor cells. We followed the same steps as Cao et al. 6 to retain high-quality cells (i.e. remove cells with less than 400 mRNA, and remove doublet cells and cells from doublet-derived sub-clusters). PCA was applied to the top 2000 HVGs with the top 30 PCs selected for analysis. VIA analyzed the data in the high-dimensional PC space. We bypass the step in Monocle3 6 that applies UMAP on the PCs prior to TI, as this incurs an additional bias from the choice of manifold-learning parameters and a further loss in neighborhood information. As a result, VIA produces a more connected structure with linkages between some of the major cell types that become segregated in UMAP (and hence Monocle3), and favors a biologically relevant interpretation (Fig. 2, Supplementary Fig. S11). A detailed explanation of these connections (graph-edges) extending between certain major groups, using references to literature on organogenesis, is presented in Supplementary Note 3.

ScRNA-seq of murine endocrine development 5.
This is an scRNA-seq dataset of E15.5 murine pancreatic cells spanning all developmental stages from an initial endocrine progenitor-precursor (EP) state (low level of Ngn3, or Ngn3^low), to the intermediate EP (high level of Ngn3, or Ngn3^high) and Fev+ states, to the terminal states of hormone-producing alpha, beta, epsilon and delta cells 5. Following steps by Lange et al. 13, we preprocessed the data using scVelo to filter genes, normalize each cell by total counts over all genes, keep the top most variable genes, and take the log-transform. PCA was applied to the processed gene matrix. We assessed the performance of VIA and other TI methods (CellRank, Palantir, Slingshot) across a range of numbers of retained HVGs and input PCs (Fig. 2m, Supplementary Fig. S6).

ScATAC-seq of human bone marrow cells. This scATAC-seq data profiles 3072 cells isolated from human bone marrow using fluorescence activated cell sorting (FACS), yielding 9 populations 27: HSC, MPP, CMP, CLP, LMPP, GMP, MEP, mono and plasmacytoid DCs (Fig. 3a and Supplementary Fig. S7). We examined TI results for two different preprocessing pipelines to gauge how robust VIA is on scATAC-seq analysis, which is known to be challenging because of its extreme intrinsic sparsity. We used the pre-processed data consisting of PCA applied to the z-scores of the transcription factor (TF) motifs used by Buenrostro et al. 27. Their approach corrects for batch effects in select populations and weights PCs based on reference populations, and hence involves manual curation. We also employed a more general approach used by Chen et al. 31, which employs ChromVAR to compute k-mer accessibility z-scores across cells. VIA infers the correct trajectories and the terminal cell fates for both of these inputs, again across a wide range of input parameters (Fig. 3d and Supplementary Fig. S7).
ScRNA-seq and scATAC-seq of Isl1+ cardiac progenitor cells. This time-series dataset captures murine Isl1+ cardiac progenitor cells (CPCs) from E7.5 to E9.5 characterized by scRNA-seq (197 cells) and scATAC-seq (695 cells) 20. The Isl1+ CPCs are known to undergo multipotent differentiation to cardiomyocytes or endothelial cells. For the scRNA-seq data, the quality-filtered genes and the size-factor normalized expression values are provided by Jia et al. 20 as a “Single Cell Expression Set” object in R. Similarly, the cells in the scATAC-seq experiment were provided in a “SingleCellExperiment” object with low-quality cells excluded from further analysis. The accessibility of peaks was transformed to a binary representation as input for TF-IDF (term frequency-inverse document frequency) weighting prior to singular value decomposition (SVD). The highlighted TF motifs in the heatmap (Fig. 2j) correspond to those highlighted by Jia et al. We tested the performance when varying the number of SVD components used. We also considered the outcome when merging the scATAC-seq and scRNA-seq data using Seurat3 52. Despite the relatively low cell count of both datasets, and the relatively under-represented scRNA-seq cell count, the two datasets overlapped reasonably well and allowed us to infer the expected lineages in an unsupervised manner (Fig. 2d and Supplementary Fig. S8). In contrast, Jia et al. performed a supervised TI by manually selecting cells relevant to the different lineages (for the scATAC-seq cells) and choosing the two diffusion components that best characterize the developmental trajectories in low dimension 20.
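The binarization and TF-IDF weighting applied to the scATAC-seq peak matrix above can be illustrated with a small pure-Python sketch. Several TF-IDF variants are in use and the text does not state which one was applied, so the formulas below (term frequency as the fraction of a cell's accessible peaks, inverse document frequency as log(1 + n_cells / n_cells_with_peak)) are an assumption, as are the function names `binarize` and `tf_idf`. The subsequent SVD step is not shown.

```python
import math


def binarize(matrix):
    """Turn a cells x peaks accessibility matrix into 0/1 values."""
    return [[1 if v > 0 else 0 for v in row] for row in matrix]


def tf_idf(binary):
    """Weight a binary cells x peaks matrix with one common TF-IDF variant."""
    n_cells = len(binary)
    n_peaks = len(binary[0])

    # Number of cells in which each peak is accessible.
    peak_counts = [sum(binary[i][j] for i in range(n_cells))
                   for j in range(n_peaks)]
    # Rare peaks receive a larger weight; peaks seen in no cell get 0.
    idf = [math.log(1.0 + n_cells / c) if c > 0 else 0.0 for c in peak_counts]

    weighted = []
    for row in binary:
        total = sum(row)  # number of accessible peaks in this cell
        weighted.append([
            (v / total) * idf[j] if total > 0 else 0.0
            for j, v in enumerate(row)
        ])
    return weighted


example_peaks = [[3, 0, 1], [0, 0, 2], [1, 0, 0]]
example_weighted = tf_idf(binarize(example_peaks))
```

The weighted matrix would then be passed to a truncated SVD, with the number of retained components treated as a tunable input.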
Mass cytometry data of mouse embryonic stem cells (mESC). This is a mass cytometry (or CyTOF) dataset, consisting of 90,000 cells and 28 antibodies (corresponding to ~7000 cells each from Day 0-11 measurements), that represents differentiation of mESC to mesoderm cells 32. An arcsinh transform with a scaling factor of 5, a standard procedure for CyTOF datasets, was applied to all features, followed by normalization to unit variance and zero mean. Given the small feature set, no PCA is required (Supplementary Fig. S9).
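As a minimal sketch of the CyTOF transform described above, the snippet below applies an arcsinh transform and then standardizes a feature. It assumes the scaling factor of 5 acts as a divisor (arcsinh(x / 5), the usual CyTOF convention); the function names `arcsinh_transform` and `standardize` are illustrative and not taken from the VIA code.

```python
import math


def arcsinh_transform(values, cofactor=5.0):
    """Apply arcsinh(x / cofactor) to one marker's intensities."""
    return [math.asinh(v / cofactor) for v in values]


def standardize(values):
    """Scale a feature to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # A constant feature is mapped to all zeros.
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


raw_marker = [0.0, 5.0, 50.0, 500.0]
processed_marker = standardize(arcsinh_transform(raw_marker))
```

The transform is close to linear near zero and logarithmic for large intensities, which is why it is commonly preferred over a plain log for cytometry data containing zeros or negative values.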
VIA identifies 3 main terminal states corresponding to Day 11 and Day 10. Palantir, on the other hand, identifies three terminal states that all correspond to days in the first half of the experiment, and its pseudotime is heavily influenced by the root node being very weakly connected to the other stages of the process. Slingshot appears to capture the overall pseudotime, but the 6 lineages imposed onto the low-dimensional representation are difficult to interpret and distinguish. To improve Palantir performance we used 5000 waypoints, but this takes almost 20 minutes to complete (excluding the time taken for embedding the visualization). VIA runs in ~3 minutes and produces results consistent with the known ordering. The pseudotime reflects the range of days very well, even capturing the small population of Day 11 cells on the left-hand side of the Day 6 cells in the embedding (Fig. 3 and Supplementary Fig. S9).

Single-cell biophysical phenotypes derived from imaging flow cytometry. This is the in-house dataset of single-cell biophysical phenotypes of two different human breast cancer types (MDA-MB231 and MCF7). Following our recent image-based biophysical phenotyping strategy 53, 54, we defined the spatially-resolved biophysical features of a cell in a hierarchical manner based on both bright-field and quantitative phase images captured by the FACED imaging flow cytometer (i.e., from the bulk features to the subcellular textures). At the bulk level, we extracted the cell size, dry mass density, and cell shape. At the subcellular texture level, we parameterized the global and local textural characteristics of optical density and mass density at both the coarse and fine scales (e.g., local variation of mass density, its higher-order statistics, phase entropy radial distribution etc.).
This hierarchical phenotyping approach 53, 54 allowed us to establish a single-cell biophysical profile of 38 features, which were normalized based on the z-score (see Supplementary Table S4 and Table S5). All these features, without any PCA, are used as input to VIA. To weight the features, we use a mutual information classifier to rank them based on the integrated fluorescence intensity of the fluorescence FACED images of the cells (which serves as the ground truth of the cell-cycle stages). Following normalization, the top 3 features (which relate to cell size) are weighted (using a factor between 3 and 10).

Imaging flow cytometry experiment

FACED imaging flow cytometer setup

A multimodal FACED imaging flow cytometry (IFC) platform was used to obtain the quantitative phase and fluorescence images of single cells in microfluidic flow at an imaging throughput of ~70,000 cells/sec.
The light source consisted of an Nd:YVO4 picosecond laser (center wavelength = 1064 nm, Time-Bandwidth) and a periodically-poled lithium niobate (PPLN) crystal (Covesion) for second harmonic generation of a green pulsed beam (center wavelength = 532 nm) with a repetition rate of 20 MHz. The beam was then directed to the FACED module, which mainly consists of a pair of almost-parallel plane mirrors. This module generated a linear array of 50 beamlets (foci), which were projected by an objective lens (40X, 0.6 NA, MRH08430, Nikon) onto the flowing cells in the microfluidic channel for imaging. Each beamlet was designed to have a time delay of 1 ns relative to the neighboring beamlet in order to minimize the fluorescence crosstalk due to the fluorescence decay. The detailed configuration of the FACED module is described by Wu et al. 33. The epi-fluorescence image signal was collected by the same objective lens and directed through a band-pass dichroic beamsplitter (center: 575 nm, bandwidth: 15 nm). The filtered orange fluorescence signal was collected by the photomultiplier tube (PMT) (rise time: 0.57 ns, Hamamatsu). The light transmitted through the cell, on the other hand, was collected by another objective lens (40X, 0.8 NA, MRD07420, Nikon). The light was then split equally by the 50:50 beamsplitter into two paths, each of which encodes different phase-gradient image contrasts of the same cell (a concept similar to Schlieren photography 55). The two beams were combined, time-interleaved, and directed to the photodetector (PD) (bandwidth: >10 GHz, Alphalas) for detection. The signals obtained from both PMT and PD were then passed to a real-time high-bandwidth digitizer (20 GHz, 80 GS/s, Lecroy) for data recording.

Cell culture and preparation

MDA-MB231 (ATCC) and MCF7 (ATCC), which are two different breast cancer cell lines, were used for the cell cycle study.
The culture medium for MDA-MB231 was ATCC modified RPMI 1640 (Gibco) supplemented with 10% fetal bovine serum (FBS) (Gibco) and 1% antibiotic-antimycotic (Anti-Anti) (Gibco), while that for MCF7 was DMEM supplemented with 10% FBS (Gibco) and 1% Anti-Anti (Gibco). The cells were cultured inside an incubator under 5% CO2 and 37°C, and subcultured twice a week. 1e6 cells were pipetted out from each cell line and stained with Vybrant DyeCycle orange stain (Invitrogen).

Data Availability

Data used in Figures 1-3, as well as Supplementary Figures S1-S15, are available at:
1. Pancreatic data: Gene Expression Omnibus (GEO) under accession code GSE132188.
2. Cardiac progenitor data: ENA repository under the accession code PRJEB23303, or from https://github.com/loosolab/cardiac-progenitors
3. B-cell: STATegraData GitHub repository, https://github.com/STATegraData/STATegraData
4. Mass cytometry mesoderm: Cytobank, https://community.cytobank.org/cytobank/experiments/71953
5. scRNA-seq Human Hematopoiesis: raw and processed data are available through the Human Cell Atlas data portal at https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45
6. Embryoid Body: Mendeley Data repository at https://doi.org/10.17632/v6n743h5ng.1
7. Mouse Organogenesis: NCBI Gene Expression Omnibus under accession number GSE119945.
8. FACED cell cycle: https://github.com/ShobiStassen/VIA and on FigShare, https://doi.org/10.6084/m9.figshare.13601405.v1
9. scATAC-seq Hematopoiesis: GEO: GSE96772. Processed scATAC-seq data, which include PC values and TF scores per cell, can be found in Data S1 of https://doi.org/10.1016/j.cell.2018.03.074
10. Toy Data: https://github.com/ShobiStassen/VIA

Code Availability

VIA is available as a pip-installable Python library “pyVIA”, with tutorials and sample data available on https://github.com/ShobiStassen/VIA and https://pypi.org/project/pyVIA/

References

1. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
2. Setty M, Kiseliovas V, Levine J, Gayoso A, Mazutis L, Pe'er D. Characterization of cell fate probabilities in single-cell data with Palantir [published correction appears in Nat Biotechnol. 2019 Oct;37(10):1237]. Nat Biotechnol. 2019;37(4):451-460. doi:10.1038/s41587-019-0068-4
3. Qiu, X., Mao, Q., Tang, Y. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14, 979–982 (2017). https://doi.org/10.1038/nmeth.4402
4. Saelens, W., Cannoodt, R., Todorov, H. et al. A comparison of single-cell trajectory inference methods. Nat Biotechnol 37, 547–554 (2019). https://doi.org/10.1038/s41587-019-0071-9
5. Bastidas-Ponce, A. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 146 (2019).
6. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
7. Packer, J. S. et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science 365, eaax1971 (2019).
8. Cao, J., Spielmann, M., Qiu, X. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
9. Briggs, J. A. et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science 360, eaar5780 (2018).
10. Litviňuková, M., Talavera-López, C., Maatz, H. et al. Cells of the adult human heart. Nature (2020).
11. Stassen SV, Siu DMD, Lee KCM, Ho JWK, So HKH, Tsia KK. PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells. Bioinformatics. 2020 May 1;36(9):2778-2786. doi: 10.1093/bioinformatics/btaa042.
12. Ulrike von Luxburg, Agnes Rad, Matthias Hein. Hitting and Commute Times in Large Random Neighborhood Graphs. Journal of Machine Learning Research 15, 1751-1798 (2014).
13. Marius Lange, Volker Bergen, Michal Klein, Manu Setty, Bernhard Reuter, Mostafa Bakhti, Heiko Lickert, Meshal Ansari, Janine Schniering, Herbert B. Schiller, Dana Pe’er, Fabian J. Theis. CellRank for directed single-cell fate mapping. bioRxiv 2020.10.19.345983; doi: https://doi.org/10.1101/2020.10.19.345983
14. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Software. 3, 861 (2018).
15. Moon, K.R., van Dijk, D., Wang, Z. et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 37, 1482–1492 (2019). https://doi.org/10.1038/s41587-019-0336-3
16. Tam PP, Behringer RR. Mouse gastrulation: the formation of a mammalian body plan. Mech Dev. 1997;68(1-2):3-25. doi:10.1016/s0925-4773(97)00123-8
17. Chin AM, Hill DR, Aurora M, Spence JR. Morphogenesis and maturation of the embryonic and postnatal intestine. Semin Cell Dev Biol. 2017 Jun;66:81-93. doi: 10.1016/j.semcdb.2017.01.011. Epub 2017 Feb 1.
18. Gilbert SF. Developmental Biology. 6th edition. Sunderland (MA): Sinauer Associates; 2000. The Neural Crest. Available from: https://www.ncbi.nlm.nih.gov/books/NBK10065/
19. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature (2019). https://doi.org/10.1038/s41586-019-1629-x
20. Jia G, Preussner J, Chen X, Guenther S, Yuan X, Yekelchyk M, Kuenne C, Looso M, Zhou Y, Teichmann S, Braun T. Single cell RNA-seq and ATAC-seq analysis of cardiac progenitor cell transition states and lineage settlement. Nat Commun. 2018 Nov 19;9(1):4877.
21. Tanya E. Foley, Bradley Hess, Joanne G. A. Savory, Randy Ringuette, David Lohnes. Role of Cdx factors in early mesodermal fate decisions. Development 2019 146: dev170498. doi: 10.1242/dev.170498. Published 1 April 2019.
22. Yao Y, Yao J, Boström KI. SOX Transcription Factors in Endothelial Differentiation and Endothelial-Mesenchymal Transitions. Front Cardiovasc Med. 2019;6:30. Published 2019 Mar 28. doi:10.3389/fcvm.2019.00030
23. Potta SP, Liang H, Winkler J, Doss MX, Chen S, Wagh V, Pfannkuche K, Hescheler J, Sachinidis A. Isolation and functional characterization of alpha-smooth muscle actin expressing cardiomyocytes from embryonic stem cells. Cell Physiol Biochem. 2010;25(6):595-604. doi: 10.1159/000315078. Epub 2010 May 18. PMID: 20511704.
24. Warkman AS, Whitman SA, Miller MK, Garriock RJ, Schwach CM, Gregorio CC, Krieg PA. Developmental expression and cardiac transcriptional regulation of Myh7b, a third myosin heavy chain in the vertebrate heart. Cytoskeleton (Hoboken). 2012 May;69(5):324-35. doi: 10.1002/cm.21029. Epub 2012 Apr 30. Erratum in: Cytoskeleton (Hoboken). 2012 Dec;69(12):1086. PMID: 22422726; PMCID: PMC4734749.
25. Mahmoud AI, Kocabas F, Muralidhar SA, et al. Meis1 regulates postnatal cardiomyocyte cell cycle arrest. Nature. 2013;497(7448):249-253. doi:10.1038/nature12054
26. Gomez-Cabrero, D., Tarazona, S., Ferreirós-Vidal, I. et al. STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse. Sci Data 6, 256 (2019). https://doi.org/10.1038/s41597-019-0202-7
27. Jason D. Buenrostro, M. Ryan Corces, Caleb A. Lareau, Beijing Wu, Alicia N. Schep, Martin J. Aryee, Ravindra Majeti, Howard Y. Chang, William J. Greenleaf. Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation. Cell 173, 1535-1548.e16 (2018). https://doi.org/10.1016/j.cell.2018.03.074
28. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
29. Gutierrez GD, Gromada J, Sussel L. Heterogeneity of the Pancreatic Beta Cell. Front Genet. 2017;8:22. Published 2017 Mar 6. doi:10.3389/fgene.2017.00022
30. Krentz NAJ, Lee MYY, Xu EE, Sproul SLJ, Maslova A, Sasaki S, Lynn FC. Single-Cell Transcriptome Profiling of Mouse and hESC-Derived Pancreatic Progenitors. Stem Cell Reports. 2018 Dec 11;11(6):1551-1564. doi: 10.1016/j.stemcr.2018.11.008. PMID: 30540962; PMCID: PMC6294286.
31. Chen, H., Lareau, C., Andreani, T. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol 20, 241 (2019). https://doi.org/10.1186/s13059-019-1854-5
32. Ko, M.E., Williams, C.M., Fread, K.I. et al. FLOW-MAP: a graph-based, force-directed layout algorithm for trajectory mapping in single-cell time course datasets. Nat Protoc 15, 398–420 (2020). https://doi.org/10.1038/s41596-019-0246-3
33. Wu J. L., Xu Y. Q., Xu J. J., Wei X. X., Chan A. C. S., Tang A. H. L., Lau A. K. S., Chung B. M. F., Cheung Shum H., Lam E. Y., Wong K. K. Y., Tsia K. K. “Ultrafast laser-scanning time-stretch imaging at visible wavelengths,” Light Sci. Appl. 6(1), e16196 (2016). doi:10.1038/lsa.2016.196
34. Popescu G, Park Y, Lue N, Best-Popescu C, Deflores L, Dasari RR, Feld MS, Badizadegan K. Optical imaging of cell mass and growth dynamics. Am J Physiol Cell Physiol. 2008 Aug;295(2):C538-44. doi: 10.1152/ajpcell.00121.2008. Epub 2008 Jun 18.
35. Kyoohyun Kim, Jochen Guck. The Relative Densities of Cytoplasm and Nuclear Compartments Are Robust against Strong Perturbation. Biophysical Journal. Volume 119, Issue 10, 17 November 2020, Pages 1946-1957.
36. Kafri R, Levy J, Ginzberg MB, Oh S, Lahav G, Kirschner MW. Dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. Nature. 2013 Feb 28;494(7438):480-3. doi: 10.1038/nature11897. PMID: 23446419; PMCID: PMC3730528.
37. Park SR, Namkoong S, Friesen L, Cho CS, Zhang ZZ, Chen YC, Yoon E, Kim CH, Kwak H, Kang HM, Lee JH. Single-Cell Transcriptome Analysis of Colon Cancer Cell Response to 5-Fluorouracil-Induced DNA Damage. Cell Rep. 2020 Aug 25;32(8):108077. doi: 10.1016/j.celrep.2020.108077.
38. Zangle TA, Teitell MA. Live-cell mass profiling: an emerging approach in quantitative biophysics. Nat Methods. 2014 Dec;11(12):1221-8. doi: 10.1038/nmeth.3175. PMID: 25423019; PMCID: PMC4319180.
39. Tse HT, Gossett DR, Moon YS, Masaeli M, Sohsman M, Ying Y, Mislick K, Adams RP, Rao J, Di Carlo D. Quantitative diagnosis of malignant pleural effusions by single-cell mechanophenotyping. Sci Transl Med. 2013 Nov 20;5(212):212ra163. doi: 10.1126/scitranslmed.3006559. PMID: 24259051.
40. Otto, O., Rosendahl, P., Mietke, A. et al. Real-time deformability cytometry: on-the-fly cell mechanical phenotyping. Nat Methods 12, 199–202 (2015). https://doi.org/10.1038/nmeth.3281
41. Kimmerling, R.J., Prakadan, S.M., Gupta, A.J. et al. Linking single-cell measurements of mass, growth rate, and gene expression. Genome Biol 19, 207 (2018). https://doi.org/10.1186/s13059-018-1576-0
42. Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019). https://doi.org/10.1038/s41598-019-41695-z
43. Langville, Amy N., and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
44. Chung F., Zhao W. (2010) PageRank and Random Walks on Graphs. In: Katona G.O.H., Schrijver A., Szőnyi T., Sági G. (eds) Fete of Combinatorics and Computer Science. Bolyai Society Mathematical Studies, vol 20. Springer, Berlin, Heidelberg.
45. van Dijk D, Sharma R, Nainys J, et al. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell. 2018;174(3):716-729.e27. doi:10.1016/j.cell.2018.05.061
46. Coifman, R. R. et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl Acad. Sci. USA 102, 7426–7431 (2005).
47. Haghverdi L, Büttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods. 2016;13(10):845-848. doi:10.1038/nmeth.3971
48. Bergen, V., Lange, M., Peidli, S. et al. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 38, 1408–1414 (2020). https://doi.org/10.1038/s41587-020-0591-3
49. Zheng GX, Terry JM, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017 Jan 16;8:14049. doi: 10.1038/ncomms14049.
50. Aran D et al., (2019).
“Reference-based analysis of lung single-cell sequencing reveals a 789 transitional profibrotic macrophage.” Nat. Immunol., 20, 163-172 790 51. Novershtern N. et al., Densely interconnected transcriptional circuits control cell states in human 791 hematopoiesis. Cell. 2011 Jan 21;144(2):296-309. 792 52. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, 793 Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019 794 Jun13;177(7):1888-1902.e21. doi: 10.1016/j.cell.2019.05.031. Epub 2019 Jun 6. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1038/nmeth.3281 https://doi.org/10.1186/s13059-018-1576-0 https://doi.org/10.1038/s41598-019-41695-z https://doi.org/10.1038/s41587-020-0591-3 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 795 53. Siu, KCM Lee, MCK Lo, SV Stassen, M Wang, IZQ Zhang, HKH So. Deep-learning-assisted 796 biophysical imaging cytometry at massive throughput delineates cell population heterogeneity. 797 Lab on a Chip 20 (20), 3696-3708 798 54. KCM Lee, M Wang, KSE Cheah, GCF Chan, HKH So, KKY Wong, KK Tsia.. Quantitative 799 phase imaging flow cytometry for ultra‐large‐scale single‐cell biophysical phenotyping. 800 Cytometry Part A 95 (5), 510-520 801 55. Wenwei Yan Jianglai Wu Kenneth K. Y. Wong Kevin K. Tsia, A high‐throughput all‐optical 802 laser‐scanning imaging flow cytometer with biomolecular specificity and subcellular resolution, 803 J. Biophotonics (2017) https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201700178 804 56. F. Chung and S.-T. Yau, Discrete Green’s Functions. 
Journal of Combinatorial Theory, Series A, 805 91(1-2) (2000), pp. 191–214 806 57. Van den Berge, K., Roux de Bézieux, H., Street, K. et al. Trajectory-based differential expression 807 analysis for single-cell sequencing data. Nat Communications 808 58. Yury A. Malkov, D. Yashunin. Efficient and Robust Approximate Nearest Neighbor Search Using 809 Hierarchical Navigable Small World Graph, Computer Science, Medicine, Mathematics, IEEE 810 Transactions on Pattern Analysis and Machine Intelligence, 2020 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201700178 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ 10_1101-2021_02_11_430695 ---- Learning Sparse Log-Ratios for High-Throughput Sequencing Data Learning Sparse Log-Ratios for High-Throughput Sequencing Data Elliott Gordon-Rodriguez 1 Thomas P. Quinn 2 John P. Cunningham 1 Abstract The automatic discovery of interpretable features that are associated to an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between sub- sets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of dis- crete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing meth- ods. 
As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.

¹ Department of Statistics, Columbia University. ² Applied Artificial Intelligence Institute (A2I2), Deakin University. Correspondence to: Elliott Gordon-Rodriguez.

1. Introduction

Much recent work has been devoted to designing differentiable relaxations of discrete latent variables. These relaxations can be used to learn class membership (Jang et al., 2016; Maddison et al., 2017; Potapczynski et al., 2020), permutations (Linderman et al., 2018; Mena et al., 2018), subsets (Xie & Ermon, 2019; Yang et al., 2019), and rankings (Cuturi et al., 2019; Blondel et al., 2020). Depending on their use case, existing methods range in complexity, from the simple-but-effective straight-through estimator (Bengio et al., 2013), to mathematically intricate schemes based on optimal transport (Xie et al., 2020). However, the driving principle is always the same: to enable efficient gradient-based optimization on an otherwise intractable discrete space.

The goal of our work is to extend this principle to a novel setting where, to the best of our knowledge, no differentiable relaxations have yet been proposed. Motivated by domain applications, our objective is to select log-ratios from a set of covariates, a problem that is equivalent to a discrete optimization over pairs of disjoint subsets (where the pair represents the numerator and denominator of the ratio, respectively). Our novel relaxation will result in dramatic speedups over several recent state-of-the-art learning algorithms from the field of bioinformatics, thereby enabling the analysis of much larger datasets than previously possible.

Log-ratios are an important class of features for analyzing high-throughput sequencing (HTS) metagenomic data (Wooley et al., 2010; Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018). For example, in microbiome count data, the relative weight between two sub-populations of related microorganisms can serve as a clinically useful biomarker (Rahat-Rozenbloom et al., 2014; Crovesy et al., 2020; Magne et al., 2020). More generally, log-ratios are of fundamental importance to the field of Compositional Data (CoDa), of which HTS data can be seen as a special case (Pawlowsky-Glahn & Egozcue, 2006; Pawlowsky-Glahn & Buccianti, 2011). CoDa can be defined as simplex-valued data, or equivalently, non-negative vectors whose totals are uninformative, i.e., relative data. Due to the nature of the recording technique, HTS data represents the relative abundance of different microbial signatures in a given sample, and therefore is an instance of CoDa (Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018). Indeed, the application of CoDa methodology to HTS data has become increasingly popular in recent years (Fernandes et al., 2013; 2014; Rivera-Pinto et al., 2018; Quinn et al., 2019; Calle, 2019), with log-ratios serving as the basic building blocks for statistical analysis.

But why do log-ratios form the basis of CoDa methodology? Unlike unconstrained real-valued data, the relative nature of HTS data and CoDa results in each covariate becoming negatively correlated with all others (increasing one component of a composition implies a relative decrease of the other components). It is well known that, as a result, the usual measures of association and feature attribution are problematic when applied to CoDa (Pearson, 1896; Filzmoser et al., 2009; Van den Boogaart & Tolosana-Delgado, 2013; Lovell et al., 2015).
Log-ratios account for this idiosyncratic structure by transforming CoDa onto unconstrained feature space, where the usual tools of statistical learning apply (Aitchison, 1982; Pawlowsky-Glahn & Egozcue, 2006). The choice of the log-ratio transform offers the necessary property of scale invariance, but in the CoDa literature it holds primacy for a variety of other technical reasons, including so-called subcompositional coherence (Aitchison, 1982; Pawlowsky-Glahn & Buccianti, 2011; Egozcue & Pawlowsky-Glahn, 2019). Log-ratios can be taken over pairs of individual covariates (Aitchison, 1982; Greenacre, 2019b) or aggregations thereof, typically geometric means (Aitchison, 1982; Egozcue et al., 2003; Egozcue & Pawlowsky-Glahn, 2005; Rivera-Pinto et al., 2018; Quinn & Erb, 2019) or summations (Greenacre, 2019a; 2020; Quinn & Erb, 2020). The resulting features work well empirically, but also imply a clear interpretation: a log-ratio is a single composite score that expresses the overall quantity of one sub-population as compared with another. When the log-ratios are sparse, meaning they are taken over a small number of covariates, they define biomarkers that are particularly intuitive to understand, a key desideratum for predictive models that are of clinical relevance (Goodman & Flaxman, 2017).

Thus, learning sparse log-ratios is a central problem in CoDa. This problem is especially challenging in the context of HTS data, due to its high dimensionality (ranging from 100 to over 10,000 covariates). Existing methods rely on stepwise search or evolutionary algorithms (Rivera-Pinto et al., 2018; Greenacre, 2019b; Quinn & Erb, 2020; Prifti et al., 2020), which scale very poorly with the dimension of the input.
These algorithms are prohibitively slow for most HTS datasets, and thus there is a new demand for sparse and interpretable models that scale to high dimensions (Li, 2015; Cammarota et al., 2020; Susin et al., 2020).

This demand motivates the present work, in which we present CoDaCoRe,¹ a novel learning algorithm for Compositional Data via Continuous Relaxations. The key idea behind CoDaCoRe is to approximate a combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates), by means of a continuous relaxation that can be optimized efficiently using gradient descent. To the best of our knowledge, CoDaCoRe is the first CoDa method that scales to high dimensions, and that simultaneously produces sparse, interpretable, and accurate models.

¹ Our implementation can be downloaded from https://github.com/cunningham-lab/codacore.

The main contributions of our method can be summarized as follows:

• Computational efficiency. CoDaCoRe scales linearly with the dimension of the input. It runs several orders of magnitude faster than its competitors.
• Interpretability. CoDaCoRe identifies a set of log-ratios that are sparse, biologically meaningful, and ranked in order of importance. Our model is highly interpretable, and much sparser, relative to competing methods of similar accuracy and computational complexity.
• Predictive accuracy. CoDaCoRe achieves better out-of-sample accuracy than existing CoDa methods, and performs similarly to state-of-the-art black-box classifiers (which are neither sparse nor interpretable).
• Optimization robustness. We leverage the functional form of our continuous relaxation to identify an adaptive learning rate that enables CoDaCoRe to converge reliably, requiring no additional hyperparameter tuning when deployed on novel datasets.

2. Background

Our work focuses on the supervised learning problem with compositional predictors.
Namely, we are given data $\{x_i, y_i\}_{i=1}^n$, where $x_i$ is compositional (e.g., HTS data), and our goal is to learn an association $x_i \mapsto y_i$. For many microbiome applications, $x_i$ represents a vector of frequencies of the different species of bacteria that compose the microbiome of the $i$th subject. In other words, $x_{ij}$ denotes the abundance of the $j$th species (of which there are $p$ total) in the $i$th subject. The response $y_i$ is a binary variable indicating whether the $i$th subject belongs to the case or the control group (e.g., sick vs. healthy). Due to the nature of HTS, the input frequencies $x_{ij}$ arise from an inexhaustive sampling procedure, so that the totals $\sum_{j=1}^p x_{ij}$ are arbitrary and the components should only be interpreted in relative terms (i.e., as CoDa) (Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018; Calle, 2019). While we mainly consider applications to microbiome data, our method applies more generally to any high-dimensional CoDa, including those produced by Liquid Chromatography Mass Spectrometry (Filzmoser & Walczak, 2014).

In order to account for the compositional nature of $x_i$, we seek log-ratio transformed features that can be passed to a regression function downstream. As discussed, these log-ratios will result in interpretable features and scale-invariant models (that are also subcompositionally coherent). The simplest such choice is to take the pairwise log-ratios between input variables, i.e., $\log(x_{ij^+} / x_{ij^-})$, where $(j^+, j^-)$ indexes a pair of covariates (Aitchison, 1982). Note that the ratio cancels out any scaling factor applied to $x_i$, preserving only the relative information in the data, while the log transform ensures the output is (unconstrained) real-valued. In order to select a good pair $(j^+, j^-)$ from the input covariates, Greenacre (2019b) proposed a step-wise algorithm for identifying pairwise log-ratios that explain the most variation in a dataset.
This algorithm produces a sparse and interpretable set of features, but it is prohibitively slow on high-dimensional datasets, as a result of the step-wise search scaling poorly in the dimension of the input. A heuristic search algorithm that is less accurate but computationally faster has been developed as part of Quinn et al. (2017), though its computational cost is still troublesome (as we shall see in Section 4).

2.1. Balances

Recently, a class of log-ratios known as balances (Egozcue & Pawlowsky-Glahn, 2005) have become of interest in microbiome applications, due to their interpretability as the relative weight between two sub-populations of bacteria (Morton et al., 2019b; Quinn & Erb, 2019). Balances are defined as the log-ratios between geometric means of two subsets of the covariates:²

$$B(x_i; J^+, J^-) = \log\left( \frac{\left(\prod_{j \in J^+} x_{ij}\right)^{1/p_+}}{\left(\prod_{j \in J^-} x_{ij}\right)^{1/p_-}} \right) = \frac{1}{p_+} \sum_{j \in J^+} \log x_{ij} - \frac{1}{p_-} \sum_{j \in J^-} \log x_{ij}, \tag{1}$$

where $J^+$ and $J^-$ denote a pair of disjoint subsets of the indices $\{1, \dots, p\}$, and $p_+$ and $p_-$ denote their respective sizes. For example, in microbiome data, $J^+$ and $J^-$ are groups of bacteria species that may be related by their environmental niche (Morton et al., 2017) or genetic similarity (Silverman et al., 2017; Washburne et al., 2017). Note that when $p_+ = p_- = 1$ (i.e., $J^+$ and $J^-$ each contain a single element), $B(x; J^+, J^-)$ reduces to a pairwise log-ratio. By allowing for the aggregation of more than one covariate in the numerator and denominator of the log-ratio, balances provide a richer set of features that allows for more flexible models than pairwise log-ratios. Insofar as the balances are taken over a small number of covariates (i.e., $J^+$ and $J^-$ are sparse), they also provide highly interpretable biomarkers.

The selbal algorithm (Rivera-Pinto et al., 2018) has gained popularity as a method for automatically identifying balances that predict a response variable.
However, this algorithm is also based on a step-wise search through the combinatorial space of subset pairs $(J^+, J^-)$, which scales poorly in the dimension of the input and becomes prohibitively slow for HTS data (Susin et al., 2020).

² Note that the original definition of balances includes a "normalization" constant, which we omit for clarity. This constant is in fact unnecessary, as it will get absorbed into a regression coefficient downstream.

2.2. Amalgamations

An alternative to balances, known as amalgamations, is defined by aggregating components through summation:

$$A(x_i; J^+, J^-) = \log\left( \frac{\sum_{j \in J^+} x_{ij}}{\sum_{j \in J^-} x_{ij}} \right), \tag{2}$$

where again $J^+$ and $J^-$ denote disjoint subsets of the input components. Amalgamations have the advantage of reducing the dimensionality of the data through an operation, the sum, that some authors argue is more interpretable than a geometric mean (Greenacre, 2019a; Greenacre et al., 2020). On the other hand, amalgamations can be less effective than balances for identifying components that are statistically important, but small in magnitude, e.g., rare bacteria species (since small terms will have less impact on a summation than on a product).

Recently, Greenacre (2020) has advocated for the use of expert-driven amalgamations, using domain knowledge to construct the relevant features. On the other hand, Quinn & Erb (2020) proposed amalgam, an evolutionary algorithm to automatically identify amalgamated log-ratios (Eq. 2) that are predictive of a response variable. However, this algorithm does not scale to high-dimensional data (albeit comparing favorably to selbal), nor does it produce sparse models (hindering interpretability of the results).

2.3. Other Related Work

CoDa methodology has also recently attracted interest from the machine learning community (Tolosana-Delgado et al., 2019; Quinn et al., 2020; Gordon-Rodriguez et al., 2020a;b; Templ, 2020).
Relevant to us is DeepCoDA (Quinn et al., 2020), which combines self-explaining neural networks with log-ratio transformed features. In particular, DeepCoDA learns a set of log-contrasts, in which the numerator and denominator are defined as unequally weighted geometric averages of components. As a result of this weighting, DeepCoDA loses much of the interpretability and intuitive appeal of balances (or amalgamations), which is exacerbated by its lack of sparsity (in spite of regularization). Moreover, like most deep architectures, DeepCoDA is sensitive to initialization and optimization hyperparameters (which limits its ease of use) and is susceptible to overfitting (which can further compromise interpretability of the model).

The special case of a linear log-contrast model has been referred to as Coda-lasso, and was separately proposed by Lu et al. (2019). While Coda-lasso scales better than selbal, it has been found to perform worse in terms of predictive accuracy (Susin et al., 2020). More importantly, Coda-lasso is still prohibitively slow on the high-dimensional HTS data that we wish to consider. Last, we highlight another common set of features that are also a special case of log-contrasts: centered-log-ratios, where individual covariates are divided by the geometric mean of all input variables (Aitchison, 1982). Models using these features, such as Susin et al. (2020), can be accurate and computationally efficient; however, they are inherently not sparse and are difficult to interpret scientifically (Greenacre, 2019a).

Table 1. Qualitative comparison of the methods discussed, ordered from most sparse (top) to least (bottom). CoDaCoRe is the only learning algorithm that performs well on all of our criteria. See Table 2 for a corresponding quantitative comparison.
Method                                   Scalability   Interpretability   Sparsity   Accuracy
CoDaCoRe (ours)                          +             +                  +          +
Pairwise log-ratios (Greenacre, 2019b)   −             +                  +          −
Selbal (Rivera-Pinto et al., 2018)       −             +                  +          ·
Lasso                                    +             ·                  +          −
Coda-lasso (Lu et al., 2019)             −             ·                  ·          ·
Amalgam (Quinn & Erb, 2020)              −             +                  −          ·
DeepCoDA (Quinn et al., 2020)            ·             ·                  −          ·
CLR-lasso (Susin et al., 2020)           +             −                  −          +
Black-box (Random Forest, XGBoost)       +             −                  −          +

3. Methods

We now present CoDaCoRe, a novel learning algorithm for HTS data, and more generally, high-dimensional CoDa. Unlike existing methods, CoDaCoRe is simultaneously scalable, interpretable, sparse, and accurate. We compare the relative merits of CoDaCoRe and its competitors in Table 1.

3.1. Continuous Relaxation

In its basic formulation, CoDaCoRe learns a regression function of the form:

$$f(x) = \alpha + \beta \cdot B(x; J^+, J^-), \tag{3}$$

where $B$ denotes a balance (Eq. 1), and $\alpha$ and $\beta$ are scalar parameters. For clarity, we will restrict our exposition to this formulation, but note that our algorithm can be applied equally to learn amalgamations instead of balances (see Section 3.5), as well as generalizing straightforwardly to nonlinear functions (provided they are suitably parameterized and differentiable).

Let $L(y, f)$ denote the cross-entropy loss, with $f \in \mathbb{R}$ given in logit space. The goal of CoDaCoRe is to find the balance that is maximally associated with the response. Mathematically, this can be written as an empirical risk minimization:

$$\min_{(J^+, J^-, \alpha, \beta)} \sum_i L\left( y_i, \alpha + \beta \cdot B(x_i; J^+, J^-) \right). \tag{4}$$

This objective involves a discrete optimization over pairs $(J^+, J^-)$ of disjoint subsets, a combinatorially hard problem. The key insight of CoDaCoRe is to approximate this combinatorial optimization with a continuous relaxation that can be trained efficiently by gradient descent.

Our relaxation is parameterized by an unconstrained vector of "assignment weights", $w \in \mathbb{R}^p$, with one scalar parameter per input dimension (e.g., one weight per bacteria species).
The weights are mapped to a vector of "soft assignments" via:

$$\tilde{w} = 2 \cdot \mathrm{sigmoid}(w) - 1 = \frac{2}{1 + \exp(-w)} - 1, \tag{5}$$

where the sigmoid is applied component-wise. Eq. 5 maps $w$ onto the interval $(-1, 1)$, which can be understood straightforwardly as a relaxation of the set $\{-1, 1, 0\}$, denoting membership to $J^-$, $J^+$, or neither, respectively. Let us write $\tilde{w}^+ = \mathrm{ReLU}(\tilde{w})$ and $\tilde{w}^- = \mathrm{ReLU}(-\tilde{w})$ for the positive and negative parts of $\tilde{w}$, respectively. We approximate balances (Eq. 1) with the following relaxation:

$$\tilde{B}(x_i; w) = \frac{\sum_j \tilde{w}_j^+ \log x_{ij}}{\sum_j \tilde{w}_j^+} - \frac{\sum_j \tilde{w}_j^- \log x_{ij}}{\sum_j \tilde{w}_j^-} \tag{6}$$

$$= \frac{\tilde{w}^+ \cdot \log x_i}{\|\tilde{w}^+\|_1} - \frac{\tilde{w}^- \cdot \log x_i}{\|\tilde{w}^-\|_1}. \tag{7}$$

In other words, we approximate geometric averages over subsets of the inputs by weighted geometric averages over all components (compare Equations 1 and 6). Crucially, this relaxation is differentiable in $w$, allowing us to construct a surrogate objective function that can be optimized jointly in $(w, \alpha, \beta)$ by gradient descent:

$$\min_{(w, \alpha, \beta)} \sum_i L\left( y_i, \alpha + \beta \cdot \tilde{B}(x_i; w) \right). \tag{8}$$

We defer the details of our implementation of gradient descent to the Supplement (Section A), but we highlight two observations. First, the computational cost of the gradient of Eq. 8 is linear in the dimension of $w$. As a result, our algorithm scales linearly with the dimension of the input, and is fast to fit on large datasets (see Section 4.3). Second, knowledge of the functional form of our relaxation (Eq. 6) can be exploited in order to select the learning rate adaptively (i.e., without tuning), resulting in robust convergence across all real and simulated datasets that we considered.

3.2. Discretization

While a set of features in the form of Eq. 6 may perform accurate classification, a weighted geometric average over all covariates is much harder for a biologist to interpret (and less intuitively appealing) than a bona fide balance over a small number of covariates.
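
To make the contrast between the relaxed and the exact balance concrete, the following is a minimal sketch of Eq. 1 and Eqs. 5 to 7 in plain Python. It is an illustration only, not the authors' implementation: the function names are hypothetical, a single sample is represented as a list of strictly positive values, and the weight vector is assumed to contain at least one positive and one negative entry so that both normalizers are non-zero.

```python
import math


def soft_assignment(w):
    """Eq. 5: map unconstrained weights to (-1, 1) via 2 * sigmoid(w) - 1."""
    return [2.0 / (1.0 + math.exp(-wj)) - 1.0 for wj in w]


def relaxed_balance(x, w):
    """Eqs. 6-7: difference of two weighted averages of log x.

    Assumes x is strictly positive and w has at least one positive and
    one negative entry (otherwise a normalizer would be zero).
    """
    w_tilde = soft_assignment(w)
    w_pos = [max(v, 0.0) for v in w_tilde]   # ReLU(w_tilde)
    w_neg = [max(-v, 0.0) for v in w_tilde]  # ReLU(-w_tilde)
    log_x = [math.log(v) for v in x]
    numerator = sum(a * b for a, b in zip(w_pos, log_x)) / sum(w_pos)
    denominator = sum(a * b for a, b in zip(w_neg, log_x)) / sum(w_neg)
    return numerator - denominator


def balance(x, j_pos, j_neg):
    """Eq. 1: log-ratio of geometric means over two disjoint index sets."""
    def mean_log(indices):
        return sum(math.log(x[j]) for j in indices) / len(indices)
    return mean_log(j_pos) - mean_log(j_neg)


# Hypothetical composition with four parts.
x = [0.1, 0.2, 0.3, 0.4]
# Saturated weights push the soft assignments towards +1, -1, 0, 0,
# so the relaxed balance approaches the exact pairwise log-ratio.
w_saturated = [30.0, -30.0, 0.0, 0.0]
gap = abs(relaxed_balance(x, w_saturated) - balance(x, [0], [1]))
```

Because each weighted average in Eq. 6 is normalized by the sum of its weights, multiplying `x` by a constant leaves `relaxed_balance` unchanged, which mirrors the scale invariance of the exact balance.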
On these grounds, CoDaCoRe implements a "discretization" procedure that exploits the information learned by the soft assignment vector $\tilde{w}$, in order to efficiently identify a pair of sparse subsets $(\hat{J}^+, \hat{J}^-)$.

The most straightforward way to convert the (soft) assignment $\tilde{w}$ into a (hard) pair of subsets is by fixing a threshold $t \in (0, 1)$:

$$\tilde{J}^+ = \{ j : \tilde{w}_j > t \}, \tag{9}$$

$$\tilde{J}^- = \{ j : \tilde{w}_j < -t \}. \tag{10}$$

Note that given a trained $\tilde{w}$ and a fixed threshold $t$, we can evaluate the quality of the corresponding balance $B(x; \tilde{J}^+, \tilde{J}^-)$ (resp. amalgamation) by optimizing Eq. 4 over $(\alpha, \beta)$ alone, i.e., fitting a linear model. Computationally, fitting a linear model is much faster than optimizing Eq. 8, and can be done repeatedly for a range of values of $t$ with little overhead. In CoDaCoRe, we combine this strategy with cross-validation in order to select the threshold, $\hat{t}$, that optimizes predictive performance (see Section A of the Supplement for full detail). Finally, the trained regression function is:

$$\hat{f}(x) = \hat{\alpha} + \hat{\beta} \cdot B(x; \hat{J}^+, \hat{J}^-), \tag{11}$$

where $(\hat{J}^+, \hat{J}^-)$ are the subsets corresponding to the optimal threshold $\hat{t}$, and $(\hat{\alpha}, \hat{\beta})$ are the coefficients obtained by regressing $y_i$ against $B(x_i; \hat{J}^+, \hat{J}^-)$ on the entire training set.

3.3. Regularization

Note from Equations 9 and 10 that larger values of $t$ result in fewer covariates assigned to the balance $B(x; \tilde{J}^+, \tilde{J}^-)$, i.e., a sparser model. Thus, CoDaCoRe can be regularized simply by making $\hat{t}$ larger. Similarly to lasso regression, our implementation of CoDaCoRe uses the 1-standard-error rule: namely, to pick the sparsest model (i.e., the highest $t$) with mean cross-validated score within 1 standard error of the optimum (Friedman et al., 2001). Trivially, this rule can be generalized to a $\lambda$-standard-error rule, where $\lambda$ becomes a regularization hyperparameter that can be tuned by the practitioner if so desired (with lower values trading off some sparsity in exchange for predictive accuracy).
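
The thresholding of Eqs. 9 and 10 and the $\lambda$-standard-error rule can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the cross-validated scores are assumed to be supplied by the caller with higher values being better (e.g., AUC), and the standard error is taken at the best-scoring threshold, which is the usual convention for this rule.

```python
def discretize(w_tilde, t):
    """Eqs. 9-10: indices whose soft assignment exceeds t or falls below -t."""
    j_pos = [j for j, v in enumerate(w_tilde) if v > t]
    j_neg = [j for j, v in enumerate(w_tilde) if v < -t]
    return j_pos, j_neg


def select_threshold(thresholds, cv_mean, cv_se, lam=1.0):
    """Lambda-standard-error rule (assumes higher scores are better).

    Returns the largest (sparsest) threshold whose mean cross-validated
    score is within lam standard errors of the best mean score.
    """
    best = max(range(len(thresholds)), key=lambda k: cv_mean[k])
    cutoff = cv_mean[best] - lam * cv_se[best]
    eligible = [t for t, m in zip(thresholds, cv_mean) if m >= cutoff]
    return max(eligible)


# Hypothetical soft assignments for five covariates.
w_tilde = [0.9, -0.8, 0.2, -0.1, 0.6]
j_pos, j_neg = discretize(w_tilde, 0.5)

# Hypothetical cross-validation summary over a grid of thresholds.
thresholds = [0.1, 0.3, 0.5, 0.7]
cv_mean = [0.80, 0.84, 0.83, 0.70]
cv_se = [0.02, 0.02, 0.02, 0.02]
t_sparse = select_threshold(thresholds, cv_mean, cv_se, lam=1.0)
t_accurate = select_threshold(thresholds, cv_mean, cv_se, lam=0.0)
```

With `lam=0.0` the rule reduces to picking the best-scoring threshold, while larger values of `lam` accept a small loss in score in exchange for a higher threshold and therefore fewer active covariates.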
For consis- tency, we restrict our experiments to λ = 1, however our results can be improved further by tuning λ on each dataset. In practice, we recommend choosing a lower value (e.g., λ = 0) when the emphasis is on predictive accuracy rather than interpretability or sparsity, though our benchmarks still show competitive performance with the choice of λ = 1. Algorithm 1 CoDaCoRe Inputs: Training data: (xi,yi)ni=1. Initialize ĝ(x) = 0. repeat Initialize a new relaxation (w,α,β). Train (w,α,β) by gradient descent. Use cross-validation to find the optimal threshold, t̂. Retrain (α,β) using (Ĵ+, Ĵ−). Update ensemble ĝ(x) ← ĝ(x) + f̂(x). until Ĵ+ = ∅ or Ĵ− = ∅. Return ĝ(x). 3.4. CoDaCoRe Algorithm The computational efficiency of our continuous relaxation allows us to train multiple regressors of the form of Eq. 11 within a single model. In the full CoDaCoRe algorithm, we ensemble multiple such regressors in a stage-wise additive fashion, where each successive balance is fitted on the resid- ual from the current model. Thus, CoDaCoRe identifies a sequence of balances, in decreasing order of importance, each of which is sparse and interpretable. Training termi- nates when an additional relaxation (Eq. 6) cannot improve the cross-validation score relative to the existing ensemble (equivalently, when we obtain t̂ = 1). Typically, only a small number of balances is required to capture the signal in the data, and as a result CoDaCoRe produces very sparse models overall, further enhancing interpretability. Our pro- cedure is summarized in Algorithm 1. 3.5. Amalgamations CoDaCoRe can be used to learn amalgamations (Eq. 2) much in the same way as for balances (the choice of which to use depending on the goals of the biologist). In this case, our relaxation is defined as: Ã(xi; w) = log (∑ j w̃ + j xij∑ j w̃ − j xij ) (12) = log ( w̃+ ·xi w̃− ·xi ) , (13) i.e., we approximate summations over subsets of the in- puts, with weighted summations over all components (com- pare Eq. 
2 and Eq. 12). The rest of the argument follows verbatim, replacing B(·) with A(·) and B̃(·) with Ã(·) in Equations 3, 4, 8, and 11. 3.6. Extensions Our model allows for a number of extensions: • Unsupervised learning. By means of a suitable unsu- pervised loss function, CoDaCoRe can be extended to unlabelled datasets, {xi}ni=1, as a method for identi- Learning Sparse Log-Ratios for High-Throughput Sequencing Data fying log-ratios that provide a useful low-dimensional representation. Such a method would automatically provide a scalable alternative to several existing dimen- sionality reduction techniques for CoDa (Pawlowsky- Glahn et al., 2011; Mert et al., 2015; Martı́n-Fernández et al., 2018; Greenacre, 2019b; Martino et al., 2019). • Incorporating confounders. In addition to (xi,yi)ni=1, in some applications the effect of additional (non- compositional) predictors, zi, is also of interest. In this case, the effect of zi can be “partialled out” a pri- ori by first regressing yi on zi alone, and using this regression as the initialization of the CoDaCoRe en- semble. Alternatively, zi can also be modeled jointly in Equations 3 and 11 (e.g., by adding a linear term γ · zi) (Forslund et al., 2015; Noguera-Julian et al., 2016; Rivera-Pinto et al., 2018). • Nonlinear regression functions. Our method extends naturally to nonlinear regression functions of the form f(x) = hθ(B(x;J +,J−)), where hθ is a parameter- ized differentiable family. These functions include neural networks, which have recently become of in- terest in microbiome research (Morton et al., 2019a; Quinn et al., 2020). • Applications to non-compositional data. Aggregations of parts can be useful outside the realm of CoDa; for example, an amalgamation applied to a categorical variable with many levels represents a grouping of the categories (Bondell & Reich, 2009; Gertheiss & Tutz, 2010; Tutz & Gertheiss, 2016). 4. 
Experiments
We evaluate CoDaCoRe on a collection of 25 benchmark datasets including 13 datasets from the Microbiome Learning Repo (Vangay et al., 2019), and 12 microbiome, metabolite, and microRNA datasets curated by Quinn & Erb (2019). These data vary in dimension from 48 to 3,090 covariates (see Section B of the Supplement for a full description). For each dataset, we fit CoDaCoRe on 20 random 80/20 train/test splits, sampled with stratification by case-control (He & Ma, 2013). We compare against:
• Interpretable models (Sections 2.1 and 2.2): pairwise log-ratios (Greenacre, 2019b) [footnote 3], selbal (Rivera-Pinto et al., 2018), and amalgam (Quinn & Erb, 2020). We also consider lasso logistic regression (with regularization parameter chosen by cross-validation with the 1-standard-error rule).
• Other CoDa models (Section 2.3): Coda-lasso (Lu et al., 2019), DeepCoDA (Quinn et al., 2020), and CLR-lasso (Susin et al., 2020). Note that these methods learn (weighted) geometric averages over a large number of input variables, which are evidently not as straightforward to interpret as simple balances or amalgamations.
• Black box classifiers: Random Forest and XGBoost, where we tune the model complexity parameters by cross-validation (subsample size and early stopping, respectively).

Footnote 3: Implemented using a heuristic search for improved computational efficiency (Quinn et al., 2017).

[Figure 1: scatter plot. x-axis: Runtime (s), log scale; y-axis: Accuracy gain over baseline (%); series: CoDaCoRe (ours), Selbal, Pairwise log-ratios, Coda-lasso, Amalgam; point size scales with the number of inputs.]
Figure 1. Classification accuracy (over baseline) against runtime. Each point represents one of 25 datasets, with size proportional to the input dimension. Note the x-axis is drawn on the log-scale. CoDaCoRe (with balances) is the only method that scales effectively to our larger datasets, while consistently achieving high predictive accuracy. Moreover, its performance is broadly consistent across smaller and larger datasets.
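The splitting protocol described above (20 random 80/20 train/test splits, stratified by case-control status) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the toy label vector, and the choice of seeds are assumptions made for the example, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts.

    Hypothetical helper for illustration; not taken from the CoDaCoRe code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        # Each class contributes the same fraction of its samples to the test set.
        n_test = round(test_fraction * len(idx))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data (assumed): 60 controls (label 0) and 40 cases (label 1).
labels = [0] * 60 + [1] * 40

# Twenty independent splits, one per seed.
splits = [stratified_split(labels, seed=s) for s in range(20)]
train_idx, test_idx = splits[0]
```

With these toy labels, every test set holds 12 controls and 8 cases, so the case-control ratio is the same in the training and test parts of each split.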
4.1. Results
We evaluate the quality of our models across the following criteria: computational efficiency (as measured by runtime), sparsity (as measured by the percentage of input variables that are active in the model), and predictive accuracy (as measured by out-of-sample accuracy and ROC AUC). Table 2 provides an aggregated summary of the results; CoDaCoRe (with balances) is performant on all metrics. Indeed, our method provides the only interpretable model that is simultaneously scalable, sparse, and accurate. Detailed performance metrics on each of the 25 datasets are provided in Section C of the Supplement.

Figure 1 shows the average runtime of our classifiers on each dataset, with larger points denoting larger datasets. CoDaCoRe trains orders of magnitude faster and scales better than existing interpretable CoDa methods. On our larger datasets (3,090 inputs), selbal runs in ∼100 hours, pairwise log-ratios and amalgam both run in ∼10 hours, and CoDaCoRe runs in under 10 seconds (full runtimes are provided in Table 2 in the Supplement). All runs, including those involving gradient descent, were performed on identical CPU cores; CoDaCoRe can be accelerated further using GPUs, but we did not find it necessary to do so. It is also worth noting that the outperformance of CoDaCoRe is not merely a result of the other methods failing on high-dimensional datasets. The consistent performance of CoDaCoRe across smaller and larger datasets is demonstrated in Supplementary Tables 3, 4, and 5, which show a full breakdown of results across each dataset.

Table 2. Evaluation metrics shown for each method, averaged over 25 datasets × 20 random train/test splits. Standard errors are computed independently on each dataset, and then averaged over the 25 datasets. The models are ordered by sparsity, i.e., percentage of active input variables. CoDaCoRe (with balances) is the only learning algorithm that is simultaneously fast, sparse, and accurate.

Method                                    Runtime (s)        Active vars (%)   Accuracy (%)   AUC (%)
Majority class                            0.0±0.0            0.0±0.0           62.5±0.0       50.0±0.0
CoDaCoRe - Balances (ours)                4.8±0.4            1.9±0.3           75.2±2.4       79.5±2.6
CoDaCoRe - Amalgamations (ours)           4.4±0.4            1.9±0.3           71.8±2.4       74.5±2.8
Selbal (Rivera-Pinto et al., 2018)        79,033.7±2,094.1   2.4±0.2           61.2±1.9       80.0±2.4
Pairwise log-ratios (Greenacre, 2019b)    3,283.0±214.1      2.5±0.4           73.3±1.7       75.2±2.4
Lasso                                     1.6±0.1            4.4±0.6           72.4±1.7       75.2±2.3
Coda-lasso (Lu et al., 2019)              1,043.0±55.4       19.7±2.7          72.5±2.3       78.0±2.4
Amalgam (Quinn & Erb, 2020)               7,360.5±209.8      87.6±2.1          74.4±2.5       78.2±2.7
DeepCoDA (Quinn et al., 2020)             296.5±21.4         89.3±0.6          70.6±2.9       77.6±2.9
CLR-lasso (Susin et al., 2020)            2.0±0.2            100.0±0.0         77.5±1.8       81.6±2.2
Random Forest                             10.6±0.4           ·                 78.0±2.2       82.2±2.2
XGBoost                                   3.9±0.2            ·                 78.4±2.1       82.4±2.1

Not only is CoDaCoRe sparser and more accurate than other interpretable models, it also performs on par with state-of-the-art black-box classifiers. By simply reducing the regularization parameter from λ = 1 to λ = 0, CoDaCoRe (with balances) achieved an average out-of-sample accuracy of 77.6% and an AUC of 82.0%, on par with Random Forest and XGBoost (bottom rows of Table 2), while only using 5.9% of the input variables, on average. This result indicates, first, that CoDaCoRe provides a highly effective algorithm for variable selection in high-dimensional HTS data. Second, the fact that CoDaCoRe achieves predictive accuracy similar to that of best-in-class black-box classifiers suggests that our model may have captured a near-complete representation of the signal in the data.
At any rate, we take this as evidence that log-ratio transformed features are indeed of biological importance in the context of HTS data, corroborating previous microbiome research (Rahat-Rozenbloom et al., 2014; Crovesy et al., 2020; Magne et al., 2020).

4.2. Interpretability
The CoDaCoRe algorithm offers two kinds of interpretability. First, it provides the analyst with sets of covariates whose aggregated ratio predicts the outcome of interest. These sets are easy to understand because they are discrete, with each component making an equivalent (unweighted) contribution. They are also sparse, usually containing fewer than 10 features per ratio, and can be made sparser by adjusting the regularization parameter λ. Such ratios have a precedent in microbiome research; for example, the Firmicutes-to-Bacteroidetes ratio is used as a biomarker of gut health (Crovesy et al., 2020; Magne et al., 2020). Second, CoDaCoRe ranks predictive ratios hierarchically. Due to the ensembling procedure, the first ratio learned is the most predictive, the second ratio predicts the residual from the first, and so forth. Like principal components, the balances (or amalgamations) learned by CoDaCoRe are naturally ordered in terms of their explanatory power. This ordering aids interpretability by decomposing a multivariable model into comprehensible “chunks” of information.

Notably, we find a high degree of stability in the log-ratios selected by the model. We repeated CoDaCoRe on 10 independent training set splits of the Crohn disease data provided by Rivera-Pinto et al. (2018), and found consensus among the learned models. Figure 2 shows which bacteria were included for each split, in both versions of CoDaCoRe (balances and amalgamations). Importantly, most of the bacteria that were selected consistently by CoDaCoRe – notably Dialister, Roseburia and Clostridiales – were also identified by Rivera-Pinto et al. (2018).
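The two versions of CoDaCoRe differ only in how each group of variables is aggregated: a balance uses the geometric mean, whereas an amalgamation uses the sum. The following minimal sketch, with made-up relative abundances (all numbers and variable names are assumptions for illustration), shows how differently the two aggregates react when a single rare component becomes ten times less abundant.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical relative abundances: two common taxa and one rare taxon.
common = [0.30, 0.25]
rare_hi = 0.010
rare_lo = 0.001  # the same rare taxon, ten times less abundant

# Factor by which each aggregate changes when the rare taxon drops tenfold.
gm_ratio = geometric_mean(common + [rare_hi]) / geometric_mean(common + [rare_lo])
sum_ratio = sum(common + [rare_hi]) / sum(common + [rare_lo])

print(f"geometric mean changes by a factor of {gm_ratio:.3f}")
print(f"sum changes by a factor of {sum_ratio:.3f}")
```

In this toy case the geometric mean changes by a factor of 10^(1/3), roughly 2.15, while the sum changes by less than 2%, so a rare component moves a balance far more than it moves an amalgamation.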
Differences between the sets selected by CoDaCoRe with balances vs. CoDaCoRe with amalgamations can be explained by differences in how the geometric mean vs. summation operations impact the log-ratio. The geometric mean, being more sensitive to small numbers, is more affected by the presence of rarer bacteria species like Dialister and Roseburia (as compared with the more common bacteria species like Haemophilus and Faecalibacterium).

[Figure 2: two panels; x-axis in both: independent 80% training set splits (1 to 10). Panel "CoDaCoRe - Balances" rows: Dialister, Aggregatibacter, Lactobacillales, Streptococcus, Parabacteroides, Peptostreptococcaceae, Faecalibacterium, Lachnospira, Clostridiales, Roseburia. Panel "CoDaCoRe - Amalgamations" rows: Haemophilus, Enterobacteriaceae, Fusobacterium, Blautia, Streptococcus, Dialister, Lachnospiraceae, Roseburia, Prevotella, Clostridiales, Parabacteroides, Bacteroides, Faecalibacterium.]
Figure 2. CoDaCoRe variable selection for the first (most explanatory) log-ratio on the Crohn disease data (Rivera-Pinto et al., 2018). For each of 10 independent training set splits (80% of the data), we show which variables are selected in the numerator (blue) and denominator (orange) of the log-ratio. Both versions of CoDaCoRe, with balances (top) or amalgamations (bottom), learn remarkably consistent log-ratios across independent training sets.

4.3. Scaling to Liquid Biopsy Data
HTS data generated from clinical blood samples can be described as a “liquid biopsy” that can be used for cancer diagnosis and surveillance (Best et al., 2015; Alix-Panabières & Pantel, 2016). These data can be very high-dimensional, especially when they include all gene transcripts as covariates. In a clinical context, the use of log-ratio predictors is an attractive option because they automatically correct for inter-sample sequencing biases that might otherwise limit the generalizability of the models (Dillies et al., 2013).
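The statement that log-ratio predictors correct for inter-sample sequencing biases can be checked numerically: multiplying every count in a sample by the same factor (for example, a deeper sequencing run) leaves a log-ratio unchanged. The sketch below assumes the unnormalized forms, i.e., a balance as the log-ratio of two geometric means and an amalgamation as the log-ratio of two sums; the function names, counts, and index sets are hypothetical.

```python
import math


def balance(x, num, den):
    """Log-ratio of the geometric means of x over index sets num and den."""
    def mean_log(idx):
        return sum(math.log(x[j]) for j in idx) / len(idx)
    return mean_log(num) - mean_log(den)


def amalgamation(x, num, den):
    """Log-ratio of the sums of x over index sets num and den."""
    return math.log(sum(x[j] for j in num) / sum(x[j] for j in den))


# Hypothetical counts for one sample, and the same sample sequenced 7.5x deeper.
counts = [120.0, 30.0, 5.0, 45.0]
deeper = [7.5 * c for c in counts]

# Assumed numerator and denominator index sets.
num, den = [0, 2], [1, 3]

same_balance = math.isclose(balance(counts, num, den), balance(deeper, num, den))
same_amalgamation = math.isclose(
    amalgamation(counts, num, den), amalgamation(deeper, num, den)
)
print(same_balance, same_amalgamation)
```

The common depth factor cancels between numerator and denominator in both cases, which is why neither predictor depends on the total number of reads in a sample.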
Unfortunately, existing log-ratio methods like selbal and amalgam simply cannot scale to liquid biopsy data sets that contain as many as 50,000 or more input variables. The large dimensionality of such data has restricted its analysis to overly simplistic linear models, black-box models that are scalable but not interpretable, or suboptimal hybrid approaches where covariates must be pre-selected based on univariate measures (Best et al., 2015; Zhang et al., 2017; Sheng et al., 2018). Owing to its linear scaling, CoDaCoRe can be fitted to these data at a similar computational cost to a single lasso regression, i.e., under a minute on a single CPU core. Thus, CoDaCoRe can be used to discover interpretable and predictive log-ratios that are suitable for liquid biopsy cancer diagnostics, among other similar applications.

Table 3. Evaluation metrics for the liquid biopsy data (Best et al., 2015), averaged over 20 independent 80/20 train/test splits. CoDaCoRe (with balances) achieves predictive accuracy equal to that of competing methods, but with much sparser solutions. Note that sparsity is expressed as an (integer) number of active variables in the model (not as a percentage of the total, as was done in Table 2). Running time is shown in seconds (standard errors were small and are omitted for brevity).

Method      Time (s)   # Vars   Acc. (%)    AUC (%)
Baseline    0          0±0      79.1±0.0    50.0±0.0
CoDaCoRe    31         3±1      91.0±1.9    93.6±2.6
Lasso       23         22±4     87.8±1.3    94.7±1.5
RF          383        ·        89.0±1.6    94.1±1.8
XGBoost     108        ·        90.6±1.9    95.9±1.5

We showcase the capabilities of CoDaCoRe in this high-dimensional setting by applying our algorithm to the liquid biopsy data of Best et al. (2015). These contain p = 58,037 genes sequenced in n = 288 human subjects, 60 of whom were healthy controls, the others having been previously diagnosed with cancer.
Averaging over 20 random 80/20 train/test splits of this dataset, we found that CoDaCoRe achieved the same predictive accuracy as competing methods (within error), but obtained a much sparser model. Remarkably, CoDaCoRe identified log-ratios involving just 3 genes that were as predictive as both black-box classifiers and linear models with over 20 covariates. This case study again illustrates the potential of CoDaCoRe to derive novel biological insights, and also to develop learning algorithms for cancer diagnosis, a domain in which model interpretability – including sparsity – is of paramount importance (Wan et al., 2017).

5. Conclusion
Our results corroborate the summary in Table 1: CoDaCoRe is the first sparse and interpretable CoDa model that can scale to high-dimensional HTS data. It does so convincingly, with linear scaling that results in runtimes similar to linear models. Our method is also competitive in terms of predictive accuracy, performing comparably to powerful black-box classifiers, but with interpretability. Our findings suggest that CoDaCoRe could play a significant role in the future analysis of high-throughput sequencing data, with broad implications in microbiology, statistical genetics, and more generally, in the field of CoDa.

References
Aitchison, J. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160, 1982.
Alix-Panabières, C. and Pantel, K. Clinical applications of circulating tumor cells and circulating tumor DNA as liquid biopsy. Cancer discovery, 6(5):479–491, 2016.
Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Best, M. G., Sol, N., Kooi, I., Tannous, J., Westerman, B. A., Rustenburg, F., Schellen, P., Verschueren, H., Post, E., Koster, J., et al.
RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer cell, 28(5):666–676, 2015.
Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast differentiable sorting and ranking. In International Conference on Machine Learning, pp. 950–959. PMLR, 2020.
Bondell, H. D. and Reich, B. J. Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65(1):169–177, 2009.
Calle, M. L. Statistical analysis of metagenomics data. Genomics & informatics, 17(1), 2019.
Cammarota, G., Ianiro, G., Ahern, A., Carbone, C., Temko, A., Claesson, M. J., Gasbarrini, A., and Tortora, G. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature Reviews Gastroenterology & Hepatology, 17(10):635–648, 2020.
Crovesy, L., Masterson, D., and Rosado, E. L. Profile of the gut microbiota of adults with obesity: a systematic review. European journal of clinical nutrition, 74(9):1251–1262, 2020.
Cuturi, M., Teboul, O., and Vert, J.-P. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6861–6871, 2019.
Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics, 14(6):671–683, 2013.
Egozcue, J. J. and Pawlowsky-Glahn, V. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37(7):795–828, 2005.
Egozcue, J. J. and Pawlowsky-Glahn, V. Compositional data: the sample space and its structure. TEST, 28(3):599–638, 2019.
Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barcelo-Vidal, C. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279–300, 2003.
Fernandes, A. D., Macklaim, J. M., Linn, T.
G., Reid, G., and Gloor, G. B. ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-seq. PLoS One, 8(7):e67019, 2013.
Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., and Gloor, G. B. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1):15, 2014.
Filzmoser, P. and Walczak, B. What can go wrong at the data normalization step for identification of biomarkers? Journal of Chromatography A, 1362:194–205, 2014.
Filzmoser, P., Hron, K., and Reimann, C. Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Science of the Total Environment, 407(23):6100–6108, 2009.
Forslund, K., Hildebrand, F., Nielsen, T., Falony, G., Le Chatelier, E., Sunagawa, S., Prifti, E., Vieira-Silva, S., Gudmundsdottir, V., Pedersen, H. K., et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature, 528(7581):262–266, 2015.
Friedman, J., Hastie, T., Tibshirani, R., et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
Gertheiss, J. and Tutz, G. Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, pp. 2150–2180, 2010.
Gloor, G. B. and Reid, G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Canadian journal of microbiology, 62(8):692–703, 2016.
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J. Microbiome datasets are compositional: and this is not optional. Frontiers in microbiology, 8:2224, 2017.
Goodman, B. and Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI magazine, 38(3):50–57, 2017.
Gordon-Rodriguez, E., Loaiza-Ganem, G., and Cunningham, J. The continuous categorical: a novel simplex-valued exponential family. In International Conference on Machine Learning, pp. 3637–3647. PMLR, 2020a.
Gordon-Rodriguez, E., Loaiza-Ganem, G., Pleiss, G., and Cunningham, J. P. Uses and abuses of the cross-entropy loss: case studies in modern deep learning. arXiv preprint arXiv:2011.05231, 2020b.
Greenacre, M. Comments on: Compositional data: the sample space and its structure. TEST, 28(3):644–652, 2019a.
Greenacre, M. Variable selection in compositional data analysis using pairwise logratios. Mathematical Geosciences, 51(5):649–682, 2019b.
Greenacre, M. Amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. Applied Computing and Geosciences, 5:100017, 2020.
Greenacre, M., Grunsky, E., and Bacon-Shone, J. A comparison of isometric and amalgamation logratio balances in compositional data analysis. Computers & Geosciences, pp. 104621, 2020.
He, H. and Ma, Y. Imbalanced learning: foundations, algorithms, and applications. 2013.
Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
Li, H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application, 2:73–94, 2015.
Linderman, S., Mena, G., Cooper, H., Paninski, L., and Cunningham, J. Reparameterizing the Birkhoff polytope for variational permutation inference. In International Conference on Artificial Intelligence and Statistics, pp. 1618–1627. PMLR, 2018.
Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., and Bähler, J. Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol, 11(3):e1004075, 2015.
Lu, J., Shi, P., and Li, H.
Generalized linear models with linear constraints for microbiome compositional data. Biometrics, 75(1):235–244, 2019.
Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
Magne, F., Gotteland, M., Gauthier, L., Zazueta, A., Pesoa, S., Navarrete, P., and Balamurugan, R. The firmicutes/bacteroidetes ratio: a relevant marker of gut dysbiosis in obese patients? Nutrients, 12(5):1474, 2020.
Martín-Fernández, J., Pawlowsky-Glahn, V., Egozcue, J., and Tolosana-Delgado, R. Advances in principal balances for compositional data. Mathematical Geosciences, 50(3):273–298, 2018.
Martino, C., Morton, J. T., Marotz, C. A., Thompson, L. R., Tripathi, A., Knight, R., and Zengler, K. A novel sparse compositional technique reveals microbial perturbations. MSystems, 4(1), 2019.
Mena, G., Snoek, J., Linderman, S., and Belanger, D. Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations, 2018.
Mert, M. C., Filzmoser, P., and Hron, K. Sparse principal balances. Statistical Modelling, 15(2):159–174, 2015.
Morton, J. T., Sanders, J., Quinn, R. A., McDonald, D., Gonzalez, A., Vázquez-Baeza, Y., Navas-Molina, J. A., Song, S. J., Metcalf, J. L., Hyde, E. R., et al. Balance trees reveal microbial niche differentiation. MSystems, 2(1), 2017.
Morton, J. T., Aksenov, A. A., Nothias, L. F., Foulds, J. R., Quinn, R. A., Badri, M. H., Swenson, T. L., Van Goethem, M. W., Northen, T. R., Vazquez-Baeza, Y., et al. Learning representations of microbe–metabolite interactions. Nature methods, 16(12):1306–1314, 2019a.
Morton, J. T., Marotz, C., Washburne, A., Silverman, J., Zaramela, L. S., Edlund, A., Zengler, K., and Knight, R. Establishing microbial composition measurement standards with reference frames. Nature communications, 10(1):1–11, 2019b.
Noguera-Julian, M., Rocafort, M., Guillén, Y., Rivera, J., Casadellà, M., Nowak, P., Hildebrand, F., Zeller, G., Parera, M., Bellido, R., et al. Gut microbiota linked to sexual preference and HIV infection. EBioMedicine, 5:135–146, 2016.
Pawlowsky-Glahn, V. and Buccianti, A. Compositional data analysis: Theory and applications. John Wiley & Sons, 2011.
Pawlowsky-Glahn, V. and Egozcue, J. J. Compositional data and their analysis: an introduction. Geological Society, London, Special Publications, 264(1):1–10, 2006.
Pawlowsky-Glahn, V., Egozcue, J. J., Tolosana Delgado, R., et al. Principal balances. Proceedings of CoDaWork, pp. 1–10, 2011.
Pearson, K. VII. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, (187):253–318, 1896.
Potapczynski, A., Loaiza-Ganem, G., and Cunningham, J. P. Invertible Gaussian reparameterization: Revisiting the Gumbel-softmax. Advances in Neural Information Processing Systems, 33, 2020.
Prifti, E., Chevaleyre, Y., Hanczar, B., Belda, E., Danchin, A., Clément, K., and Zucker, J.-D. Interpretable and accurate prediction models for metagenomics data. GigaScience, 9(3):giaa010, 2020.
Quinn, T., Nguyen, D., Rana, S., Gupta, S., and Venkatesh, S. DeepCoDA: personalized interpretability for compositional health data. In International Conference on Machine Learning, pp. 7877–7886. PMLR, 2020.
Quinn, T. P. and Erb, I. Using balances to engineer features for the classification of health biomarkers: a new approach to balance selection. bioRxiv, pp. 600122, 2019.
Quinn, T. P. and Erb, I. Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data. NAR Genomics and Bioinformatics, 2(4):lqaa076, 2020.
Quinn, T. P., Richardson, M. F., Lovell, D., and Crowley, T. M.
propr: an R-package for identifying proportionally abundant features using compositional data analysis. Scientific reports, 7(1):1–9, 2017.
Quinn, T. P., Erb, I., Richardson, M. F., and Crowley, T. M. Understanding sequencing data as compositions: an outlook and review. Bioinformatics, 34(16):2870–2878, 2018.
Quinn, T. P., Erb, I., Gloor, G., Notredame, C., Richardson, M. F., and Crowley, T. M. A field guide for the compositional analysis of any-omics data. GigaScience, 8(9):giz107, 2019.
Rahat-Rozenbloom, S., Fernandes, J., Gloor, G. B., and Wolever, T. M. Evidence for greater production of colonic short-chain fatty acids in overweight than lean humans. International journal of obesity, 38(12):1525–1531, 2014.
Rivera-Pinto, J., Egozcue, J. J., Pawlowsky-Glahn, V., Paredes, R., Noguera-Julian, M., and Calle, M. L. Balances: a new perspective for microbiome analysis. MSystems, 3(4), 2018.
Sheng, M., Dong, Z., and Xie, Y. Identification of tumor-educated platelet biomarkers of non-small-cell lung cancer. OncoTargets and therapy, 11:8143, 2018.
Silverman, J. D., Washburne, A. D., Mukherjee, S., and David, L. A. A phylogenetic transform enhances analysis of compositional microbiota data. Elife, 6:e21887, 2017.
Susin, A., Wang, Y., Lê Cao, K.-A., and Calle, M. L. Variable selection in microbiome compositional data analysis. NAR Genomics and Bioinformatics, 2(2):lqaa029, 2020.
Templ, M. Artificial neural networks to impute rounded zeros in compositional data. arXiv preprint arXiv:2012.10300, 2020.
Tolosana-Delgado, R., Talebi, H., Khodadadzadeh, M., and Van den Boogaart, K. On machine learning algorithms and compositional data. In Proceedings of the 8th International Workshop on Compositional Data Analysis, Terrassa, Spain, pp. 3–8, 2019.
Tutz, G. and Gertheiss, J. Regularized regression for categorical data. Statistical Modelling, 16(3):161–200, 2016.
Van den Boogaart, K. G. and Tolosana-Delgado, R. Analyzing compositional data with R, volume 122.
Springer, 2013.
Vangay, P., Hillmann, B. M., and Knights, D. Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks. GigaScience, 8(5), 04 2019.
Wan, J. C., Massie, C., Garcia-Corbacho, J., Mouliere, F., Brenton, J. D., Caldas, C., Pacey, S., Baird, R., and Rosenfeld, N. Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nature Reviews Cancer, 17(4):223, 2017.
Washburne, A. D., Silverman, J. D., Leff, J. W., Bennett, D. J., Darcy, J. L., Mukherjee, S., Fierer, N., and David, L. A. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. PeerJ, 5:e2969, 2017.
Wooley, J. C., Godzik, A., and Friedberg, I. A primer on metagenomics. PLoS Comput Biol, 6(2):e1000667, 2010.
Xie, S. M. and Ermon, S. Reparameterizable subset sampling via continuous relaxations. In International Joint Conferences on Artificial Intelligence, 2019.
Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., and Pfister, T. Differentiable top-k operator with optimal transport. Advances in Neural Information Processing Systems, 33, 2020.
Yang, J., Zhang, Q., Ni, B., Li, L., Liu, J., Zhou, M., and Tian, Q. Modeling point clouds with self-attention and Gumbel subset sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3323–3332, 2019.
Zhang, Y.-H., Huang, T., Chen, L., Xu, Y., Hu, Y., Hu, L.-D., Cai, Y., and Kong, X. Identifying and analyzing different cancer subtypes using RNA-seq data of blood platelets. Oncotarget, 8(50):87494, 2017.
10_1101-2021_02_11_430789 ----
Accelerating COVID-19 research with graph mining and transformer-based learning

Ilya Tyagin, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, tyagin@udel.edu
Ankit Kulshrestha, Computer and Information Sciences, University of Delaware, Newark, DE, akulshr@udel.edu
Justin Sybrandt∗, School of Computing, Clemson University, Clemson, SC, jsybran@clemson.edu
Krish Matta, Charter School of Wilmington, Wilmington, DE, matta.krish@charterschool.org
Michael Shtutman, Drug Discovery and Biomedical Sciences, University of S. Carolina, Columbia, SC, shtutmanm@sccp.sc.edu
Ilya Safro, Computer and Information Sciences, University of Delaware, Newark, DE, isafro@udel.edu

ABSTRACT
In 2020, the White House released the “Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset,” wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. The systems are based on graph mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis.
Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.
Reproducibility: All code, details, and pre-trained models are available at https://github.com/IlyaTyagin/AGATHA-C-GP

CCS CONCEPTS
• Applied computing → Bioinformatics; Document management and text processing; • Computing methodologies → Learning latent representations; Neural networks; Information extraction; Semantic networks.

∗Now with Google Brain. Contact: jsybrandt@google.com.

KEYWORDS
Hypothesis Generation, Literature-Based Discovery, Transformer Models, Semantic Networks, Biomedical Recommendation

1 INTRODUCTION
Development of vaccines for COVID-19 is a major triumph of modern medicine and humankind’s ability to accelerate scientific research. While we are all hoping to see large-scale positive changes from fast mass adoption of the existing vaccines, there remain significant open research questions around COVID-19. The scientific community has a responsibility to do everything possible to block the ongoing transmission of the dangerous virus and accelerate research to mitigate its consequences. We present the following automated knowledge discovery system in order to propose new tools that could complement the existing arsenal of techniques to accelerate biomedical and drug discovery research for events like COVID-19.

The COVID-19 pandemic became one of the most important events in the information space since the end of 2019. The pace of published scientific information is unprecedented and spans all resolutions, from the news and pop-science articles to drug design at the molecular level.
The pace of scientific research has already been a significant problem in science for years [29], and under current circumstances this factor becomes even more pronounced. Several thousands papers are being added weekly to CORD-19 [39] (the dataset of publications related to COVID-19) and even more in MEDLINE [1]. As a result, groups working on similar problems may not be immediately aware of the other’s findings, which can lead to inefficient investments and production delays. Under normal circumstances, the MEDLINE database of biomed- ical citations receives approximately 950,000 new papers per year. Currently this database indexes 31 million total citations. This pace challenges traditional research methods, which often rely on human intuition when searching for relevant information. As a result, the demand for modern AI solutions to help with the automated anal- ysis of scientific information is incredibly high. For instance, the field of drug discovery has explored a range of AI analytical tools .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.11.430789doi: bioRxiv preprint https://github.com/IlyaTyagin/AGATHA-C-GP https://doi.org/10.1101/2021.02.11.430789 http://creativecommons.org/licenses/by-nc-nd/4.0/ Figure 1: Number of new citations per week in CORD-19 dataset. to expedite new treatments [12]. Designing lab experiments and finding candidate chemical compounds is a costly and long-lasting procedure, often taking years. 
To accelerate scientific discovery, researchers have developed a family of strategies that utilize public knowledge from databases like MEDLINE, available through the National Institutes of Health (NIH), to facilitate automated hypothesis generation (HG), also known as literature-based discovery. Undiscovered public knowledge, that is, information that is implicitly present within available literature but is not yet explicitly known by an individual who can act on it, represents the target of our work.

Although there are quite a few automated HG systems [12], including those we have previously proposed [35, 37], none of them is currently customized and available in the open domain to massively process COVID-19 related queries. In addition to the traditional general requirements for HG systems, such as high-quality hypotheses, interpretability, and availability to the broad scientific community, the specific demands of COVID-19 data analysis require: (1) customization of the vocabulary and other logical units such as subject-verb-object predicates; (2) customization of the training data, which in the reality of urgent research contains a lot of controversial and incorrect information; (3) models for different information resolutions; and (4) validation on ongoing domain-specific discovery.

Our contribution: In this work we bridge this gap by releasing AGATHA-C and AGATHA-GP, reliable and easy-to-use HG systems that demonstrate state-of-the-art performance and validate their inference capabilities on both COVID-19 related and general biomedical data. To relate them closely to different goals of COVID-19 research, they correspond to the microscopic (AGATHA-C, for COVID-19) and macroscopic (AGATHA-GP, for general purpose) scales of knowledge discovery.
Both systems are able to process any query that connects biomedical concepts, but AGATHA-C exhibits better results on molecular-scale queries, e.g., those relevant to drug design, while AGATHA-GP works better for general queries, e.g., establishing connections between a certain profession and COVID-19 transmission.

Both systems are the next generation of the AGATHA knowledge network mining transformer model [37]. They substantially improve the quality of the previous AGATHA by introducing a new information layer into the multi-layered semantic knowledge network pipeline and by expanding it with new information retrieval techniques that facilitate inference. We deploy the deep learning transfer model trained with up-to-date datasets and provide an easy-to-use interface for the broad scientific community to conduct COVID-19 research. We validate the system via candidate ranking [36, 37] using very recent scientific publications containing findings absent from the training set. While the original AGATHA demonstrated state-of-the-art performance at the time of its release, AGATHA and other systems were found to perform with notably lower quality on the extremely rapidly changing COVID-19 research. We demonstrate a remarkable improvement of approximately 20-30% (in ROC AUC) on average across different types of queries, with very fast query processing that allows massive validation. In addition, we demonstrate that the proposed system can identify recently uncovered gene (BST2) and hormone (oxytocin and melatonin) relationships to COVID-19, using only papers published before these connections were discovered.

Reproducibility: All code, details, and pre-trained models are available at https://github.com/IlyaTyagin/AGATHA-C-GP

2 BACKGROUND
The CORD-19 dataset [39] was released as a response to the worldwide COVID-19 pandemic to help data science experts and researchers tackle the challenge of answering high-priority scientific questions.
It is updated daily and was created by the Allen Institute for AI in collaboration with Microsoft Research, NLM, IBM and other organizations. At the time of this publication it contains over 400,000 scientific abstracts and over 150,000 full-text papers about coronaviruses, primarily COVID-19.

MEDLINE is an NIH database that includes almost 31 million citations (as of 2021) of scientific papers in biomedical and related fields. Some of the citations are provided with MeSH (Medical Subject Headings) terms and other metadata. MEDLINE is one of the largest and best-known resources for biomedical text mining.

Hypothesis Generation Systems. The HG field has been present in information sciences for several decades. The first notable approach, called the A-B-C model, was proposed by Swanson et al. in 1986 [33]. The idea of the A-B-C model is to discover intermediate (B) terms that occur in the titles of publications for both term A (source) and term C (target). In their experiments, Swanson et al. discovered an implicit connection between Raynaud’s syndrome (term A) and fish oil (term C) through blood viscosity (term B), which was mentioned in both sets. The hypothesis that fish oil can be used for patients with Raynaud’s disease was experimentally confirmed several years later [10]. The key idea of the method is that all fragmented bits of information are explicitly known, but their implicit relationships are what HG systems aim to uncover.

We note the difference between HG and traditional information retrieval. Information retrieval techniques, which represent the vast majority of biomedical literature-based discovery systems, are trained and (what is even more important) validated to retrieve existing information, whereas HG techniques predict undiscovered knowledge and thus must be massively validated on it. HG validation requires training the system strictly on historical data rather than sampling it over the entire time span.
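The A-B-C model described above can be illustrated with a small sketch. This is only a toy reading of Swanson's idea, not code from any of the systems discussed here: the function name `abc_candidates`, the representation of each title as a set of terms, and the three example titles are all assumptions made for illustration.

```python
from collections import defaultdict


def abc_candidates(titles, source, target_vocabulary):
    """Return candidate C terms linked to `source` (A) only through shared B terms."""
    # Map every term to the set of title indices in which it occurs.
    occurrences = defaultdict(set)
    for idx, terms in enumerate(titles):
        for term in terms:
            occurrences[term].add(idx)

    # B terms: everything that co-occurs with the source term A in some title.
    a_titles = set(occurrences[source])
    b_terms = {t for idx in a_titles for t in titles[idx]} - {source}

    candidates = {}
    for c in target_vocabulary:
        # Skip C terms that are already explicitly linked to A.
        if c == source or occurrences[c] & a_titles:
            continue
        c_neighbors = {t for idx in occurrences[c] for t in titles[idx]}
        bridge = sorted((b_terms & c_neighbors) - {c})
        if bridge:
            candidates[c] = bridge
    return candidates


# Hypothetical toy corpus: each title is reduced to a set of terms.
titles = [
    {"raynaud", "blood viscosity"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
]
result = abc_candidates(titles, "raynaud", {"fish oil"})
print(result)
```

On this toy corpus, "fish oil" never shares a title with "raynaud", yet both share "blood viscosity", so it is returned as a candidate together with its bridging B term.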
Advances in machine and deep learning have transformed the algorithmics of HG systems (see Sec. 9), which are now able to process much larger information volumes while demonstrating much higher quality predictions. However, the lack of broader applicability of HG systems during the COVID-19 pandemic demonstrates that several major issues exist and require immediate attention:
(1) Most of the existing HG systems are domain-specific (e.g., gene-disease interactions), which is usually expressed in limiting the processed information (e.g., significant filtering of vocabulary and papers to a specific domain in probabilistic topic modeling [38]);
(2) Proper validation of an HG system remains a technical problem because multiple large-scale models have to be trained with all heterogeneous data carefully eliminated several years back;
(3) Moreover, a large number of HG systems are not massively validated at all, except for the rediscovery of very old findings [28] or the demonstration of just a few proactive examples in a human-curated investigation; and
(4) Interpretability and explainability of generated hypotheses remain a major issue.

The UMLS Metathesaurus [7] is the NIH database containing information about millions of concepts (both medical and general) and their synonyms. The Metathesaurus accumulates information about its entries from more than 200 different vocabularies, making it possible to map and connect concepts from different terminologies.
The Metathesaurus also keeps metadata about the concepts, such as semantic types and their hierarchy. The core unit of information in UMLS is the concept unique identifier, or CUI. A CUI is a codified representation of a specific term, which includes its different atoms (spelling variants or translations of the term into other languages), vocabulary entries, definitions and other metadata.

SemRep [4] is a software kit developed by NIH for the extraction of semantic predicates (subject-verb-object triples) from the provided corpus. It can also extract entities not involved in any semantic predicate, if the corresponding option is selected. The official example of possible SemRep output is: INPUT = “We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.”, OUTPUT = “Hemofiltration-TREATS-Patients; Digoxin overdose-PROCESS_OF-Patients; hyperkalemia-COMPLICATES-Digoxin overdose; Hemofiltration-TREATS(INFER)-Digoxin overdose”. SemRep handles word sense disambiguation and maps terms to the corresponding CUIs from the UMLS Metathesaurus.

ScispaCy [24] is a specialized version of spaCy maintained by AllenAI, containing spaCy models for processing scientific and biomedical texts. ScispaCy models are trained on different sources, such as PMC-pretrained word2vec representations, the MedMentions entity linking dataset and so on. ScispaCy can handle various NLP tasks, such as NER, dependency parsing and POS tagging, where it achieves state-of-the-art performance.

SciBERT [6] is a BERT-like pretrained transformer language model, for which full-text scientific papers were used as the training dataset. Embeddings are learned in a word-piece fashion, which lets them capture the relationships not only between words in a sentence, but also between word parts within each word.

FAISS [15] is a library for fast approximate clustering and similarity search between dense vectors.
It scales to huge datasets that do not fit in RAM and can be used in a distributed fashion. FAISS is used in our pipeline to perform 𝑘-means clustering of PQ-quantized sentence vectors to generate 𝑘-nearest neighbor edges for similar sentences (nodes) in the knowledge network.

Figure 2: AGATHA multi-layered graph schema.

PTBG [21] (PyTorch BigGraph) is a high-performance graph embedding system that allows distributed training. It was designed to handle large heterogeneous networks containing hundreds of millions of nodes of different types and billions of typed edges. Distributed training is achieved by computing embeddings on disjoint node sets.

AllenNLP Open Information Extraction. AllenNLP [11] is a powerful library developed by AllenAI that uses a PyTorch backend to provide deep-learning models for various natural language processing tasks. Specifically, AllenNLP Open Information Extraction provides a trained deep bi-LSTM model for extracting predicates from unstructured text. An API is provided for running inference in both single-sentence and batch modes.

3 PIPELINE SUMMARY
We briefly summarize the AGATHA semantic graph construction pipeline. It is described in greater detail in the original paper [37].

Text pre-processing. The input for our system is a corpus of scientific citations from the MEDLINE and CORD-19 datasets. These files contain titles and abstracts for millions of biomedical papers. We filter non-English documents, using the FastText Language Identification model [16] if the language is not provided. After that we split all abstracts into sentences and process all sentences with the ScispaCy library. From each sentence we extract POS-annotated lemmas and entities and perform 𝑛-gram mining, where 𝑛 ∈ [2, 3, 4] and 𝑛-grams are composed of frequently co-occurring lemmas. Additionally, we associate all sentences with any relevant metadata, such as the MeSH/UMLS keywords provided along with the citation.

Semantic Graph Construction.
We construct a semantic graph containing different types of nodes, namely, sentences, entities, coded terms (from UMLS and MeSH), 𝑛-grams, lemmas, and predicates, following the schema depicted in Figure 2. Edges between sentences are induced from the nearest-neighbors network of sentence embeddings. We also include an edge between two sentences that appear sequentially within the same abstract, counting the title as the first sentence. Other edges can be inferred directly from the recorded metadata. For instance, the node representing the entity “COVID-19” is connected to every sentence and predicate that discusses COVID-19.

NLM UMLS implementation. The prior AGATHA semantic network only includes UMLS terms that appear in SemMedDB predicates [18], which is a major limitation. In this work we enrich the “Coded Term” layer by introducing an additional preprocessing phase wherein we run the SemRep tool, with the full-fielded output option, ourselves on the entire input corpora. This phase would be necessary in any case, as CORD-19 and the most recent MEDLINE citations are not represented within the slowly updated SemMedDB. Moreover, we find that we can substantially increase the quality of recovered terms by applying these tools ourselves. By doing so, we not only enrich the “Coded Term” semantic network layer, but also introduce a significant number of previously uncovered semantic predicates.
This happens because SemMedDB is a cumulative database, whose citations were processed over many years with various versions of SemRep and the various UMLS releases available at different time periods. To illustrate, consider the following example (PMID: 20109154): "The results showed that V. cholerae O395 and also other related enteric pathogens have the essential CASS components (CRISPR and cas genes) to mediate a RNAi-like pathway." The current SemRep version extracts the following predicate: CRISPR-AFFECTS-RNAi, while SemMedDB does not contain any predicates for this sentence. The corresponding paper was published in 2009, but the CRISPR term (C3658200) did not exist in the UMLS Metathesaurus on or before 2012, which is why no CRISPR-involved relation could be identified at the time this citation was added to SemMedDB.

Graph Embedding. We embed our large semantic graph using a heterogeneous technique that captures node similarity through a biased transformed dot product. By explicitly including a bias term for each node, we capture a concept’s overall affinity within the network, which is critical for such general terms as “coronavirus.” By learning transformations between each pair of node types (e.g., between sentences and lemmas), we enable each type to occupy embedding spaces with differing characteristics. Specifically, we fit an embedding model that optimizes the following similarity measure:

\[ S(u, v) = u_1 + v_1 + T_{uv,1} + \sum_{i=2}^{d} u_i \left( v_i \, T_{uv,i} \right), \tag{1} \]

where $u$, $v$ denote both nodes in the semantic graph and their embeddings, and $T_{uv}$ is the directional transformation vector between nodes of $u$’s type and nodes of $v$’s type. We use the PTBG heterogeneous graph embedding library to learn $d = 512$ dimensional embeddings for each node of our large semantic graph. While fitting embeddings ($u$) and transformation vectors ($T_{uv}$), we represent each edge of the semantic graph as two directed edges.
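As a small numerical illustration of the similarity measure in Eq. (1), the sketch below evaluates it on plain Python lists. It assumes the reading printed above, in which the first component of each vector acts as a bias and the remaining components of the transformation vector scale the target embedding element-wise. The function name `similarity` and the three-dimensional toy vectors are hypothetical; the actual embeddings are 512-dimensional and are learned with PTBG.

```python
def similarity(u, v, t_uv):
    """Biased, transformed dot product following Eq. (1).

    u, v  : embeddings of the source and target node (index 0 holds the bias)
    t_uv  : transformation vector for the (type of u, type of v) pair
    """
    # Bias part: first components of both embeddings and of the transformation.
    bias = u[0] + v[0] + t_uv[0]
    # Interaction part: dot product over components 2..d with v scaled by t_uv.
    interaction = sum(
        ui * (vi * ti) for ui, vi, ti in zip(u[1:], v[1:], t_uv[1:])
    )
    return bias + interaction


# Toy vectors (hypothetical values, d = 3 instead of 512).
u = [1.0, 2.0, 3.0]
v = [0.5, 1.0, 1.0]
t_uv = [0.25, 2.0, 0.5]
score = similarity(u, v, t_uv)
print(score)
```

Keeping the bias outside the dot product is what lets a very general node receive a high score with many partners regardless of direction in the embedding space.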
The embeddings and transformation vectors are optimized using softmax loss, where the similarity for one edge is compared against the similarities of 100 negative samples.

Ranking Semantic Predicates (Transformer model). After we obtain embeddings for each node in the semantic graph, we train the AGATHA ranking model. This model is trained to rank published subject-object pairs above randomly composed pairs of UMLS concepts (negative samples). Two coded terms, along with a fixed-size random subsample of predicates containing each term, are input to this model. Graph embeddings for each term and predicate are fed into stacked transformer encoder layers, which apply multi-headed self-attention across the embedding set. The last set of encodings is averaged and the result is projected to the unit interval, forming a scalar prediction for the input’s “plausibility.”

Figure 3: Predicate extraction pipeline with a deep learning based Open IE system. (Diagram labels: MEDLINE, CORD-19, Process Abstracts, AllenNLP Predictor, UMLS Concept Tagging, Semnet Filter, Final Predicates.)

Formally, the model to evaluate term pairs is defined as:

\[
\begin{aligned}
f(x, y) &= g\left(\left[\hat{x}\;\; \hat{y}\;\; \hat{x}'_1 \ldots \hat{x}'_k\;\; \hat{y}'_1 \ldots \hat{y}'_k\right]\right) \\
g(X) &= \operatorname{sigmoid}(M\Theta) \\
M &= \frac{1}{|X|} \operatorname{ColSum}\left(E_N(\operatorname{FeedForward}(X))\right) \\
E_0(X) &= X \\
E_{i+1}(X) &= \operatorname{LayerNorm}\left(\operatorname{FeedForward}(A(X)) + A(X)\right) \\
A(X) &= \operatorname{LayerNorm}\left(\operatorname{MultiHeadAttention}(X) + X\right),
\end{aligned}
\tag{2}
\]

where each $x'$ and $y'$ is randomly sampled from the neighborhoods of $x$ and $y$, respectively, and each $\hat{\cdot}$ denotes the graph embedding of the given node. Furthermore, $\Theta$ represents a free parameter, which is fit along with the parameters internal to each FeedForward and MultiHeadAttention layer, following the standard conventions for each.

The above model is fit using margin ranking loss, where predicates from the training set are compared against a large set of negative samples. Additional details pertaining to specific optimization choices surrounding this model are present in the work originally proposing this model [37].
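The final pooling step of Eq. (2) and the margin ranking objective can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the authors' implementation: the transformer encoder stack is omitted entirely (the function takes already-encoded rows as input), and the names `plausibility` and `margin_ranking_loss` as well as the margin value 0.1 are hypothetical choices.

```python
import math


def plausibility(encodings, theta):
    """Mean-pool the encoded rows, project with theta, squash to (0, 1)."""
    n_rows = len(encodings)
    # Column-wise mean over the set of encodings (the matrix M in Eq. 2).
    pooled = [sum(row[j] for row in encodings) / n_rows for j in range(len(theta))]
    logit = sum(m * t for m, t in zip(pooled, theta))
    return 1.0 / (1.0 + math.exp(-logit))


def margin_ranking_loss(pos_score, neg_scores, margin=0.1):
    """Mean hinge penalty for negatives not at least `margin` below the positive."""
    penalties = [max(0.0, margin - (pos_score - neg)) for neg in neg_scores]
    return sum(penalties) / len(penalties)


# Toy example: two encoded rows of width 2 and a hypothetical projection vector.
p = plausibility([[2.0, 0.0], [0.0, 2.0]], theta=[1.0, 1.0])
loss = margin_ranking_loss(0.9, [0.2, 0.85, 0.95], margin=0.1)
print(p, loss)
```

In the loss example only the second and third negatives contribute: one is ranked below the positive but inside the margin, and one is ranked above it.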
4 AUGMENTING SEMANTIC PREDICATES WITH DEEP LEARNING
We used the SemRep predicate extraction system in the first system, AGATHA-C, to extract predicates from the abstracts. However, SemRep relies on expert-coded rules and heuristics to extract biomedical relations, leading to significantly fewer predicates for training. Thus, in order to augment the predicates (for the second system, AGATHA-GP), we decided to use a deep learning based information extraction system by Stanovsky et al. [31]. Figure 3 shows our overall predicate extraction pipeline.

Abstract Pre-processing. The input for the proposed semantic predicate extraction system is the set of output files generated by the SemRep tool with the full-fielded output option enabled, obtained from the pre-processing stage described in Sec. 3. As mentioned previously, the SemRep system not only extracts semantic triples, but also maps entities found in the input corpus to their corresponding UMLS concept IDs; this is the data used by the following method. The initial set of records includes the raw sentence texts and the UMLS terms extracted from them, and it is augmented throughout the pipeline, making it easier to extract final predicates for downstream training.

Raw Predicate Extraction. We use a pre-trained instance of RnnOIE [31] provided as an API by AllenNLP. The model was trained on the OIE2016 corpus. At a high level, the model aims to learn a joint embedding of individual words and their corresponding Beginning-Inside-Outside (BIO) tags.
The output of the model is a probability distribution over the BIO tags. During inference the model selects specific phrases and groups them into ARG0, V, and ARG1 tags. By convention, we treat ARG0 as the subject and ARG1 as the object in a subject-verb-object tuple. To speed up processing and scale it to thousands of abstracts, we leverage model parallelism across different machines and run batch-mode inference on chunks of abstracts. Once the model predictions have been obtained, we extract the phrases with relevant tags into raw predicates and add them to the record. A subsequent filtering is performed by extracting the terms that match previously detected UMLS concepts in the sentence.

Semnet Filtering. Using a general-purpose RnnOIE model has its own challenges. During processing we noted that a lot of raw predicates were either too general or contained too little meaning to be useful for training a prediction model. To overcome this challenge we designed a corrective filter to reduce noise and retain the most useful predicates. We call this filter the semnet filter.

Each UMLS concept has an associated semantic type (e.g., COVID-19 has an associated semantic type of dsyn (disease)). This is useful for summarizing a large set of diverse text concepts into a smaller number of categories. We used the metadata from semantic types to construct two networks: a semantic network and a hierarchical network. The semantic network consists of semantic types as nodes, and its edges imply a corresponding direct relation between them. The hierarchical network connects each semantic type to its more general semantic types. For example, the semantic type dsyn (disease) is more generally associated with biof (biological function) or pathf (pathological function). In order to filter a predicate, all edges emanating from the subject’s semantic types are computed on a per-predicate basis. These edges also include any specific-general concept relationships.
If the object’s semantic type is found to be in the candidate edge set, then we deem the predicate valid. In our experiments, we found that this filtering method eliminates a significant share of the predicates that do not directly pertain to the biomedical domain.

Processing Abstracts at Scale. Building a pipeline that scales to thousands of abstracts is not a trivial task. In order to extract predicates with the RnnOIE model and extract quality terms of interest, we have to contend not only with the problem of running inference on a deep neural network but also with the task of aligning the extracted terms with the entities recognized by SemRep.

Deployment details: The RnnOIE model by Stanovsky et al. uses a deep Bi-LSTM [27] model to learn the joint word embedding and predict the resulting semantic position tags. Since LSTMs are inherently sequential models, the inference time per sentence is considerable. We first tried processing an entire collection of abstracts at once on a cluster of 10 machines, each with 24 CPUs, using the Dask [26] library. The entire process took more than 8 hours. Considering that we had about 100 such collections, this inference time was prohibitively high. In order to speed up inference, we read each collection once and distributed chunks of abstracts over the machines. This change helped us cut the processing time from over a week to just over 4 days for the MEDLINE corpus. For the CORD-19 corpus the processing was even faster, at 2 days.

The next step was to align the extracted predicates with the biomedical concepts recognized by SemRep. We achieved this alignment by first building an index of files that contained a specific abstract ID and then processing the RnnOIE predicates with this index. We further optimized the indexing phase by updating the existing index each time we processed more than 𝜏 abstracts.
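The semnet filter described earlier in this section can be sketched as a small graph lookup. The code below is one possible reading under stated assumptions, not the released implementation: the function names, the dictionary layout of the two networks, and the toy relation `phsu -> pathf` are hypothetical, and it additionally assumes that an object matches when either its own semantic type or one of its more general types is in the candidate set.

```python
def generalizations(sem_type, hierarchy):
    """Return sem_type together with all of its more general semantic types."""
    seen, stack = set(), [sem_type]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hierarchy.get(current, ()))
    return seen


def is_valid_predicate(subject_types, object_types, relations, hierarchy):
    """Keep a predicate if the semantic network links its subject to its object."""
    # Candidate set: every type reachable by one relation edge from the
    # subject's semantic types or from any of their generalizations.
    candidates = set()
    for s_type in subject_types:
        for general in generalizations(s_type, hierarchy):
            candidates |= relations.get(general, set())
    # Assumption: the object may match through its own type or a more general one.
    return any(generalizations(o, hierarchy) & candidates for o in object_types)


# Toy networks (hypothetical edges; only the dsyn generalization chain
# loosely follows the example given in the text).
hierarchy = {"dsyn": ["pathf"], "pathf": ["biof"]}
relations = {"phsu": {"pathf"}}

kept = is_valid_predicate({"phsu"}, {"dsyn"}, relations, hierarchy)
dropped = is_valid_predicate({"geoa"}, {"dsyn"}, relations, hierarchy)
print(kept, dropped)
```

Because the check reduces to a few set lookups per predicate, a filter of this shape is cheap compared with the neural inference step that precedes it.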
The semnet filter does not introduce additional computational overhead and can process a thousand abstracts in under 1 second. Hence, to obtain the most relevant set of predicates, we were able to parallelize over “checkpoints” (each of which contained 30k abstracts) and finish in an hour.

5 VALIDATION
A fair validation of HG systems is extremely challenging, as these models are designed to predict novel connections that are unknown even to those who evaluate the system [34]. In addition, even when a system is validated by rediscovering findings using historical data, the process is computationally expensive: multiple models must be trained to understand how many months (or years) back the HG system can predict the findings, which requires careful filtering of the papers, vocabulary and other types of data used. To present our results in terms of their usefulness for urgent CORD-19-related HG, we use a historical benchmark, which is conceptually described in [37]. This technique is fully automated and does not require any intervention by domain experts.

Positive samples collection. We use SemRep and the approach proposed in Sec. 4 to process the most recent CORD-19 citations, which were published after a specific cut date, making sure that these citations are not included in the training set. After that we extract all subject-object pairs from the obtained results and explicitly check that none of these pairs is present in the training set. Pairs mentioned in CORD-19 fewer than twice are filtered out of the validation set. Almost all of them are either noisy or represent information that already appears in other pairs (e.g., because of differences in grammar).

We also use the strategy of subdomain recommendation, which works in the following way. For each UMLS term we collect its semantic type (which is part of the metadata provided in the UMLS Metathesaurus) and group all extracted SemRep pairs by the term-pair criterion (the combination of subject and object types).
Then we identify the 20 most common term-pair subdomains and construct the validation set from pairs belonging to these 20 subdomains.

Negative samples generation. To generate negative samples per domain, random sampling is used: for each positive sample we keep its subject and randomly sample an object belonging to the same semantic type as the object of the source pair. We do this 10 times, thus obtaining 10 negative domain-specific samples for each positive sample. Once the validation set is generated, we apply our ranking criteria to it, obtaining a numerical score 𝑠 ∈ [0, 1] for each sample.

Evaluation metrics. We propose our approach as a recommendation system, and to report our results we use a combination of the following classification and recommendation metrics.
• Classification metrics: (1) Area under the receiver operating characteristic curve (AUC ROC); (2) Area under the precision-recall curve (AUC PR).
• Recommendation metrics: (1) Top-k precision (P.@k); (2) Average precision (AP.@k); and (3) Overall reciprocal rank (RR).
We report these numbers per subdomain to better understand how the system performs with respect to a specific task (e.g., drug repurposing).
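The negative sampling scheme and the recommendation metrics listed above can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the reported numbers: all function names and the toy term names are hypothetical, the true object is excluded from the sampling pool as an added assumption, and the average precision shown is one common definition (normalized by the number of hits within the top k), which may differ in detail from the one used in the paper.

```python
import random


def negative_samples(positive, same_type_objects, n=10, seed=0):
    """Keep the subject and draw n random objects of the same semantic type."""
    subject, true_object = positive
    rng = random.Random(seed)
    # Assumption: the true object itself is excluded from the pool.
    pool = [o for o in same_type_objects if o != true_object]
    return [(subject, rng.choice(pool)) for _ in range(n)]


def rank_labels(scored):
    """Sort (score, label) pairs by descending score and return the labels."""
    return [label for _, label in sorted(scored, key=lambda p: p[0], reverse=True)]


def precision_at_k(labels, k):
    """Fraction of positives among the k highest-ranked samples."""
    return sum(labels[:k]) / k


def reciprocal_rank(labels):
    """Inverse rank of the first positive sample, 0.0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(labels, k):
    """Mean of precision values taken at each positive within the top k."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy data: one positive pair and a few scored samples.
negs = negative_samples(("covid-19", "oxytocin"), ["oxytocin", "melatonin", "insulin"])
labels = rank_labels([(0.2, 0), (0.9, 1), (0.7, 0), (0.8, 1)])
print(len(negs), labels, precision_at_k(labels, 2), reciprocal_rank(labels))
```

Here label 1 marks a positive (published) pair and label 0 a sampled negative, so a perfect model places all 1s at the head of the ranked list.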
6 RESULTS
To report results, we provide the performance measures for three AGATHA models trained on the same input data (MEDLINE corpus and CORD-19 abstracts dataset):
(1) AGATHA-O: Baseline AGATHA model [37];
(2) AGATHA-C: AGATHA-O with the new UMLS layer and SemRep enrichment;
(3) AGATHA-GP: AGATHA-C with additional predicates extracted by deep learning and further filtered.
We compare the models in this particular manner because learning the proposed ranking criteria depends heavily on the quality and number of extracted semantic predicates, as they form the training set for the AGATHA ranking module.

At the moment of writing, no other publicly available general-purpose HG system compliant with the three validation criteria, namely, (a) ability to run thousands of queries in a reasonable time, (b) ability to process COVID-19 related vocabulary, and (c) ability to operate in multiple domains, was available for comparison.

The performance of both AGATHA-C and AGATHA-GP makes it possible to run thousands of queries in a very short time (on the order of minutes), making validation on a large number of samples possible. Unfortunately, given the current circumstances, large-scale validation for the specific scientific subdomain (COVID-19 related hypotheses) is hard to implement, because a well-established and reliable factual base is still being actively developed and a large historical gap for the vocabulary simply does not exist (e.g., the COVID-19 term is only approximately one year old). We do, however, provide a validation set including 2736 positive connections extracted from CORD-19 dataset citations added between October 28, 2020 and January 21, 2021, which numbered 77 thousand abstracts.

Table 1: Graph metrics (M = millions, B = billions).

Node Type      AGATHA-O    AGATHA-C    AGATHA-GP
Sentence       190.6 M     190.6 M     190.6 M
Predicate      24.2 M      36.3 M      38.7 M
Lemma          16.8 M      16.1 M      16.1 M
Entity         41.7 M      43.2 M      43.2 M
Coded Term     538,588     855,351     855,351
𝑛-Grams        212,922     326,864     333,575
Total Nodes    274.1 M     287.4 M     289.8 M
Total Edges    13.52 B     13.5 B      13.53 B

In Table 1, we share some basic graph metrics for the models AGATHA-O, AGATHA-C and AGATHA-GP. The most significant change is observed in the number of semantic predicates and coded terms, which clearly reflects the purpose of introducing the additional preprocessing steps.

In Table 2, we compare the aforementioned models using the metrics described in Sec. 5. We present predicate types with NLM semantic type codes [23] due to space restrictions. Both the AGATHA-C and AGATHA-GP models show significant gains when compared to the AGATHA-O baseline model. The gains in the areas most problematic for the baseline model (e.g., (Gene) → (Gene), denoted by (gngm,gngm)) serve as the best illustration, showing up to almost 30 percent advantage in ROC AUC. All of the most popular biomedical subdomains are now covered by the proposed models and show AUC ROC results of at least 0.87. The average ROC AUC value is increased by 0.09.

Our validation strategy involves a large number of many-to-many queries, making the area under the precision-recall curve another very illustrative metric. This is where the newly proposed models show even more drastic improvements over the baseline AGATHA-O. For some subdomains, like (Gene or Genome) → (Gene or Genome) (gngm,gngm) or (Amino Acid, Peptide, or Protein) → (Gene or Genome) (aapp,gngm), we observe that the new models take recommendation performance to a new level of quality. The average PR AUC value is increased by 0.16.

The approximate running time, with the corresponding types of hardware used, is presented in Table 3. Each row corresponds to a stage in the AGATHA-C/AGATHA-GP pipelines. The columns “M” (machines) and “CPU” show the number of machines and required CPUs, respectively. In the column “GPU” we indicate whether a GPU was required or optional. For AGATHA training we used two NVIDIA V100 GPUs per machine.
The minimal requirements for RAM per machine are in the column “RAM”. The running time of queries is negligible.

7 CASE STUDY
The proactive discovery of ongoing research findings is an important component in the validation of hypothesis generation systems [36]. In particular, in the current uncertain situation, when a lot of unintentionally incorrect discoveries are published, the validation must include a human-in-the-loop part, even in a limited capacity such as in [2, 30]. To demonstrate the predictive potential of AGATHA-C, we perform a case study on three COVID-19-related novel connections manually selected by the domain expert. These connections were published after the cut date before which any data used in training was available to download at NIH. At a low level, all AGATHA models use entity subsampling to calculate pairwise ranking criteria, which means that the absolute numbers may fluctuate slightly. Thus, to present the numeric scores, each experiment was repeated 100 times to compute the average and standard deviation that we present in Table 4.

AGATHA-C was tested on whether it is able to predict compounds potentially applicable to the treatment of COVID-19 and the genes involved in SARS-CoV-2 pathogenesis. The data confirming the cardiovascular protective effects of the hormone oxytocin were published recently [9, 40]. The protective effect is linked to

Table 2: Classification and recommendation quality metrics across recently popular COVID-19-related biomedical subdomains.
Labels O, C and GP stand for AGATHA-O , AGATHA-C and AGATHA-GP models, respectively. ROC AUC PR AUC RR P.@10 P.@100 AP.@10 AP.@100 O C GP O C GP O C GP O C GP O C GP O C GP O C GP orch:dsyn 0.91 0.93 0.92 0.47 0.57 0.55 1.00 1.00 0.50 0.60 0.90 0.70 0.48 0.59 0.61 0.79 0.88 0.64 0.64 0.73 0.71 aapp:dsyn 0.90 0.95 0.95 0.45 0.58 0.63 1.00 0.50 1.00 0.60 0.70 0.90 0.52 0.56 0.65 0.79 0.73 0.98 0.57 0.66 0.74 phsu:dsyn 0.89 0.93 0.94 0.40 0.48 0.57 0.50 0.12 1.00 0.40 0.20 0.80 0.50 0.56 0.69 0.56 0.17 0.98 0.43 0.49 0.76 orch:orch 0.85 0.92 0.91 0.47 0.60 0.57 1.00 1.00 1.00 0.90 0.80 0.70 0.51 0.60 0.57 1.00 0.99 0.79 0.66 0.76 0.71 phsu:phsu 0.85 0.90 0.91 0.35 0.41 0.47 0.33 0.20 1.00 0.30 0.50 0.50 0.39 0.42 0.47 0.40 0.38 0.78 0.44 0.49 0.56 orch:phsu 0.87 0.93 0.93 0.51 0.60 0.57 1.00 1.00 1.00 0.90 0.80 0.80 0.49 0.56 0.52 0.91 0.91 0.86 0.68 0.72 0.67 fndg:dsyn 0.89 0.95 0.94 0.46 0.60 0.60 1.00 1.00 1.00 0.60 0.80 0.80 0.56 0.69 0.69 0.88 0.80 0.75 0.65 0.68 0.72 orch:aapp 0.87 0.93 0.93 0.57 0.66 0.73 1.00 1.00 1.00 0.90 0.90 0.90 0.48 0.55 0.60 0.88 0.98 1.00 0.77 0.79 0.84 geoa:spco 0.79 0.77 0.93 0.32 0.23 0.52 1.00 0.50 1.00 0.60 0.30 0.60 0.39 0.26 0.56 0.91 0.51 0.84 0.54 0.35 0.64 geoa:idcn 0.65 0.81 0.88 0.10 0.11 0.28 0.05 0.03 0.50 0.00 0.00 0.70 0.17 0.09 0.25 0.00 0.00 0.69 0.14 0.06 0.45 topp:dsyn 0.90 0.95 0.95 0.53 0.66 0.66 1.00 1.00 1.00 0.90 0.90 0.90 0.60 0.77 0.72 0.96 0.88 0.95 0.72 0.82 0.86 hlca:dsyn 0.89 0.96 0.96 0.58 0.72 0.72 1.00 1.00 1.00 0.90 1.00 0.80 0.46 0.54 0.56 0.88 1.00 0.79 0.75 0.79 0.78 gngm:dsyn 0.93 0.97 0.96 0.47 0.72 0.74 0.50 1.00 1.00 0.60 0.80 0.90 0.48 0.65 0.66 0.62 0.82 1.00 0.50 0.79 0.82 fndg:humn 0.83 0.92 0.91 0.38 0.53 0.54 1.00 0.50 0.50 0.60 0.70 0.80 0.45 0.64 0.63 0.65 0.69 0.73 0.62 0.69 0.77 gngm:gngm 0.66 0.88 0.89 0.14 0.40 0.41 0.10 0.50 1.00 0.10 0.60 0.30 0.15 0.45 0.44 0.10 0.51 0.61 0.17 0.49 0.52 dsyn:fndg 0.81 0.91 0.92 0.31 0.44 0.43 0.25 0.50 0.33 0.20 0.60 0.60 0.42 0.49 0.46 0.32 0.55 
0.49 0.45 0.53 0.51 phsu:fndg 0.78 0.91 0.90 0.28 0.51 0.47 0.50 1.00 1.00 0.50 0.50 0.50 0.30 0.49 0.46 0.54 0.76 0.68 0.41 0.62 0.58 dsyn:humn 0.80 0.87 0.88 0.30 0.40 0.42 1.00 0.50 0.20 0.70 0.50 0.50 0.35 0.49 0.54 0.81 0.45 0.40 0.56 0.58 0.56 dsyn:dsyn 0.86 0.92 0.92 0.40 0.50 0.53 0.50 1.00 1.00 0.60 0.80 0.70 0.55 0.65 0.66 0.54 1.00 0.86 0.55 0.67 0.73 aapp:gngm 0.70 0.88 0.87 0.19 0.36 0.37 0.14 0.33 0.20 0.10 0.30 0.30 0.24 0.42 0.42 0.14 0.29 0.32 0.27 0.43 0.47 Mean 0.83 0.91 0.92 0.38 0.50 0.54 0.69 0.68 0.81 0.55 0.63 0.69 0.42 0.52 0.56 0.63 0.66 0.76 0.53 0.61 0.67 Table 3: Running time and hardware requirements. Stage Time Hardware M CPU GPU RAM SemRep Processing 2 d 10-28 20+ Opt N/A AllenNLP Predicates 3 d 28-40 20+ Opt N/A Graph Construction 10 d 30+ 20+ Opt 120GB+ Graph Conversion 7 h 1 40+ Opt 1TB+ Graph Embedding 1 d 20 24+ Opt 120GB+ AGATHA Training 22 h 5+ 2+ Yes 300GB+ Network Adjacency 1 d 1 40+ Opt 1.5TB+ Table 4: Scores for valid recently published connections ob- tained by different AGATHA models. Reported average val- ues for 100 runs and standard deviation. AGATHA-O AGATHA-C AGATHA-GP COVID-19:Melatonin 0.63 ± 0.03 0.91 ± 0.03 0.78 ± 0.03 COVID-19:Oxytocin 0.75 ± 0.03 0.98 ± 0.02 0.81 ± 0.02 COVID-19:BST2 gene 0.41 ± 0.01 0.88 ± 0.03 0.74 ± 0.03 anti inflammatory activity of the hormone. For this connection AGATHA-C generated the score of 0.98. Similarly, we tested the prediction of the effects of the other hormone, melatonin. Several publications, started from November 2020 [3, 8, 13, 43] show the protective effects of melatonin, specifi- cally for COVID-19 neurological complications. The activity was linked to anti-oxidative effects of the melatonin. For this connection AGATHA-C generated the score of 0.91. Our system accurately predicted with score of 0.88 the involve- ment of tetherin (BST2). 
The results published in 2021 [32] show that tetherin restricts the secretion of SARS-CoV-2 viral particles and is downregulated by SARS-CoV-2. Therefore, pharmacological activation of tetherin expression, or inhibition of its degradation, could be a promising direction for the development of SARS-CoV-2 treatments.

8 LESSONS LEARNED AND OPEN PROBLEMS

Quality of the information retrieval pipelines. Information retrieval is an important part of any HG pipeline. In order to uncover implicit connections, the system should be able to capture existing explicit connections with as much quality as possible. Given that human knowledge is usually stored in a non-structured manner (e.g., scientific texts), the quality of systems that process raw textual data, such as those that solve the named entity recognition or word sense disambiguation problems, is crucial. We observed that the SemRep system performs better concept and relation recognition when full abstracts are used as input data instead of single sentences. SemRep also supports optional sortal anaphora resolution to extract co-references to the entities from neighbouring sentences, which was shown to be useful in [17] and is used in this work.

"Positive" research bias. The absence of published negative research results is a big problem for the HG field. With mostly positive results available, we often have to generate negative examples through some kind of random sampling. These negative samples likely do not adequately represent the real nature of negatively confirmed scientific findings. One of the most important future work directions in the area of HG is likely to be accurately distinguishing and leveraging positive and negative proposed results.

Domain experts involvement. When any hypothesis generation system is built, one of the first questions a designer should address is the extent to which domain experts are expected to participate in the pipeline. Modern decision-making systems allow a fully automated discovery process (like the AGATHA system), but this may not be sufficient. A domain expert who interfaces with an HG system as a black box may not trust generated results or know how best to interpret them. The challenge of interpretable hypothesis generation remains a significant barrier to widespread adoption of these kinds of research tools. For this purpose, we advocate using our “structural” learning HG system MOLIERE [35], in which topic modeling and network analytic measures are used to interpret and explain the results.

The nature of input corpora. The question of what should be used as input to a topic-modeling based hypothesis generation system is raised in [34]. Using full-text papers shows an improvement, but the trade-off between run time and output quality was barely justifiable. However, deep learning models have a greater potential for extracting useful information from large input sources and, as was demonstrated in our previous work [37], show significant performance advancements. Thus the question of using full-text papers in deep learning-based hypothesis generation systems should be addressed. Unfortunately, it is currently too computationally expensive for our resources, as the number of sentences, and thus of predicates and edges, will be significantly larger.

Knowledge resolution. Our newly proposed systems showed that the knowledge resolution plays a major role in subdomain recommendation.
To increase the scope of model expertise (and the scope of potential applications beyond the biomedical fields) we deliberately incorporate a general-purpose information retrieval system, RnnOIE, into AGATHA-GP. This additional information results in significant gains in broad subdomains like (Geographic Area) → (Idea or Concept) (geoa,idcn). At the same time, we observe that AGATHA-C performs better in “microscopic” biomedical areas, e.g. (Organic Chemical) → (Organic Chemical) (orch,orch), which raises the question of choosing the appropriate model for every specific use case. Although both systems process all types of queries, the general-purpose predicates that participated in training significantly improve “macroscopic” types of queries.

9 RELATED WORK

A number of works have been proposed to organize the CORD-19 literature into a structured knowledge graph for different purposes. For instance, Basu et al. [5] propose ERLKG, a knowledge graph built on CORD-19 with entities corresponding to gene/chemical/disease names and edges forming relations between the concepts. They use a fine-tuned SciBERT model for both entity and relation extraction. The main purpose of the knowledge graph is to predict a link between a given chemical-disease and chemical-protein pair using a trained GCN autoencoder [19] approach. In another similar work, Oniani et al. [25] build a co-occurrence network on a subset of CORD-19 with the edges corresponding to either gene-disease, gene-mutation or chemical-disease type. The network is then embedded into latent space using a node2vec walk. Link prediction is performed on the nodes by training different classical machine learning algorithms. A major shortcoming of these approaches is that they limit themselves to specific kinds of entities or relations or both; as a result, not only is the scope of possible new literature narrowed, but a lot of additional useful knowledge is filtered out of the system.
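As a rough illustration of the co-occurrence style of link prediction described above, the following minimal sketch builds a co-occurrence graph from per-document entity sets and scores unlinked pairs. It is not the implementation of Oniani et al. [25]: the node2vec embedding and trained classifiers are swapped for a simple Jaccard neighborhood overlap, and the entity names (`gene_A`, `disease_X`, and so on) and function names are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Build an undirected co-occurrence graph.

    docs: iterable of sets of entity names that appear together in one document.
    Returns {entity: {neighbor: co-occurrence count}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for entities in docs:
        for a, b in combinations(sorted(entities), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def jaccard_score(graph, u, v):
    """Neighborhood overlap of u and v (a stand-in for a learned link predictor)."""
    nu = set(graph[u]) - {v}
    nv = set(graph[v]) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


# Hypothetical toy corpus: each set lists the entities tagged in one document.
docs = [
    {"gene_A", "disease_X"},
    {"gene_A", "chemical_Q"},
    {"gene_B", "disease_X"},
    {"gene_B", "chemical_Q"},
    {"gene_C", "disease_Y"},
]
graph = build_cooccurrence(docs)

# Candidate chemical-disease links that never co-occur directly in the corpus.
for u, v in [("chemical_Q", "disease_X"), ("chemical_Q", "disease_Y")]:
    print(u, v, jaccard_score(graph, u, v))
```

In this toy corpus the pair that shares both gene neighbors scores higher than the pair with no shared neighbors. Restricting the graph to a few entity types, as here, is exactly the kind of filtering the paragraph above identifies as a limitation.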
In contrast, our system does not limit itself to a specific entity or relation type and is able to capture much more information from the same corpus.

A major interest of constructing knowledge graphs is to allow medical researchers to re-purpose existing drugs for treating COVID-19. Zhang et al. [42] develop a system that uses combined semantic predications from SemMedDB and CORD-19 (extracted using SemRep) to recommend drugs for COVID-19 treatment. To improve the predications from CORD-19, the authors fine-tune various transformer-based models on a manually annotated internal dataset. Their resulting knowledge graph consists of 131,555 nodes and 2,558,935 edges. Our work, on the other hand, utilizes similar technologies and produces a bigger graph with 287,356,836 nodes and 13,500,291,256 edges. Moreover, we do not post-process extracted relations from SemRep and are still able to achieve a higher ROC metric. Another system, proposed by Martinc et al. [22], uses a fine-tuned SciBERT model to generate contextualized embeddings of CORD-19 articles and, using an initial seed set of targets, proposes possible therapy targets. However, this system is very different from ours, as it treats the entire article as a bag of words and directly trains a word embedding model on CORD-19.

It was earlier noted that KinderMiner [20] provides a web-based literature discovery tool and supports COVID-19 queries. The underlying algorithm is based on a simple keyword co-count between source and target words in a given corpus. While co-count is a fast and scalable approach, it suffers from a lack of “discrimination”, i.e., two keywords occurring together more frequently do not always imply a high degree of correlation.

The vastness of the COVID-19 literature also spurred the need for systems that allow researchers and base users alike to get their COVID-19 queries answered. Systems like CKG (Wise et al.) [41] and SciSight (Hope et al.) [14] currently provide this functionality.
While we do aim to provide an easy-to-use web framework for medical researchers, the scope of the aforementioned systems is beyond the scope of our work.

Unfortunately, none of the existing systems that are trained to accept terms related to COVID-19 or SARS-CoV-2 provided open access for massive validation that would allow a fair comparison, or could be tested in multiple domains like AGATHA-C.

10 CONCLUSIONS

We present two graph mining transformer-based models, AGATHA-C and AGATHA-GP, for micro- and macroscopic scales of queries respectively, which are designed to help domain experts solve high-priority research problems and accelerate scientific discovery. We perform per-subdomain validation of these new models on a rapidly changing COVID-19 focused dataset, composed of recently published concept pairs, and demonstrate that the proposed models achieve state-of-the-art prediction quality. Both models significantly outperform the existing baseline system AGATHA-O. We deploy the proposed models to the broad scientific community and believe that our contribution can raise more interest in prospective hypothesis generation applications.

REFERENCES

[1] [n.d.]. Citations Added to MEDLINE by Fiscal Year. https://www.nlm.nih.gov/bsd/stats/cit_added.html
[2] Marina Aksenova, Justin Sybrandt, Biyun Cui, Vitali Sikirzhytski, Hao Ji, Diana Odhiambo, Matthew D Lucius, Jill R Turner, Eugenia Broude, Edsel Peña, et al. 2019.
Inhibition of the Dead Box RNA Helicase 3 prevents HIV-1 Tat and cocaine-induced neurotoxicity by targeting microglia activation. Journal of Neuroimmune Pharmacology (2019), 1–15.
[3] Lise Alschuler, Ann Marie Chiasson, Randy Horwitz, Esther Sternberg, Robert Crocker, Andrew Weil, and Victoria Maizes. 2020. Integrative medicine considerations for convalescence from mild-to-moderate COVID-19 disease. Explore (2020).
[4] Patrick Arnold and Erhard Rahm. 2015. SemRep: A repository for semantic mapping. Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).
[5] Sayantan Basu, Sinchani Chakraborty, Atif Hassan, Sana Siddique, and Ashish Anand. 2020. ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora. In Proceedings of the First Workshop on Scholarly Document Processing. Association for Computational Linguistics, Online, 127–137. https://doi.org/10.18653/v1/2020.sdp-1.15
[6] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019).
[7] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology.
[8] Daniel P Cardinali, Gregory M Brown, and Seithikurippu R Pandi-Perumal. 2020. Can Melatonin Be a Potential “Silver Bullet” in Treating COVID-19 Patients? Diseases 8, 4 (2020), 44.
[9] Phuoc-Tan Diep. 2021. Is there an underlying link between COVID-19, ACE2, oxytocin and vitamin D? Medical Hypotheses 146 (2021), 110360.
[10] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah. 1989. Fish-oil dietary supplementation in patients with Raynaud’s phenomenon: a double-blind, controlled, prospective study. Am J Med 86, 2 (Feb 1989), 158–164.
[11] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017.
AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv:1803.07640
[12] Vishrawas Gopalakrishnan, Kishlay Jha, Wei Jin, and Aidong Zhang. 2019. A survey on literature based discovery approaches in biomedical domain. Journal of biomedical informatics 93 (2019), 103141.
[13] Ping Ho, Jing-Quan Zheng, Chia-Chao Wu, Yi-Chou Hou, Wen-Chih Liu, Chien-Lin Lu, Cai-Mei Zheng, Kuo-Cheng Lu, and You-Chen Chao. 2021. Perspective Adjunctive Therapies for COVID-19: Beyond Antiviral Therapy. International Journal of Medical Sciences 18, 2 (2021), 314.
[14] Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S. Weld, Marti A. Hearst, and Jevin West. 2020. SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search. arXiv:2005.12668 [cs.IR]
[15] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
[17] H. Kilicoglu, G. Rosemblat, M. Fiszman, and T. C. Rindflesch. 2016. Sortal anaphora resolution to enhance relation extraction from biomedical literature. BMC Bioinformatics 17 (Apr 2016), 163.
[18] Halil Kilicoglu, Dongwook Shin, Marcelo Fiszman, Graciela Rosemblat, and Thomas C. Rindflesch. 2012. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinform. 28, 23 (2012), 3158–3160. http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics28.html#KilicogluSFRR12
[19] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[20] F. Kuusisto, J. Steill, Z. Kuang, J. Thomson, D. Page, and R. Stewart. 2017.
A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications. AMIA Jt Summits Transl Sci Proc 2017 (2017), 166–174.
[21] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[22] Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, and Senja Pollak. 2020. COVID-19 Therapy Target Discovery with Context-Aware Literature Mining. In Discovery Science, Annalisa Appice, Grigorios Tsoumakas, Yannis Manolopoulos, and Stan Matwin (Eds.). Springer International Publishing, Cham, 109–123.
[23] A. T. McCray, A. Burgun, and O. Bodenreider. 2001. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform 84, Pt 1 (2001), 216–220.
[24] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).
[25] David Oniani, Guoqian Jiang, Hongfang Liu, and Feichen Shen. 2020. Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases. Journal of the American Medical Informatics Association 27, 8 (05 2020), 1259–1267.
[26] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[27] M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093
[28] Neil R Smalheiser. 2017. Rediscovering Don Swanson: The past, present and future of literature-based discovery. Journal of Data and Information Science 2, 4 (2017), 43–64.
[29] Scott Spangler. 2015.
Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation. Chapman and Hall/CRC.
[30] Scott Spangler, Angela D Wilkins, Benjamin J Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R Pickering, Austin Comer, Jeffrey N Myers, et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1877–1886.
[31] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised Open Information Extraction. In Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). Association for Computational Linguistics, New Orleans, Louisiana, (to appear).
[32] Hazel Stewart, Kristoffer H Johansen, Naomi McGovern, Roberta Palmulli, George W Carnell, Jonathan Luke Heeney, Klaus Okkenhaug, Andrew Firth, Andrew A Peden, and James R Edgar. 2021. SARS-CoV-2 spike downregulates tetherin to enhance viral spread. bioRxiv (2021), 2021–01.
[33] Don R Swanson. 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine 30, 1 (1986), 7–18.
[34] Justin Sybrandt, Angelo Carrabba, Alexander Herzog, and Ilya Safro. 2018. Are Abstracts Enough for Hypothesis Generation?. In 2018 IEEE International Conference on Big Data (Big Data). 1504–1513. https://doi.org/10.1109/bigdata.2018.8621974
[35] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. MOLIERE: Automatic Biomedical Hypothesis Generation System. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). ACM, New York, NY, USA, 1633–1642. https://doi.org/10.1145/3097983.3098057
[36] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2018. Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking.
In 2018 IEEE International Conference on Big Data (Big Data). 1494–1503. https://doi.org/10.1109/bigdata.2018.8622637
[37] Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. 2020. AGATHA: Automatic Graph Mining And Transformer Based Hypothesis Generation Approach. Association for Computing Machinery, New York, NY, USA, 2757–2764. https://doi.org/10.1145/3340531.3412684
[38] Huijun Wang, Ying Ding, Jie Tang, Xiao Dong, Bing He, Judy Qiu, and David J Wild. 2011. Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS one 6, 3 (2011), e17243.
[39] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, K. Funk, Rodney Michael Kinney, Ziyang Liu, W. Merrill, P. Mooney, D. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. ArXiv (2020).
[40] Stephani C Wang and Yu-Feng Wang. 2021. Cardiovascular protective properties of oxytocin against COVID-19. Life Sciences (2021), 119130.
[41] Colby Wise, Vassilis N. Ioannidis, Miguel Romero Calvo, Xiang Song, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020. COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature. arXiv:2007.12731 [cs.IR]
[42] Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman, and Halil Kilicoglu. 2020. Drug Repurposing for COVID-19 via Knowledge Graph Completion. arXiv:2010.09600 [cs.CL]
[43] Petra Zimmermann and Nigel Curtis. 2020. Why is COVID-19 less severe in children? A review of the proposed mechanisms underlying the age-related difference in severity of SARS-CoV-2 infections. Archives of Disease in Childhood (2020).
10_1101-2021_02_11_430762 ---- Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation

SOFTWARE

Alejandro A. Schäffer1,2, Richard McVeigh2, Barbara Robbertse2, Conrad L. Schoch2, Anjanette Johnston2, Beverly A. Underwood2, Ilene Karsch-Mizrachi2 and Eric P.
Nawrocki2*

Abstract

Background: The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron.

Results: To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. At least nine freely available blastn rRNA databases created and maintained with Ribovore are used either for checking incoming GenBank submissions or by the blastn browser interface at NCBI or both. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8,350 taxa.

Conclusion: Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences.
It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.

Keywords: ribosomal RNA; annotation; alignment; ncRNA

* Correspondence: nawrocke@ncbi.nlm.nih.gov
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894 USA
Full list of author information is available at the end of the article

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.11.430762; this version posted February 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

Background

In 1977, Carl Woese and George Fox proposed the Archaebacteria (later renamed Archaea) as a third domain of life distinct from Bacteria and Eukaryota based on analysis of small subunit ribosomal RNA (SSU rRNA) oligonucleotide fragments from 13 microbes [1]. The use of SSU rRNA to elucidate phylogenetic relationships continued and dramatically expanded in the late 1980s when Norm Pace and colleagues developed a technique to PCR amplify potentially unculturable microbes from environmental samples by targeting so-called universal primer sites [2]. The technique was later refined by Pace and others including Ward, Weller [3] and Giovannoni and colleagues [4]. Environmental studies targeting SSU rRNA as a phylogenetic marker gene that seek to characterize the diversity of life in a given environment have remained common ever since, and consequently there are now millions of prokaryotic SSU rRNA sequences in public databases. When rRNA sequences are submitted to public databases, such as GenBank, it is important to do quality control, so that subsequent data analyses are not misled by errors in sequencing and sequence annotation. Because rRNA gene sequences do not code for proteins, but have been studied so extensively, specialized checks for correctness and completeness are feasible and desirable. The focus of this paper is the description of Ribovore, a software package for validating incoming rRNA sequence submissions to GenBank and for curating rRNA sequence collections.

SSU rRNA was initially chosen by Woese and Fox for inferring a universal phylogenetic tree of life because it existed in all cellular life, was large enough to provide enough data (about 1500 nucleotides (nt) in Bacteria), and had evolved slowly enough to be comparable across disparate groups [5]. The first environmental surveys targeted SSU rRNA, but studies targeting LSU rRNA, which is roughly twice as long as SSU rRNA, followed soon after [6, 7]. These types of analyses eventually began to target eukaryotes, especially Fungi. In eukaryotes, the 5.8S rRNA gene is surrounded by two internal transcribed spacers (ITS1 and ITS2). This region is sometimes collectively referred to as the ITS region and it has been selected as the primary fungal barcode since it has the highest probability of successful identification for the broadest range of Fungi [8]. However, the LSU rRNA gene [9] is a popular phylogenetic marker in certain fungal groups [8]. In general, the nuclear SSU rRNA has poor species-level resolution in most Fungi and other eukaryote taxonomic groups [10, 8], but remains useful at species level in some rapidly evolving groups such as the diatoms [11].
Species identification in protists takes a two-step barcoding approach, which uses the ∼500 bp variable V4 region of the SSU rRNA gene as a variable marker and then uses a group-specific barcode for species-level assignments, some of which include the LSU rRNA gene and ITS region [10].

Specialized analysis tools and databases have been developed to help researchers analyze their rRNA sequences. Many of these specialized tools are based on comparing sequences to either profile hidden Markov models (profile HMMs) or covariance models (CMs). CMs are profile stochastic context-free grammars, akin to profile HMMs of sequence conservation [12, 13], with additional complexity to model the conserved secondary structure of an RNA family [14, 15, 16]. Like profile HMMs, CMs are probabilistic models with position-specific scores, determined based on the frequencies of nucleotides at each position of the input training alignment used to build the model. Unlike HMMs, CMs also model well-nested secondary structure, provided as a single, fixed consensus secondary structure for each model and annotated in the input training alignment. A CM includes scores for each of the possible 16 (4x4) basepairs for basepaired positions, and both paired positions are considered together by scoring algorithms. The incorporation of secondary structure has been shown to significantly improve remote homology detection of structural RNAs [17], and for SSU rRNA considering structure has been shown to offer a small improvement to alignment accuracy versus profile HMMs [18, 19]. For eukaryotes, where SSU and LSU rRNA sequences are often more divergent at the sequence level than for Bacteria and Archaea, harnessing structural information during alignment may be more impactful.

The specialized tools for rRNA sequence analysis include: databases, some of which are integrated with software, rRNA prediction software, and multiple alignment software. The integrated and highly curated databases include the ARB workbench software package for rRNA database curation [20], the Comparative RNA Website (CRW) [21], the Ribosomal Database Project (RDP) [22, 23], and the Greengenes [24] and Silva [25] databases. These databases differ in their scope and methodology. CRW contains tens of thousands of sequences and corresponding alignments of SSU, LSU and 5S rRNA from all three domains as well as from organelles, along with secondary structure predictions for selected sequences. Greengenes, which is seemingly no longer maintained as its last update was in 2013, includes SSU rRNA sequences for Bacteria and Archaea, but not for Eukarya, nor does it contain any LSU rRNA sequences. RDP also includes SSU rRNA for Bacteria and Archaea, as well as fungal LSU rRNA and ITS sequences, but no other LSU sequences. Silva, which split off from the ARB project starting in 2005 [26], includes bacterial, archaeal and eukaryotic (fungal and non-fungal) SSU and LSU rRNA sequences. RDP includes more than 3 million SSU rRNA and 125,000 fungal LSU rRNA sequences as of its latest release (11.5), and Silva includes more than 9 million SSU rRNA and 1 million LSU rRNA sequences (release 138.1). Available rRNA prediction software packages include RNAmmer [27], rRNAselector [28], and barrnap (https://github.com/tseemann/barrnap), all of which use some version of the profile HMM software HMMER [29] to predict the locations of rRNAs in contigs or whole genomes.
Both RDP and Silva make available multiple alignments of all sequences for each gene and taxonomic domain, and all include several sequence analysis tools for tasks such as classification. The alignment methodology differs: Silva uses SINA, which implements a graph-based alignment algorithm that computes a sequence-only based alignment of an input sequence to one or more similar sequences selected from a fixed reference alignment [30]. RDP uses Infernal [31], which computes alignments using CMs.

Per-domain CMs for SSU and LSU rRNA are freely available in the Rfam database, a collection of more than 3900 RNA families each represented by a consensus secondary structure annotated reference alignment called a seed alignment and corresponding CM built from that alignment [32]. Rfam includes five full length SSU and four full length LSU rRNA families and CMs. Although RDP uses CMs for rRNA alignment, the CMs are not from Rfam. Users can download and use Rfam CMs to annotate their own sequences using Infernal, thus offering a distinct strategy from Silva or RDP for rRNA analysis.

The Rfam database includes a model (RF02542) for SSU rRNA from Microsporidia, a phylum of particular interest within the kingdom of Fungi. More than 30 years ago, Woese and colleagues discovered that Microsporidia have a distinctive ribosome that is smaller and more primitive than the ribosomes of most if not all other eukaryotes [33]. Recently, Barandun and colleagues presented the first crystal structure of the ribosome of Microsporidia, confirming that both the SSU and LSU rRNA are smaller than in other Fungi [34]. Most of the sequence analysis and curation to date in Microsporidia has focused on SSU rather than LSU rRNA.
GenBank processing of rRNA sequences
Data submitted to GenBank are subject to review by NCBI staff to prevent incorrect data from entering NCBI databases. Over the past three decades, personnel called GenBank indexers have spent a large proportion of their time validating incoming submissions of thousands to millions of rRNA sequences due to the large number of rRNA sequences generated in phylogenetic and environmental studies. Similarity searches with blastn have been used to compare submitted rRNA sequences against one of several databases of trusted, high-quality rRNA sequences depending on the taxonomic domain and gene. The blastn query results were a primary source of evidence used to determine if rRNA sequences would be accepted to GenBank or not. Prior to the Ribovore project, suitable blastn databases did not exist for validating submissions of eukaryotic SSU rRNA or LSU rRNA sequences, making checking for those genes especially difficult and time-consuming.

Starting in 2016, a system with predefined criteria for per-sequence blastn results was deployed at NCBI; submissions in which all sequences met those criteria have been automatically accepted into GenBank without any indexer review. Ribovore-based tests began being used in conjunction with or instead of blastn-based tests for some submissions in this system in June 2018. Although the engine inside the pre-2018 validation system, BLAST, is freely available and portable, the system as a whole was internal to GenBank and not portable, preventing researchers who wish to submit sequence data to GenBank (henceforth, called “submitters”) from replicating the tests on their local computers.
For rRNA sequences as well as other sequences of high biological interest, GenBank indexers and other NCBI personnel want to carry out two related and recurrent processes: quick identification of which submitted sequences should be accepted into GenBank, and the construction of non-redundant collections of trusted, full length sequences that have no or few errors. The second problem is the motivation behind the entire RefSeq project [35]. Towards addressing the first problem, the development of an alternative sequence validation system for rRNA included four design goals offering potential improvements over the existing system. First, the system should be as deterministic and as reproducible as possible in deciding whether sequences are accepted or not, which we refer to as passing (accepted) or failing (not accepted), allowing submissions with zero failing sequences to be automatically added to GenBank without the need for any manual GenBank indexer intervention. Some non-determinism over time is unavoidable because various inputs to the system, such as the NCBI taxonomy tree, change over time. Second, the system should be available as a standalone tool that submitters can run on their sequences prior to submission, saving time for both the GenBank indexers and submitters. Third, the system should be general enough to facilitate extension to additional taxonomic groups and rRNA genes. Fourth, the system should be capable of increasing the stringency of tests for quality and adding tests to avoid redundancy to enable producing collections of high quality non-redundant sequences for other applications, such as serving as blastn databases.

Because none of the existing databases or specialized rRNA tools listed above address all of these design goals, we implemented the freely available and portable Ribovore software package for the analysis of SSU rRNA and LSU rRNA sequences
from Bacteria, Archaea, and Eukarya as well as mitochondria from some eukaryotic groups. Ribovore includes several programs designed for related but distinct tasks, each of which has specific rules dictating whether a sequence passes or fails based on deterministic criteria described in detail in the Implementation section and in the Ribovore documentation. rRNA_sensor is a simplified, standalone version of the previous blastn-based system that is more portable and faster for bacterial and archaeal SSU rRNA than the previous system owing to a smaller blastn target database constructed by removing redundancy from the pre-existing blastn database. ribotyper is similar to rRNA_sensor but compares each input sequence against a library of profile HMMs and/or CMs, offering an alternative, and in some cases, more powerful approach than the single sequence-based blastn algorithm. Additionally, ribotyper can be used to validate the taxonomic domain each sequence belongs to because it compares a set of models from different taxonomic groups against each sequence. To take advantage of both single-sequence and profile-based approaches, and partly to ease the transition from the previous blastn-based system towards profile-based analysis, we implemented ribosensor that runs both rRNA_sensor and ribotyper and then combines the results. Up to this point, rRNA_sensor and ribotyper are deliberately designed to accept both partial and complete sequences of moderate quality or better.
To more selectively identify full-length rRNA sequences that extend up to, but not beyond the gene boundaries, we implemented riboaligner, which runs ribotyper as a first pass validation, and then creates multiple alignments and selects sequences that pass based on those alignments. Finally, to make Ribovore capable of generating datasets of trusted sequences from different taxonomic groups for wider use by the community, we developed ribodbmaker, which chooses a non-redundant set of high-quality, full-length sequences based on a series of tests. The pipeline of tests includes some specific to rRNA, including analysis by ribotyper and riboaligner, some more general tests, such as counting ambiguous nucleotides and vector contamination screening, and some tests that require connection to the NCBI taxonomy database to validate the taxonomy assignment of sequences.

Implementation
Ribovore is written in Perl and available at https://github.com/ncbi/ribovore. The Ribovore installation procedure also installs the program rRNA_sensor, which is described here as well. The rRNA_sensor program includes a shell script and Perl scripts and is available at https://github.com/aaschaffer/rRNA_sensor. These packages use existing software as listed in Table 1. Each of the four Ribovore programs takes as input two command-line arguments: the path to an input sequence file in FASTA format and the name of an output directory to create and store output files in. Command-line options exist to change default parameters and behavior of the programs in various ways. The options as well as example usage can be found as part of the source distribution and on GitHub in the form of markdown files in the Ribovore documentation subdirectory (e.g. https://github.com/ncbi/ribovore/blob/master/documentation/ribotyper.md). Central to each of the scripts is the concept of sequences passing or failing.
If a sequence meets specific criteria, many of which are changeable with command-line options, then it will pass and otherwise it will fail, as discussed more below. An overview of the four Ribovore programs and rRNA_sensor is shown in Figure 1.

Table 1 Software packages and libraries used within Ribovore v1.0. *: the esl-cluster executable from Infernal v1.1.2, which is absent in v1.1.4, is also installed and used within Ribovore.
software | website | used within | purpose in Ribovore
Sequip v0.08 | github.com/nawrockie/sequip | all Ribovore programs | option handling, output file handling and other utilities
Infernal v1.1.4* | github.com/EddyRivasLab/infernal | all Ribovore programs | build and use profile HMMs and CMs to classify, validate and align rRNA sequences
BLAST+ v2.11.0 | ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.11.0 | ribodbmaker | build BLAST databases and validate rRNA sequences
VecScreen_plus_taxonomy v0.17 | github.com/aaschaffer/vecscreen_plus_taxonomy | ribodbmaker | screen for vector contamination
GNU time (not required) | - | all Ribovore programs | determine running time if -p option is used
Figure 1 Schematic summarizing the use cases for the four Ribovore programs and rRNA_sensor. Programs listed in white boxes underneath the black boxes are important external programs executed from within the program in the attached black box. [Figure panels, each taking a sequence file as input. Validate and classify ribosomal RNA sequences: ribotyper (executes cmsearch; compares sequences to profiles; reports pass/fail, classification to best-matching model, e.g. SSU.bacteria, and a list of unexpected features, if any), rRNA_sensor (executes blastn; compares to single sequences; reports classification into one of five classes: yes, no, too long, too short or imperfect match) and ribosensor (executes ribotyper and rRNA_sensor; compares to profiles and single sequences; reports pass/fail and a list of ribotyper, rRNA_sensor and GenBank errors, if any). Analyze lengths of ribosomal RNA sequences: riboaligner (executes ribotyper and cmalign; compares to profiles; reports alignment to best-matching model and length classification based on alignment). Create high-quality reference database of ribosomal RNA sequences: ribodbmaker (executes srcchk, vecscreen, blastn, ribotyper, riboaligner and esl-cluster; compares to profiles and single sequences; reports overall pass/fail and per-test pass/fail for ambiguous nucleotides, vector contamination, repetitive sequences, validation by ribotyper and riboaligner, reference model span, and taxonomic ingroup analysis).]

Table 2 Command-line arguments for rRNA_sensor
argument index | argument name | description
1 | min length | lower bound on sequence length
2 | max length | upper bound on sequence length
3 | seq file | input sequence file in FASTA format
4 | output file name | name for summary output file
5 | min id percentage | lower bound on percent identity
6 | max Evalue | upper bound on E-value
7 | nprocessors | number of threads for blastn
8 | output dir | output directory path
9 | blastdb | blastn database

rRNA_sensor
The rRNA_sensor program compares input sequences to a blastn database of verified rRNA sequences using blastn. The program takes nine command-line arguments specified in Table 2. Each input sequence is classified into one of five classes based on its length and blastn results. A sequence is classified as too long or too short if its length is greater than the maximum length or less than the minimum length specified in the command by the user. To allow partial sequences and flexibility in the length, GenBank indexers were typically using a length interval of [400,2500] nt for prokaryotic 16S SSU rRNA. Empirical analysis shows that more than 99.8% of the full-length validated prokaryotic sequences have lengths in the range [900,1800], so this narrower range is recommended if one wants to check that sequences are typically full-length sequences. Sequences within the allowed length range are classified as either no if there are zero blastn hits, yes if they have at least one blastn hit that has an E-value of 1e-40 or less and a percent identity of 80% or more, or imperfect match if there is at least one hit but the E-value or percent identity thresholds are not met for any hits. Sequences that are too long are probably either incorrect or contain extra flanking sequence that should be trimmed, while sequences that are too short may be valid partial sequences.
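The five-class rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic as stated in the text, not the rRNA_sensor implementation (which is a shell script and Perl scripts driving blastn). The names `BlastHit` and `classify_sequence` are hypothetical, and the default length interval [400,2500] is simply the one the text says GenBank indexers typically used.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlastHit:
    """One blastn hit for a query sequence (hypothetical container)."""
    evalue: float
    percent_identity: float


def classify_sequence(
    length: int,
    hits: Iterable[BlastHit],
    min_length: int = 400,        # assumed default: interval cited in the text
    max_length: int = 2500,
    max_evalue: float = 1e-40,    # thresholds quoted in the text
    min_identity: float = 80.0,
) -> str:
    """Return one of: 'too long', 'too short', 'no', 'yes', 'imperfect match'."""
    # Length is checked first; only sequences in range are judged on hits.
    if length > max_length:
        return "too long"
    if length < min_length:
        return "too short"
    hits = list(hits)
    if not hits:
        return "no"
    # 'yes' needs a single hit that meets BOTH thresholds.
    if any(h.evalue <= max_evalue and h.percent_identity >= min_identity
           for h in hits):
        return "yes"
    return "imperfect match"
```

Note that a sequence with one hit passing only the E-value threshold and another passing only the identity threshold is still an imperfect match in this sketch, because the text requires at least one hit that satisfies both.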
The other tests based on quality of blastn matches codify the tests that GenBank indexers were doing internally before rRNA_sensor was implemented. Submitted sequences of a suitable length now classified as no would have been rejected in the past framework; sequences now classified as yes would have been accepted into GenBank in the past framework. In the current testing framework, rRNA_sensor is used as part of the ribosensor program as described below, not by itself.

There are two target blastn databases included with rRNA_sensor, one for prokaryotic 16S SSU rRNA and one for eukaryotic 18S SSU rRNA. The prokaryotic database includes 1267 sequences, 1205 of which are bacterial and the remaining 62 are archaeal. The eukaryotic database includes 1091 sequences. Additional, user-created blastn databases can also be used with the program. The prokaryotic database was updated most recently on June 29, 2017 by filtering and clustering the pre-existing database of 18,816 sequences used by GenBank indexers for 16S SSU rRNA analysis. One could repeat the same procedure with the larger version of the 16S SSU rRNA database described in Results. The initial database was filtered to remove 26 sequences outside the length range [900,1800]. The remaining 18,790 sequences were clustered using UCLUST [36] so that the surviving sequences were no more than 90% identical, leaving 1267 sequences. The eukaryotic 18S SSU rRNA database of 1091 sequences was updated most recently on September 27, 2018 by running version 0.28 of the Ribovore program ribodbmaker on an input set of 579,279 GenBank sequences returned from the eukaryotic SSU rRNA E-utilities (eutils) query provided in Results and discussion with command-line options --skipfribo1 --model SSU.Eukarya --ribo2hmm.

Table 3 Profile models used by Ribovore. '#seqs' is the number of sequences in the multiple alignment used to build the model. 'length' is the number of reference model positions. Abbreviations in 'taxonomy group' column: 'Bac' is Bacteria, 'Euk' is Eukarya and 'Mito' is Mitochondria.
model name | gene | taxonomy group | #seqs | length | Rfam
SSU_rRNA_archaea | SSU rRNA | Archaea | 86 | 1477 | RF01959
SSU_rRNA_bacteria | SSU rRNA | Bacteria | 99 | 1533 | RF00177
SSU_rRNA_eukarya | SSU rRNA | Eukarya | 91 | 1851 | RF01960
SSU_rRNA_microsporidia | SSU rRNA | Euk-Microsporidia | 46 | 1312 | RF02542
LSU_rRNA_archaea | LSU rRNA | Archaea | 91 | 2990 | RF02540
LSU_rRNA_bacteria | LSU rRNA | Bacteria | 102 | 2925 | RF02541
LSU_rRNA_eukarya | LSU rRNA | Eukarya | 88 | 3401 | RF02543
SSU_rRNA_mitochondria_metazoa | SSU rRNA | Mito-Metazoa | 83 | 954 | -
SSU_rRNA_mitochondria_amoeba | SSU rRNA | Mito-Amoeba | 2 | 1861 | -
SSU_rRNA_mitochondria_chlorophyta | SSU rRNA | Mito-Chlorophyta | 2 | 1200 | -
SSU_rRNA_mitochondria_fungi | SSU rRNA | Mito-Fungi | 4 | 1603 | -
SSU_rRNA_mitochondria_kinetoplast | SSU rRNA | Mito-Kinetoplast | 3 | 624 | -
SSU_rRNA_mitochondria_plant | SSU rRNA | Mito-Plant | 4 | 1951 | -
SSU_rRNA_mitochondria_protist | SSU rRNA | Mito-Protist | 2 | 1677 | -
SSU_rRNA_chloroplast | SSU rRNA | Chloroplast | 94 | 1488 | -
SSU_rRNA_chloroplast_pilostyles | SSU rRNA | Chloroplast | 1 | 1531 | -
SSU_rRNA_cyanobacteria | SSU rRNA | Bac-Cyanobacteria | 49 | 1487 | -
SSU_rRNA_apicoplast | SSU rRNA | Euk-Apicoplast | 3 | 1463 | -

ribotyper
The ribotyper program is also designed to validate ribosomal RNA sequences but it differs from rRNA_sensor in the method of sequence comparison and the taxonomic breadth over which it applies. Instead of using blastn, ribotyper uses a profile HMM and optionally a covariance model (CM) to compare against input sequences.
The profile HMM and CMs were built either from Rfam rRNA seed alignments (see Table 3) or from alignments created specifically for Ribovore by the authors for taxonomic groups not covered by the Rfam models.

Sequence processing by ribotyper proceeds over two main stages. In stage 1, each sequence is compared against all profiles using a truncated version of the HMMER3 pipeline [37] optimized for speed. Only the first three stages of the HMMER3 pipeline are employed to compute a score for each sequence/profile comparison but without calculating accurate alignment endpoints. For each sequence, the best-scoring model is selected and used in the second stage where the HMMER3 pipeline is used again but this time in its entirety to compute likely endpoints of high-scoring hits to each model. These two stages are very similar to the classification and coverage determination stages of the VADR software package for viral sequence annotation [38]. The results of the stage 2 comparison are then post-processed to determine if any unexpected features exist for each sequence. There are 16 types of unexpected features, listed in Table 4.

ribosensor
The ribosensor program is a wrapper script that runs both ribotyper and rRNA_sensor and combines the results to determine if each sequence should pass or fail. This script was motivated partly by an effort to ease the transition for GenBank indexers between the pre-existing blastn-based system and a system based on profiles.

Table 4 Attributes of the 16 types of ribotyper unexpected features. Unexpected features labelled with * in the first column are fatal by default, in that they cause a sequence to fail. UnacceptableModel and QuestionableModel can only potentially be reported if the --inaccept option is used. EvalueScoreDiscrepancy can only be reported if the --evalues option is used. TooShort and TooLong can only be reported if the --shortfail or --longfail options are used, respectively.
unexpected feature name | description
NoHits* | no stage 1 hits above threshold to any models
UnacceptableModel* | best stage 1 hit is to a model that is unacceptable as defined in --inaccept input file
MultipleFamilies* | stage 1 hits exist to more than one family (e.g. SSU and LSU)
BothStrands* | stage 1 hits above threshold exist on both strands
DuplicateRegion* | at least two stage 1 or 2 hits on same strand overlap
InconsistentHits* | not all hits are in the same order in sequence and model coordinates
QuestionableModel* | best stage 1 hit is to a model that is questionable as defined in --inaccept input file
MinusStrand | best stage 1 hit is on the minus strand
LowScore | the bits per nucleotide value (total bit score divided by total length of sequence) is below threshold of 0.50
LowCoverage | sequence coverage of all hits is below threshold of 0.86
LowScoreDifference | difference between top two models in different domains is below 0.10 bits per position
VeryLowScoreDifference | difference between top two models in different domains is below 0.04 bits per position
MultipleHits | there is more than one hit to the best scoring model on the same strand
EvalueScoreDiscrepancy | if hits were sorted by E-value due to --evalues, best hit has lower bit score than second best hit
TooShort* | sequence length is less than the threshold given with --shortfail
TooLong* | sequence length is greater than the threshold given with --longfail
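Two of the features in Table 4, LowScore and LowCoverage, are simple ratios compared against fixed thresholds, which makes them easy to illustrate. The sketch below is a hypothetical Python rendering, not code from ribotyper (which is written in Perl). In particular, computing coverage as the union of hit intervals on the sequence is an assumption: the text only says "sequence coverage of all hits". The function names and the 1-based inclusive coordinate convention are also assumptions.

```python
from typing import Iterable, List, Tuple

LOW_SCORE_THRESHOLD = 0.50     # bits per nucleotide, default from Table 4
LOW_COVERAGE_THRESHOLD = 0.86  # fraction of sequence, default from Table 4


def bits_per_nucleotide(total_bit_score: float, seq_length: int) -> float:
    """Total bit score divided by total sequence length."""
    return total_bit_score / seq_length


def hit_coverage(hits: Iterable[Tuple[int, int]], seq_length: int) -> float:
    """Fraction of the sequence covered by the union of hit intervals.

    Each hit is a (start, end) pair in 1-based inclusive coordinates
    (assumed convention); overlapping positions are counted once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hits):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / seq_length


def score_and_coverage_features(
    total_bit_score: float, seq_length: int, hits: Iterable[Tuple[int, int]]
) -> List[str]:
    """Return the subset of ['LowScore', 'LowCoverage'] that applies."""
    features = []
    if bits_per_nucleotide(total_bit_score, seq_length) < LOW_SCORE_THRESHOLD:
        features.append("LowScore")
    if hit_coverage(hits, seq_length) < LOW_COVERAGE_THRESHOLD:
        features.append("LowCoverage")
    return features
```

For example, a 1000 nt sequence with a total bit score of 400 and a single hit spanning positions 1 to 800 has 0.40 bits per nucleotide and 0.80 coverage, so both features would be reported under the default thresholds. Neither feature is fatal by default according to Table 4.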
Additionally, in some cases, the profile models in ribotyper allow some valid rRNA sequences that would fail blastn and rRNA_sensor to pass, and conversely some valid sequences pass rRNA_sensor and fail ribotyper, making a combination of the two programs potentially more accurate.

The ribosensor program can be run in one of two modes: 16S mode is the default mode and should be used for bacterial and archaeal 16S SSU rRNA sequences, and 18S mode should be used (by specifying the option -m 18S on the command-line) for eukaryotic 18S SSU rRNA. All sequences are first processed by ribotyper using command-line options --scfail --covfail --tshortcov 0.80 --tshortlen 350 to fail sequences for which LowScore and LowCoverage unexpected features are reported, and to specify that the threshold for LowCoverage is 80% for sequences of 350 nt or less. These options were selected based on results of internal testing by GenBank indexers. Next, rRNA_sensor is run, potentially up to three separate times, on partitions of the input sequence file separated based on length and using custom thresholds for each length range. Sequences that are shorter than 100 nt or longer than 2000 nt are considered too short or too long and are not analyzed. For sequences between 100 and 350 nt, a minimum percent identity of 75% and minimum coverage of 80% is enforced. For sequences between 351 and 600 nt, the minimum thresholds used are 80% percent identity and 86% coverage, and for sequences between 601 and 2000 nt the minimum thresholds used are 86% percent identity and 86% coverage. These thresholds can be changed via command-line options.

The results of ribotyper and rRNA_sensor are combined and each sequence is separated into one of four outcome classes depending on whether it passed or failed each
program: RPSP (passed ribotyper and rRNA_sensor), RPSF (passed ribotyper and failed rRNA_sensor), RFSP (failed ribotyper and passed rRNA_sensor), and RFSF (failed both). Additionally, the reasons for failing each program are reported. For ribotyper, these are the unexpected features described above, each prefixed with “R_” (e.g. R_MultipleFamilies). The possible errors for rRNA_sensor are listed in Table 5 and the possible errors for ribotyper are listed in Table 6. Finally, these errors are mapped to a different set of errors created for use within the pre-existing context of GenBank’s sequence processing pipeline, which has its own error naming and usage conventions. This mapping is shown in Table 7. The “fails to” column is of practical importance because it indicates which errors cause a submission to not be accepted. More positively, if a submitter runs ribosensor before actually trying to submit and the submitter sees that the errors in the first seven rows and the third column of Table 7 do not occur, then, assuming the metadata for the submission are complete and valid, the submitter can have confidence that the submission to GenBank will be accepted.

Table 5 Descriptions of rRNA_sensor errors within ribosensor and mapping to the GenBank errors they trigger. '*': The first four rRNA_sensor errors do not trigger GenBank errors and are ignored by ribosensor if either (a) the sequence is 'RPSF' (passes ribotyper and fails rRNA_sensor) and the -c option is not used with ribosensor or (b) the sequence is 'RFSF' (fails both ribotyper and rRNA_sensor) and R_UnacceptableModel or R_QuestionableModel ribotyper errors are also reported.
rRNA_sensor error | associated GenBank error | cause/explanation
S_NoHits* | SEQ_HOM_NotSSUOrLSUrRNA | no hits reported ('no' column 2)
S_NoSimilarity* | SEQ_HOM_LowSimilarity | coverage (column 5) of best blast hit is < 10%
S_LowSimilarity* | SEQ_HOM_LowSimilarity | coverage (column 5) of best blast hit is < 80% (≤ 350nt) or 86% (> 350nt)
S_LowScore* | SEQ_HOM_LowSimilarity | either id percentage below length-dependent threshold (75%,80%,86%) or E-value above 1e-40 ('imperfect match' column 2)
S_BothStrands | SEQ_HOM_MisAsBothStrands | hits on both strands ('mixed' column 2)
S_MultipleHits | SEQ_HOM_MultipleHits | more than 1 hit reported (column 4 > 1)

Table 6 Descriptions of ribotyper errors within ribosensor and mapping to the GenBank errors they trigger. '+': these errors do not trigger a GenBank error if sequence is 'RFSP' (fails ribotyper and passes rRNA_sensor).
ribotyper error | associated GenBank error | cause/explanation
R_NoHits | SEQ_HOM_NotSSUOrLSUrRNA | no hits reported
R_MultipleFamilies | SEQ_HOM_SSUAndLSUrRNA | SSU and LSU hits
R_LowScore | SEQ_HOM_LowSimilarity | bits/position score is < 0.5
R_BothStrands | SEQ_HOM_MisAsBothStrands | hits on both strands
R_InconsistentHits | SEQ_HOM_MisAsHitOrder | hits are in different order in sequence and model
R_DuplicateRegion | SEQ_HOM_MisAsDupRegion | hits overlap by 10 or more model positions
R_UnacceptableModel | SEQ_HOM_TaxNotExpectedSSUrRNA | best hit is to model other than expected set; 16S expected set: SSU.Archaea, SSU.Bacteria, SSU.Cyanobacteria, SSU.Chloroplast; 18S expected set: SSU.Eukarya
R_LowCoverage | SEQ_HOM_LowCoverage | coverage of all hits is < 0.80 (if ≤ 350nt) or 0.86 (if > 350nt)
R_QuestionableModel+ | SEQ_HOM_TaxQuestionableSSUrRNA | best hit is to a 'questionable' model (if mode is 16S: SSU.Chloroplast)
R_MultipleHits+ | SEQ_HOM_MultipleHits | more than 1 hit reported

riboaligner
The riboaligner program was designed to help GenBank indexers to evaluate whether ribosomal RNA sequences are full length and do not extend past the boundaries of the
gene. One application for a set of full length rRNAs is as part of the blastn database for screening and validating incoming sequences using rRNA_sensor, ribosensor or other blastn-based methods.

Table 7 Mapping of GenBank errors to the rRNA_sensor and ribotyper errors that trigger them. There are two classes of exceptions marked by two different superscripts in the table: '*': these rRNA_sensor errors do not trigger a GenBank error if: (a) the sequence is 'RPSF' (passes ribotyper and fails rRNA_sensor) and the -c option is not used with ribosensor, or (b) the sequence is 'RFSF' (fails both ribotyper and rRNA_sensor) and R_UnacceptableModel or R_QuestionableModel are also reported. '+': these ribotyper errors do not trigger a GenBank error if sequence is 'RFSP' (fails ribotyper and passes rRNA_sensor).
GenBank error | fails to | triggering rRNA_sensor/ribotyper errors
SEQ_HOM_NotSSUOrLSUrRNA | submitter | S_NoHits*, R_NoHits
SEQ_HOM_LowSimilarity | submitter | S_NoSimilarity*, S_LowSimilarity*, S_LowScore*, R_LowScore
SEQ_HOM_SSUAndLSUrRNA | submitter | R_MultipleFamilies
SEQ_HOM_MisAsBothStrands | submitter | S_BothStrands, R_BothStrands
SEQ_HOM_MisAsHitOrder | submitter | R_InconsistentHits
SEQ_HOM_MisAsDupRegion | submitter | R_DuplicateRegion
SEQ_HOM_TaxNotExpectedSSUrRNA | submitter | R_UnacceptableModel
SEQ_HOM_TaxQuestionableSSUrRNA | indexer | R_QuestionableModel+
SEQ_HOM_LowCoverage | indexer | R_LowCoverage
SEQ_HOM_MultipleHits | indexer | S_MultipleHits, R_MultipleHits+
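The Table 7 mapping, together with its two classes of exceptions, amounts to a lookup plus two conditional filters. The following Python sketch shows one way to express it. It is an illustration only and not the ribosensor code; the function names and the `c_option` parameter (standing in for the -c command-line option) are hypothetical, and the error names are written with underscores on the assumption that the spaces in the extracted names stand for underscores.

```python
from typing import Dict, Iterable

# rRNA_sensor / ribotyper error -> GenBank error (from Tables 5-7)
TRIGGERS: Dict[str, str] = {
    "S_NoHits": "SEQ_HOM_NotSSUOrLSUrRNA",
    "R_NoHits": "SEQ_HOM_NotSSUOrLSUrRNA",
    "S_NoSimilarity": "SEQ_HOM_LowSimilarity",
    "S_LowSimilarity": "SEQ_HOM_LowSimilarity",
    "S_LowScore": "SEQ_HOM_LowSimilarity",
    "R_LowScore": "SEQ_HOM_LowSimilarity",
    "R_MultipleFamilies": "SEQ_HOM_SSUAndLSUrRNA",
    "S_BothStrands": "SEQ_HOM_MisAsBothStrands",
    "R_BothStrands": "SEQ_HOM_MisAsBothStrands",
    "R_InconsistentHits": "SEQ_HOM_MisAsHitOrder",
    "R_DuplicateRegion": "SEQ_HOM_MisAsDupRegion",
    "R_UnacceptableModel": "SEQ_HOM_TaxNotExpectedSSUrRNA",
    "R_QuestionableModel": "SEQ_HOM_TaxQuestionableSSUrRNA",
    "R_LowCoverage": "SEQ_HOM_LowCoverage",
    "S_MultipleHits": "SEQ_HOM_MultipleHits",
    "R_MultipleHits": "SEQ_HOM_MultipleHits",
}

# GenBank errors that go to an indexer; all others fail to the submitter.
INDEXER_ERRORS = {
    "SEQ_HOM_TaxQuestionableSSUrRNA",
    "SEQ_HOM_LowCoverage",
    "SEQ_HOM_MultipleHits",
}

STARRED = {"S_NoHits", "S_NoSimilarity", "S_LowSimilarity", "S_LowScore"}
PLUSSED = {"R_QuestionableModel", "R_MultipleHits"}


def outcome_class(ribotyper_passed: bool, sensor_passed: bool) -> str:
    """Return RPSP, RPSF, RFSP or RFSF."""
    return ("RP" if ribotyper_passed else "RF") + ("SP" if sensor_passed else "SF")


def genbank_errors(
    errors: Iterable[str],
    ribotyper_passed: bool,
    sensor_passed: bool,
    c_option: bool = False,
) -> Dict[str, str]:
    """Map reported errors to {GenBank error: 'submitter' or 'indexer'}."""
    errors = list(errors)
    cls = outcome_class(ribotyper_passed, sensor_passed)
    model_flagged = bool(
        set(errors) & {"R_UnacceptableModel", "R_QuestionableModel"}
    )
    # '*' exception: cases (a) and (b) from the Table 7 caption.
    ignore_starred = (cls == "RPSF" and not c_option) or (
        cls == "RFSF" and model_flagged
    )
    # '+' exception: sequence fails ribotyper but passes rRNA_sensor.
    ignore_plussed = cls == "RFSP"
    result: Dict[str, str] = {}
    for err in errors:
        if (err in STARRED and ignore_starred) or (
            err in PLUSSED and ignore_plussed
        ):
            continue
        gb = TRIGGERS[err]
        result[gb] = "indexer" if gb in INDEXER_ERRORS else "submitter"
    return result
```

Under this sketch, a submitter checking a sequence before submission would look for any value equal to "submitter" in the returned dictionary, which corresponds to the first seven rows of Table 7.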
riboaligner first calls ribotyper to determine the best matching model for each sequence using special command-line options. The --minusfail, --scfail and --covfail options are used to specify that sequences with unexpected features of MinusStrand, LowScore and LowCoverage will fail. Additionally, the --inaccept option is used to specify that the names of the desired models are listed in a user-supplied file; only sequences that match best to one of these models are eligible to pass. The default set of acceptable models is SSU.Archaea and SSU.Bacteria. All sequences that score best to one of the acceptable models are aligned to that model using the cmalign program of Infernal, which takes into account both sequence and secondary structure conservation. The alignment is then parsed to determine the length classification of each sequence. There are 13 possible length classes, which are defined based on whether the alignment of each sequence extends to or past the first and final model reference position, as well as how many insertions and deletions occur in the first and final ten model reference positions. More information on these classes can be found in the Ribovore documentation.

Only sequences that pass ribotyper will be aligned by riboaligner, and the per-sequence ribotyper pass/fail designation is not changed by riboaligner. The riboaligner summary output file is identical to the ribotyper output summary file, with additional per-sequence information on the length class, the start and stop model reference position of each aligned sequence, and the number of insertions/deletions in the first and final ten model positions.

ribodbmaker
The ribodbmaker program is designed to create high quality datasets of rRNA sequences, which may be useful as reference datasets or blastn databases. It takes as input a set of candidate sequences and a specified rRNA model (e.g.
SSU.Bacteria) and applies numerous quality control tests or filters such that only high quality sequences pass. The program performs the following steps:
1. fail sequences with too many ambiguous nucleotides
2. fail sequences that do not have a specified species taxid in the NCBI taxonomy database
3. fail sequences that have non-weak vecscreen hits, suggesting the presence of vector contamination, as calculated by the VecScreen plus taxonomy software package [39]
4. fail sequences that have unexpected internal repeats as determined by comparing each sequence against itself using blastn and finding off-diagonal local alignments with an E-value of no more than 1 and length at least 20 for the plus strand and 50 for the minus strand
5. fail sequences that fail ribotyper, including matching best to a model other than the specified one, using non-default options --minusfail --lowppossc 0.5 --scfail to specify that sequences with best hits on the minus strand or with scores below 0.5 bits per nucleotide will fail
6. fail sequences that fail riboaligner, including matching best to a model other than the specified one, using non-default options --lowppossc 0.5 --tcov 0.99 to specify that sequences with scores below 0.5 bits per nucleotide or for which less than 99% of the sequence length is covered by hits will fail
7. fail sequences that do not cover a specified span of model positions (are too short)
8. fail sequences that survive all above steps but do not meet expected criteria of an ingroup analysis based on taxonomy and alignment identity

In step 6, riboaligner
outputs multiple sequence alignments of all sequences. These alignments are used for further scrutiny of each sequence in step 8, the ingroup analysis step. At this stage, sequences that do not cluster (based on alignment identity) with other sequences in their taxonomic group fail. Finally, sequences that survive all stages are clustered based on alignment identity, and centroids for each cluster are selected for the final set of surviving sequences.

Steps 2, 3, and 8 require access to the NCBI taxonomy database and further require that each input sequence be assigned in the nucleotide database to a unique organism in the taxonomy database. This restricts the use of ribodbmaker to sequences already present in GenBank. The taxonomy criterion excludes, for example, some chimeric sequences that have been engineered and patented. Users can run ribodbmaker on other sequences, but must bypass these steps using the --skipftaxid, --skipfvecsc, and --skipingrup options. The VecScreen plus taxonomy package is only available for Linux and so is not installed with Ribovore on Mac/OSX. Consequently, the following ribodbmaker options must be used on Mac/OSX: --skipftaxid --skipfvecsc --skipingrup --skipmstbl. In general, ribodbmaker is highly customizable via command-line options and can be run using many different subsets of tests. For more information on command-line options, see the ribodbmaker.md file in the Ribovore documentation subdirectory.

As described above, riboaligner calls ribotyper, so ribotyper is actually called twice by ribodbmaker, once in step 5 and once in step 6. In the riboaligner step, ribotyper is called with options that differentiate its usage from step 5, making the criteria for passing more strict in several ways. The --difffail and --multfail options are used to specify that sequences with unexpected features of LowScoreDifference and VeryLowScoreDifference will fail. Additionally, a CM is used instead of a profile HMM for the second stage (ribotyper --2slow option), and any sequence for which less than 99% of the nucleotides are covered by a hit in the second stage will fail (--tcov 0.99 option). Finally, the --scfail option, which is used in the ribotyper call in step 5, is not used in step 6.

Ribovore reference model library and blastn databases
The Ribovore package includes 18 sequence- and structure-based alignments and corresponding CMs, listed in Table 3. Seven of the 18 alignments are from Rfam, and the other 11 were created during development of the package. rRNA sensor includes two blastn databases: one of 1267 bacterial and archaeal 16S SSU rRNA sequences created by clustering and filtering the blastn database already in use at GenBank in 2017 when development of the script began, and one of 1091 eukaryotic 18S SSU rRNA sequences created by filtering a sequence dataset generated by ribodbmaker.

All 18 of the Ribovore model alignments are the end products of a multi-step model refinement procedure using the valuable secondary structure data available from CRW [21] and sequences from GenBank. For each gene and taxonomic group (e.g. SSU rRNA eukarya), an initial alignment with consensus secondary structure was created based on combining alignments and individual sequence secondary structure predictions from CRW as described in [19, 40], and used to build a CM using the Infernal program cmbuild. That CM was then calibrated for database sequence search using cmcalibrate and searched against all currently available rRNA sequences in GenBank.
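As a rough illustration of the build, calibrate and search cycle just described, the following Python sketch assembles the corresponding Infernal command lines without running them. The file names are hypothetical, and the use of cmsearch with --tblout for the search step is an assumption; the text names only cmbuild and cmcalibrate.

```python
import shlex


def refinement_commands(alignment, cm_file, seq_db, hits_table):
    """Command lines for one build/calibrate/search iteration.

    All file names are placeholders supplied by the caller. cmsearch is
    assumed here as the search program; the paper does not name it.
    """
    return [
        ["cmbuild", cm_file, alignment],        # build a CM from the alignment
        ["cmcalibrate", cm_file],               # calibrate E-value parameters
        ["cmsearch", "--tblout", hits_table, cm_file, seq_db],  # search
    ]


# Print the commands for inspection; they could be passed to subprocess.run.
for cmd in refinement_commands("ssu_euk.sto", "ssu_euk.cm",
                               "genbank_rrna.fa", "hits.tbl"):
    print(shlex.join(cmd))
```

Each iteration of the refinement would feed the filtered, realigned hits back in as the next alignment.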
The resulting high-scoring hits were then filtered for redundancy and manually examined, and surviving sequences were realigned to the model to create a new alignment. In some cases the consensus secondary structure was modified slightly based on the new alignment. Some models were further refined by additional iterations of building, searching, and realigning.

Eight of the 18 Ribovore models are SSU models with fewer than 10 sequences in the training alignment (Table 3). These are for taxonomic groups with relatively few known example sequences for which the consensus secondary structure is distinct but not as well understood as for other groups, like 16S SSU rRNA. Six of these eight are non-metazoan mitochondrial models, one is a chloroplast model for the Pilostyles plant genus, and one is for apicoplasts. These eight models are less mature than the other ten models, but they are included in the package for completeness and we plan to improve them in future versions. Currently, users should be cautious when interpreting results that involve any of these eight models.

From each of the 18 Ribovore model alignments, two separate CMs were constructed using different command-line parameters to the cmbuild program of Infernal. One model was built using cmbuild’s default entropy weighting feature that controls the average entropy per model position [13, 19], and one was built using the cmbuild --enone option, which turns off entropy weighting. The non-entropy weighted models, which perform better at sequence classification in our internal testing (results not shown), are used by ribotyper, and the entropy weighted models are used by riboaligner for sequence alignment because they are slightly more accurate at getting alignment endpoints correct based on our own internal test results (not shown).

Timing measurements
Timing measurements of rRNA sensor, ribotyper, riboaligner, and ribodbmaker were done primarily on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz with 48 cores and running the CentOS7.8.2003 version of Linux. We used one thread except for tests of rRNA sensor that measured the effect on wall-clock time of increasing the number of threads. For the runs of ribodbmaker, we used the NCBI compute farm to parallelize some time intensive steps. The reason for using the compute farm only for the ribodbmaker tests is that ribodbmaker is intended primarily for curation of databases at NCBI, while the other modules are intended to be used both by submitters around the world and GenBank indexers at NCBI.

Results and discussion
Ribovore is used directly or indirectly by NCBI and GenBank in various ways: as part of its submission pipelines for rRNA sequences, through the BLAST Web server (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and by facilitating the validation of sequences from type material to be incorporated into new records in the RefSeq database (https://www.ncbi.nlm.nih.gov/bioproject/224725). We detail each of these uses below and then compare the capability of Ribovore for fungal rRNA sequence validation to related projects.

rRNA sequence submission checking
Submitters of rRNA sequences to GenBank who use the NCBI Submission Portal can choose between 12 different subtypes, listed in Table 8. For most submission subtypes, the sequences are analyzed via a blastn-based pipeline by comparing each submitted sequence against a blastn database for the specific submission subtype.
Three of these submission subtypes (ITS1, ITS2, and 16S-23S IGS) are for non-rRNA sequences. For four of the remaining nine subtypes, the blastn database currently used was created with the help of ribodbmaker, as discussed more below.

Table 8 NCBI rRNA and ITS sequence submission types and attributes. The ’submission type’ column indicates the three possible rRNA sequence related options for GenBank submissions available at https://submit.ncbi.nlm.nih.gov/subs/genbank/. There is not an intergenic spacer type for eukaryotes at this time. The ’submission subtype’ column indicates the more specific possible sequence types a submitter can select after choosing one of the options in the first column. The ’validation method’ column indicates whether incoming sequences are processed with ribosensor, blastn using a database constructed with ribodbmaker (’blastn(ribo)’) or blastn using either the general non-redundant (nr) database or a database constructed by other means (’blastn’). The ’percentage of accepted submissions’ and ’percentage of accepted sequences’ reflect rRNA/IGS/ITS submissions published between Jan 1, 2020 and May 31, 2020. Note that the percentages for rows 3 and 4 are summed and reported in row 3, and for rows 5 through 12 (all eukaryotic submission types) are summed and reported in row 5. Counts pertain only to submissions that advanced through enough preliminary checks to be assigned an internal submission code.

submission type | submission subtype | validation method | percentage of accepted submissions | percentage of accepted sequences
Prokaryotic rRNA/IGS | SSU rRNA only (16S), ≥ 2500 seqs | ribosensor | 0.648% | 99.604%
Prokaryotic rRNA/IGS | SSU rRNA only (16S), < 2500 seqs | blastn(ribo) | 43.934% | 0.1252%
Prokaryotic rRNA/IGS | LSU rRNA only (23S) | blastn(ribo) | 0.754% (sum of 2 rows) | 0.0027% (sum of 2 rows)
Prokaryotic rRNA/IGS | intergenic spacer (16S-23S IGS) | blastn | (reported in previous row) | (reported in previous row)
Eukaryotic Nuclear rRNA/ITS | contains rRNA-ITS region | blastn | 54.663% (sum of 8 eukaryotic rows) | 0.2682% (sum of 8 eukaryotic rows)
Eukaryotic Nuclear rRNA/ITS | SSU rRNA only (18S) | blastn(ribo) | |
Eukaryotic Nuclear rRNA/ITS | LSU rRNA only (28S) | blastn(ribo) | |
Eukaryotic Nuclear rRNA/ITS | ITS1 only | blastn | |
Eukaryotic Nuclear rRNA/ITS | ITS2 only | blastn | |
Eukaryotic Organellar rRNA | mitochondrial SSU rRNA (12S) | blastn | |
Eukaryotic Organellar rRNA | mitochondrial LSU rRNA (16S) | blastn | |
Eukaryotic Organellar rRNA | chloroplast SSU rRNA (16S) | blastn | |
Eukaryotic Organellar rRNA | chloroplast LSU rRNA (23S) | blastn | |

The ribosensor program is used instead of the blastn pipeline to analyze 16S prokaryotic SSU rRNA submissions of 2500 or more sequences for which the submitter chooses the attribute uncultured to describe the sequences. For ribosensor, the default parameters are used to determine if sequences should pass or fail, as discussed in the Implementation section. For blastn, sequences are evaluated based on the average percentage identity, the average percentage query coverage, and the percentage of gaps in the alignments for the top target sequences. Additionally, using a blastn-based method that predates and inspired rRNA sensor, sequences that are suspected to be misassembled or incorrectly labelled taxonomically fail. Specifically, the query sequence is tested with blastn against the 16S SSU rRNA database described below and the matches are ranked in increasing order of E-value. A sequence passes the misassembly test if and only if the best matches each have exactly one local alignment. The taxonomy tests are based on a comparison of the proposed taxonomy from the submitter and the taxonomic information of the top matches, taking into account variant spellings and synonyms in NCBI Taxonomy.
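The routing rule and the misassembly test described above can be sketched in a few lines of Python. This is a minimal illustration rather than the GenBank pipeline: the function names, the subtype label string, and the representation of blastn output as a list of per-match alignment counts are all hypothetical, and how many top matches count as "best" is left to the caller.

```python
def validation_method(subtype, n_sequences, uncultured):
    """Choose the pipeline for a submission (sketch of the rule in the text).

    The subtype label is a hypothetical string chosen for this example.
    """
    if (subtype == "prokaryotic SSU rRNA (16S)"
            and n_sequences >= 2500 and uncultured):
        return "ribosensor"
    return "blastn"


def passes_misassembly_test(alignments_per_best_match):
    """True if every one of the best matches has exactly one local alignment.

    The argument lists the number of local alignments for each of the best
    matches, ranked by increasing E-value. A query with no matches is
    treated as not passing, which is an assumption of this sketch.
    """
    counts = list(alignments_per_best_match)
    return bool(counts) and all(n == 1 for n in counts)


print(validation_method("prokaryotic SSU rRNA (16S)", 3000, uncultured=True))
print(passes_misassembly_test([1, 1, 2]))
```

A second local alignment to the same target is the signal of interest here, since it suggests that parts of the query align to the target in separate pieces.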
The exact thresholds for these pre-Ribovore, blastn-related comparisons vary according to the type of submission and are outside the scope of this paper.

For both the blastn and ribosensor pipelines, submissions in which all sequences pass and which have the required metadata are automatically deposited into GenBank. All other submissions fail and are either sent back to the submitter with automated error reports or manually examined further by GenBank indexers, depending on the specific reason for the failure. A key objective of distributing the Ribovore software is to permit submitters to perform on their own computers checks similar to those done by the GenBank submission pipeline.

In 2018, we began using earlier versions of ribosensor to analyze large-scale 16S prokaryotic SSU rRNA submissions. This remains the only submission type for which ribosensor is employed in an automated way, although we plan to expand to additional genes and taxonomic domains in the future. Parts of Ribovore are also used manually by GenBank indexers to evaluate some submissions.

The most common type of rRNA submission by far is 16S prokaryotic SSU rRNA (Table 8). Between July 1, 2018 and May 31, 2020, 33,388 submissions of 16S SSU rRNAs with fewer than 2500 sequences (or for which the submitter indicated the sequences were from cultured organisms) were handled by the blastn pipeline. The total number of sequences in these submissions was 240,112, for an average of 7.2 sequences per submission. In the same time interval, ribosensor processed 242 16S SSU rRNA submissions comprising 49,868,017 sequences, for an average of 206,066.2 sequences per submission. In the first six months of 2020, ribosensor processed more than 99.6% of the sequences deposited in GenBank via any of the rRNA or ITS submission pipelines (Table 8).

Construction and usage of rRNA databases for blastn
Four of the ten rRNA blastn databases used for submission checking were created by ribodbmaker, as indicated in Table 8. An additional three blastn databases, available in the web server version of BLAST, are described below. All blastn databases we mention can be retrieved for local use from the directory https://ftp.ncbi.nlm.nih.gov/blast/db/. The databases are (re)generated semi-automatically by extracting large sets of plausible sequences from Entrez and providing them as input to ribodbmaker. The ribodbmaker program is run so that the tests for ambiguous nucleotides, specified species, vector contamination, self-repeats, ribotyper, riboaligner, and model span are all executed. However, the databases are allowed to contain more than one sequence per taxid and the ingroup analysis is skipped (--skipingrup option). Only sequences that pass ribodbmaker tests are eligible to be in the blastn databases.

To keep the 16S prokaryotic SSU, eukaryotic SSU, and eukaryotic LSU databases to a reasonable size, the sequences that pass ribodbmaker are clustered with UCLUST [36] at a threshold of 97% identity and all other parameters at default values. The clustering stage of ribodbmaker is not used for this purpose and is skipped by using the --skipclustr option.

The 16S prokaryotic SSU rRNA BLAST database is generated starting from all sequences in the GenBank nucleotide database that match NCBI BioProject IDs PRJNA33175 or PRJNA33317, using the eutils query in the first row of Table 9. PRJNA33175 has the title “Bacterial 16S Ribosomal RNA RefSeq Targeted Loci Project”. PRJNA33317 has the title “Archaeal 16S Ribosomal RNA RefSeq Targeted Loci Project”.
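For readers who want to see what such a query looks like in practice, the sketch below builds an E-utilities ESearch URL for the BioProject query. It is an illustration only: the paper uses the command-line eutils, whereas this example targets the HTTP ESearch endpoint, and the helper name `esearch_url`, the choice of the nuccore database and the retmax value are assumptions. No request is sent.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL for an Entrez query string (no request is made)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Query from the first row of Table 9.
query = "PRJNA33175[BioProject] OR PRJNA33317[BioProject]"
print(esearch_url(query))
```

The same helper accepts any of the longer Table 9 queries as `term`, since urlencode takes care of escaping brackets, quotes and spaces.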
This formal query is supplemented by manual searches of the International Journal of Systematic and Evolutionary Microbiology (https://www.microbiologyresearch.org/content/journal/ijsem), where many new bacterial species are announced and peer-reviewed, along with their 16S SSU rRNA sequences. Among the databases described here, the 16S SSU rRNA database is the only one restricted to sequences from “type material” that have been more stringently vetted before curation for RefSeq. The fungal RefSeq records described in a later subsection are also restricted to be from “type material”.

Table 9 also lists the eutils queries to the nucleotide database that are used for 23S prokaryotic LSU rRNA, eukaryotic SSU rRNA, eukaryotic LSU rRNA, Microsporidia SSU rRNA, and Microsporidia LSU rRNA. When we seek sequences that are likely to be complete, not larger genome pieces, and not partial, we add a constraint on the length with an extra term such as 1500:18000[slen] for eukaryotic SSU rRNA. The main attribute that distinguishes Microsporidia is that the lower bound on slen for complete sequences is set about 300-500 nucleotides lower, as explained in Background. To find possibly partial LSU sequences that are long enough to cover the variable regions, we add the condition 425:1000[slen].

These queries rely on standardized nomenclature and structure of the definition line of GenBank sequence records, which contain information about the source organism, feature content, completeness and location. Since 2016, these definition lines have been constructed formulaically during the processing of submissions. For example, the sequence MT981756.1 has the title: “Staphylococcus epidermidis strain RA13 16S ribosomal RNA gene, partial sequence”, and the sequence MN158348.1 has the title “Tetrahymena rostrata strain TRAUS 18S ribosomal RNA gene, internal transcribed spacer 1, 5.8S ribosomal RNA gene, internal transcribed spacer 2, and 28S ribosomal RNA gene, complete sequence”.

Table 9 Queries used in command-line eutils to collect input datasets for ribodbmaker.

gene | eutils query
archaeal and bacterial SSU rRNA | PRJNA33175[BioProject] OR PRJNA33317[BioProject]
bacterial LSU rRNA | bacteria[orgn] AND (23S [ti] OR large subunit ribosomal RNA [ti]) NOT uncultured[orgn] NOT 23S rRNA methyltransferase[ti] NOT srcdb pdb [prop] NOT srcdb PAT [prop] NOT WGS [filter] NOT mRNA [filter] [ti] NOT RefSeq [filter] NOT mRNA NOT “mitochondrion”[Filter] NOT TLS [ti]
archaeal LSU rRNA | archaea[orgn] AND (23S [ti] OR large subunit ribosomal RNA [ti]) NOT uncultured[orgn] NOT 23S rRNA methyltransferase[ti] NOT srcdb pdb [prop] NOT srcdb PAT [prop] NOT WGS [filter] NOT mRNA [filter] [ti] NOT RefSeq [filter] NOT mRNA NOT “mitochondrion”[Filter] NOT TLS [ti]
eukaryotic SSU rRNA | eukaryota[orgn] (18S [ti] OR small subunit ribosomal RNA [ti]) NOT WGS [filter] NOT mRNA [filter] NOT “mitochondrion”[Filter] NOT plastid [filter] NOT chloroplast [filter] NOT plastid [ti] NOT chloroplast [ti] NOT mitochondrial [ti] NOT RefSeq [filter] NOT (5.8S [ti] OR internal [ti]) NOT 28S [ti] NOT WGS NOT mRNA NOT “mitochondrion”[Filter] NOT TLS [ti] NOT srcdb pdb[prop]
eukaryotic LSU rRNA | Eukaryota[orgn] AND (25S [ti] OR 26S [ti] OR 28S [ti] OR large subunit ribosomal RNA [ti]) NOT WGS [filter] NOT mRNA [filter] NOT “mitochondrion” [Filter] NOT plastid [filter] NOT chloroplast [filter] NOT plastid [ti] NOT chloroplast [ti] NOT mitochondrial [ti] NOT RefSeq [filter] NOT (5.8S [ti] OR internal [ti]) NOT TLS [ti] NOT partial cds [ti] NOT Chain [ti] NOT 18S [ti] NOT srcdb pdb[prop]
microsporidia SSU rRNA | microsporidia[orgn] AND (18S [ti] OR small subunit ribosomal RNA [ti]) NOT WGS [filter] NOT mRNA [filter] NOT “mitochondrion”[Filter] NOT plastid [filter] NOT chloroplast [filter] NOT plastid [ti] NOT chloroplast [ti] NOT mitochondrial [ti] NOT RefSeq [filter] NOT (5.8S [ti] OR internal [ti]) NOT 28S [ti] NOT WGS NOT mRNA NOT “mitochondrion”[Filter] NOT TLS [ti] NOT srcdb pdb[prop]
microsporidia LSU rRNA | microsporidia[orgn] AND (25S [ti] OR 26S [ti] OR 28S [ti] OR large subunit ribosomal RNA [ti]) NOT WGS [filter] NOT mRNA [filter] NOT “mitochondrion” [Filter] NOT plastid [filter] NOT chloroplast [filter] NOT plastid [ti] NOT chloroplast [ti] NOT mitochondrial [ti] NOT RefSeq [filter] NOT (5.8S [ti] OR internal [ti]) NOT TLS [ti] NOT partial cds [ti] NOT Chain [ti] NOT 18S [ti] NOT srcdb pdb[prop]
fungal SSU rRNA RefSeq records | fungi [orgn] AND 225:10000 [slen] AND sequence from type [filter] AND (18S [ti] OR small subunit ribosomal RNA [ti]) NOT WGS [filter] NOT mRNA [filter] NOT mitochondrion [Filter] NOT plastid [filter] NOT chloroplast [filter] NOT mitochondrial [ti] NOT (5.8S [ti] OR internal [ti]) NOT 28S [ti] NOT 26S [ti] NOT 25S [ti] NOT 5S [ti] NOT 23S [ti] NOT 37S [ti] NOT WGS NOT mRNA NOT RefSeq [filter] NOT TLS [ti]
fungal LSU rRNA RefSeq records | fungi[orgn] AND 425:10000[slen] AND sequence from type[filter] AND (25S[ti] OR 26S[ti] OR 28S[ti] OR large subunit ribosomal RNA[ti]) NOT WGS[filter] NOT mRNA[filter] NOT “mitochondrion”[Filter] NOT plastid[filter] NOT chloroplast[filter] NOT mitochondrial[ti] NOT (5.8S[ti] OR internal[ti]) NOT TLS[ti] NOT partial cds[ti] NOT Chain[ti] NOT 18S[ti] NOT srcdb pdb[prop] NOT RefSeq[filter]

Creation of fungal rRNA RefSeq entries using ribodbmaker
NCBI’s RefSeq project seeks to create a representative, non-redundant set of annotated genomes, transcripts, proteins and nucleotide records including rRNA sequences [35].
Since 2018, ribodbmaker has been used to screen the set of fungal 18S SSU rRNA and 26S LSU rRNA sequences. Table 9 lists the queries used to identify candidates to be new fungal SSU and LSU rRNA RefSeq records.

Table 10 Web blastn usage of specialized rRNA databases that are curated using Ribovore to decide which sequences are valid. Usage was measured during November 12, 2019 - June 22, 2020.

database | runs | runs/day | visits | visits/day
16S rRNA sequences (SSU) from Bacteria and Archaea | 355,479 | 1921.5 | 92,638 | 500.7
18S rRNA sequences (SSU) from Fungi and reference type material | 5,724 | 30.9 | 2,876 | 15.5
28S rRNA sequences (LSU) from Fungi and reference type material | 4,535 | 24.5 | 1,861 | 10.1

Studies that target fungal rRNAs frequently attempt to obtain SSU rRNA sequences that span most of the V4 and part of the V5 variable regions, or LSU rRNA sequences that span the D1 and D2 variable regions, as these have been shown to be phylogenetically informative [41, 42]. These regions correspond to Rfam RF01960 model positions 604 to 1070 (SSU) and RF02543 model positions 124 to 627 (LSU). Correspondingly, ribodbmaker is run with command-line options (--fmlpos and --fmrpos) that enforce that only sequences that span these model coordinates can pass. The following ribodbmaker options are used for SSU: --fione --fmnogap --fmlpos 604 --fmrpos 1070 -f --model SSU.Eukarya --skipclustr, and for LSU: --fione --fmnogap --fmlpos 124 --fmrpos 627 -f --model LSU.Eukarya --skipclustr.
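The span requirement expressed by --fmlpos and --fmrpos amounts to a simple interval containment test, sketched below in Python. This is an illustration of the criterion as described in the text, not the ribodbmaker code: the function name and the representation of an aligned sequence by its first and last model positions are assumptions, and the effect of --fmnogap is not modelled.

```python
# Required model spans quoted in the text: (left position, right position).
REQUIRED_SPAN = {
    "SSU.Eukarya": (604, 1070),   # most of V4 and part of V5 (Rfam RF01960)
    "LSU.Eukarya": (124, 627),    # D1 and D2 regions (Rfam RF02543)
}


def spans_required_region(model, aln_start, aln_stop):
    """True if an alignment covers the required model span.

    aln_start and aln_stop are the first and last model reference positions
    covered by the aligned sequence (hypothetical inputs for this sketch).
    """
    left, right = REQUIRED_SPAN[model]
    return aln_start <= left and aln_stop >= right


print(spans_required_region("SSU.Eukarya", 550, 1200))
print(spans_required_region("LSU.Eukarya", 130, 900))
```

A sequence that starts even a few positions inside the left boundary fails, which is why partial amplicons targeting only one variable region do not qualify.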
NCBI BLAST webpage rRNA target databases
For many years, NCBI has been offering searches of nucleotide and protein databases with various modules of BLAST [43] through the NCBI BLAST webpage. Most commonly, searches of nucleotide queries use a comprehensive “nonredundant (nr)” database of nucleotide sequences or databases of whole genomes. A disproportionate number of queries are rRNA sequences. When blastn users know that their queries are of these special types, searching smaller targeted databases that exclude sequences unlikely to have a significant match to the query reduces running time and leads to more focused results.

The BLAST webpage now allows users to select from three ribodbmaker-derived rRNA target databases, listed in Table 10. The 16S SSU rRNA database is identical to the one used by the blastn submission pipeline. The other two are specific to Fungi, due to the popularity of the analysis of rRNA sequences for studies of that kingdom. The fungal SSU and fungal LSU BLAST databases are effectively equivalent to the sets of curated RefSeq records described below.

The availability of these databases was announced in late 2019, and the number of blastn runs and unique blastn visitors who selected each database during the seven-month period November 12, 2019 - June 22, 2020 are reported in Table 10. The usage suggests that there is sufficient user demand to justify the curation effort. As of November 24, 2020, there are 2,836 fungal SSU RefSeq records and 7,573 fungal LSU RefSeq records, almost all of which were curated with Ribovore.

Comparison to curated sets of fungal rRNA sequences from Silva
We iteratively compared our ribodbmaker approach to curating fungal RefSeq records with other curatorial efforts that are part of the Silva project [26].
The purposes of this comparison were:
1. to identify new candidate RefSeq records and possible weaknesses in our procedures for choosing RefSeq records,
2. to test whether Ribovore works on curated data sets and to correct errors associated with some sequences in NCBI databases, such as misleading definition lines,
3. to characterize what proportion of sequences curated by others pass the Ribovore criteria and why sequences fail.

As noted above, fungal LSU sequences submitted since 2016 or 2017 should have definition lines that match this query. In the course of doing the SilvaParc LSU tests described below, we corrected the definition lines of 132 older sequences that are fungal LSU and passed all ribodbmaker tests, but did not match the above eutils query. The current fungal RefSeq SSU and LSU sequences can be obtained with the queries “PRJNA39195[BioProject]” and “PRJNA51803[BioProject]”, respectively, or both from the ftp site: https://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Fungi/.

For our comparison of fungal sequences, we used a curated set of 8,770 SSU sequences from Silva [44], a set of 1,461 SSU sequences from Silva in the phylum Microsporidia, a set of 2,993 high-quality LSU reference sequences from Silva, and a much larger set of 394,247 sequences from Silva called Parc [25, 45]. We denote these four sets as Yarza, SilvaMicrosporidia, SilvaRef, and SilvaParc, respectively.
To set up the SilvaMicrosporidia set, we downloaded the FASTA files for all of SilvaParc SSU and extracted 1,468 sequences labeled as being from the phylum Microsporidia; of these, 1,461 sequences were in GenBank with sufficient taxonomy information to be considered for ribodbmaker. SilvaParc contains fewer than 50 Microsporidia LSU sequences, supporting our previous assertion that the SSU has been much more studied than the LSU in Microsporidia. Similarly, to set up the SilvaRef and SilvaParc sets, we downloaded FASTA files for all SilvaRef LSU sequences for all taxa in version 132 on July 23, 2020 and all LSU Parc sequences in version 132 on August 1, 2020. We filtered for all sequences that had the token “Fungi” in the definition line. A small number of sequences had to be dropped subsequently because 1) they are not from the kingdom Fungi (e.g., they may be from a pathogen of a fungus), 2) they were absent from the nuccore database of GenBank, either due to being “unverified” or from certain types of patents, or 3) they had phylogenetic discrepancies (cf. [46]) that we subsequently fixed as part of the first objective listed above. In all data sets, we retrieved the most recent version of all GenBank accessions, which differs from the curated version for a very small number of sequences since version 132 of Silva is recent. In our analysis of the SilvaParc set, 96 non-fungal sequences were inadvertently included in the analysis, and excluded only while checking the results.

The results of the three comparisons are shown in Table 11. The main steps in these tests consisted of:
1. Download and uncompress a FASTA file of source sequences from the supplementary information of [44] or from the Silva FTP retrieval site, which we denote File1.fa.
2. As explained in the Ribovore documentation, retrieve and condense the current version of NCBI’s taxonomy tree.
An important and subtle column is the boolean (0/1) specified species column for each taxon; a 1 in this column for the row of taxon t means that according to NCBI’s Taxonomy Group the taxon name is valid and currently preferred; a 1 in this column is a necessary condition for a sequence from taxon t to be eligible to be in the rRNA databases or to be a RefSeq. Call the resulting file taxonomy.txt.

Table 11 Summary of ribodbmaker pass/fail outcomes for Yarza (SSU), SilvaMicrosporidia (SSU), SilvaRef (LSU), and SilvaParc (LSU) datasets. All tests except the ingroup analysis depend only on the sequence being tested. The four tests for ambiguous nucleotides, specified species, vector contamination, and self-repeats are done on all sequences, so sequences may fail more than one test. Only sequences that pass the ribotyper test are eligible as input to riboaligner. Only sequences that pass the riboaligner test are eligible to be tested for length and alignment span. Only sequences that pass all 1-sequence tests are eligible for ingroup analysis. The ingroup analysis can be done allowing many sequences from the same taxon to pass or limiting to 1 the number of sequences that pass from each taxon (argument --fione). The many option is a more meaningful test; we show the 1 option just for comparison.
dataset/                  Yarza       SilvaMicrosporidia  SilvaRef    SilvaParc
test                      pass/fail   pass/fail           pass/fail   pass/fail
Ambiguous nucleotides     8493/277    1435/26             2928/65     385607/8640
Specified species         7543/1227   774/687             2089/904    204923/189324
Vector contamination      8735/35     460/1               2989/4      393938/305
Self-repeats              8635/135    1418/43             2929/64     390305/3942
ribotyper                 8085/685    1140/321            2925/68     145450/248797
riboaligner               7720/1050   998/463             1583/1410   135252/258995
length in range?          7617/103    957/41              1187/396    132984/2268
expected span?            7612/5      669/288             990/197     103016/29968
all 1-sequence tests?     6359/2411   405/1056            759/2234    70127/324120
ingroup analysis(many)    5158/1201   288/117             649/110     56606/13521
ingroup analysis(1)       3246/3113   136/269             478/281     7129/52998

3 (For tests of Silva data only) Extract the definition lines for sequence identifiers of interest with the command: grep Fungi File1.fa | grep -v Bacteria or grep Microsporidia File1.fa | grep -v Bacteria, redirecting the output to an intermediate file. The second command in the pipe removes most sequences from fungal pathogens that are not actually from the kingdom of Fungi. For the Fungal SSU set, all sequences are of interest, so the simpler command grep ">" File1.fa extracts the definition lines.
4 Extract GenBank accessions without the versions from the definition lines at step 3. Versions are removed because for some sequences, the version in Silva has been superseded in GenBank with a newer version.
5 Use the NCBI package eutils [47] to retrieve from the nucleotide database of GenBank all currently live accessions from the accession sets derived at the previous step. Some sequences get dropped at this step because they are no longer live. Call the FASTA file at this step File2.fa.
6 Use the NCBI standalone tool srcchk (available at: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/srcchk/) to check which sequences in File2.fa have a valid and fully consistent taxonomy entry.
Remove sequences that do not get a normal result from srcchk because they will cause Ribovore to halt. A small number (well below 1%) of sequences get removed at this step either because they are engineered sequences from patents or because there are transient inconsistencies between the NCBI taxonomy tree and the organism values in the GenBank nucleotide records. Call the resulting file File3.fa.
7 Run ribodbmaker --taxin taxonomy.txt --skipclustr --model --fmlpos --fmrpos --fmnogap --fione --pidmax 71000 --indiffseqtax -f -p . The value of the --model argument was either SSU.Eukarya, SSU.Microsporidia, or LSU.Eukarya, depending on the test being done. The values of the --fmlpos and --fmrpos arguments are set in a model-specific manner according to the recommended values in the Ribovore documentation (ribodbmaker.md file in the documentation subdirectory). The 71000 is an upper bound on the number of sequences input in our various tests. Version 0.40 of Ribovore was used.
For the large test of SilvaParc, we ran the last ribodbmaker step separately on various subsets of File3.fa, then collected all sequences that passed all sequence-specific tests, and did a final run of ribodbmaker to include the ingroup analysis. This split into multiple runs achieves better parallelism and throughput for large versions of File3.fa because the ingroup analysis is the only step in which the results for any single sequence depend on which other sequences are included in the input. The numbers of eligible sequences reported in Table 11 are those included in File3.fa.
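Steps 3 and 4 of the list above (selecting definition lines and stripping accession versions) can be sketched in a few lines. This is an illustrative sketch, not the procedure actually used for the paper: the function name, the example definition lines, the accession XY000003, and the assumed layout of the identifiers (accession, then version and coordinates separated by periods) are all assumptions made for the example.

```python
def accessions_without_version(fasta_lines, keep_token=None, drop_token=None):
    """Return unique unversioned accessions from the selected definition lines."""
    seen = set()
    accessions = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        if keep_token is not None and keep_token not in line:
            continue  # mimics `grep keep_token`
        if drop_token is not None and drop_token in line:
            continue  # mimics `grep -v drop_token`
        fields = line[1:].split()
        if not fields:
            continue  # empty definition line
        # Keep only the part before the first period (drops version and coordinates).
        accession = fields[0].split(".")[0]
        if accession not in seen:
            seen.add(accession)
            accessions.append(accession)
    return accessions


# Hypothetical definition lines in an assumed Silva-like layout.
example = [
    ">AB000001.1.1700 Eukaryota;Opisthokonta;Fungi;Ascomycota",
    "ACGT",
    ">AB000002.2.1650 Bacteria;Proteobacteria (isolated from Fungi)",
    "ACGT",
    ">XY000003.1.900 Eukaryota;Opisthokonta;Fungi;Microsporidia",
    "ACGT",
]

fungal = accessions_without_version(example, keep_token="Fungi", drop_token="Bacteria")
everything = accessions_without_version(example)
print(fungal)
print(everything)
```

Calling the function without tokens corresponds to the simpler grep ">" File1.fa case described for the fungal SSU set.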
The results are slightly sensitive to changes in the NCBI taxonomy tree, which is updated daily. For the Yarza tests, we used the NCBI taxonomy tree as of August 1, 2020 and for the Silva tests we used the tree as of September 22, 2020. It was not our intent to compare sets of “passing” sequences because the criteria for fungal RefSeq records are deliberately more stringent than for inclusion in Silva. Most notable are the two taxonomic tests: 1) that each sequence should come from a specified species and 2) that in selecting sequences for RefSeq, we may choose to keep only one sequence per species taxid, as specified by the command-line option --fione, to avoid redundancy. Indeed, the analysis of fungal sequences from Silva yielded some new fungal RefSeq records; specifically, we added 4 SSU sequences from the Yarza set, and 10 LSU sequences from the SilvaRef and SilvaParc data sets. However, the test results also show some possible improvements in the Silva curation. It appears that a small number of Silva sequences have vector contamination and more than 1% may be misassembled as indicated by self-repeats, which are not expected in fungal SSU and LSU rRNA genes (see Methods, subsection ribodbmaker for the self-repeat criteria). It appears that the Yarza SSU data set was carefully curated for sequences to be full-length and not too long, but in the SilvaRef and SilvaParc data sets more than 20% of sequences either have a length that is out of the range of typical eukaryotic LSU sequences or do not span the range [124,627] that includes the D1/D2 regions typically covered for species differentiation. Thus, it appears that the Silva resource curation could arguably be improved by checking sequence ends, so as to trim long sequences, remove short sequences, and remove sequences that are unlikely to be full LSU sequences. The sequences could be too long either because they were not trimmed to the LSU boundary or possibly because they contain introns.
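The "more than 20%" statement can be checked against Table 11 with a short calculation. The sketch below takes the sequences that passed riboaligner as the denominator, which is one reading of the table (the length and span tests are applied only to those sequences); the dictionary layout and function name are illustrative assumptions.

```python
# (pass, fail) counts copied from Table 11 for three of the four data sets.
TABLE11 = {
    "Yarza": {"riboaligner": (7720, 1050), "length": (7617, 103), "span": (7612, 5)},
    "SilvaRef": {"riboaligner": (1583, 1410), "length": (1187, 396), "span": (990, 197)},
    "SilvaParc": {
        "riboaligner": (135252, 258995),
        "length": (132984, 2268),
        "span": (103016, 29968),
    },
}


def length_or_span_failure_rate(rows):
    """Fraction of riboaligner-passing sequences that fail the length or span test."""
    aligned = rows["riboaligner"][0]
    failed = rows["length"][1] + rows["span"][1]
    return failed / aligned


rates = {name: length_or_span_failure_rate(rows) for name, rows in TABLE11.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Under this reading the rate is about 1.4% for Yarza, 37.5% for SilvaRef, and 23.8% for SilvaParc, consistent with the contrast drawn in the text.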
In the SilvaMicrosporidia test, we tested for the presence of the most conserved V4 and V5 regions only with a permissive expected span of [380,800].

Partial comparison to RNAmmer
To our knowledge, there is no other software that solves the rRNA validation problem as we have formulated it for GenBank submissions. One widely used software package that solves a related problem is RNAmmer [27]. The problem that RNAmmer solves is to find likely rRNA sequences within larger sequences by using an old version of HMMER (v2.3.2) to compare against one of six profile HMMs. To accelerate searches, RNAmmer first utilizes a small spotter profile HMM that models only the most conserved 75 consecutive positions of the overall rRNA alignment to detect rRNA regions, padding those regions with extra sequence on each end, and then uses a profile of the full rRNA to determine gene boundaries within those padded regions. So far as we could determine, RNAmmer does not work with the up-to-date HMMER version 3 nor with arbitrary profile HMMs or CMs. Nevertheless, one can provide as input to RNAmmer FASTA files of putative rRNA sequences, pretending that they were larger contigs. The module of Ribovore that is closest in purpose to this usage of RNAmmer is ribotyper. To allow comparison between RNAmmer and ribotyper, one can define that a sequence passes RNAmmer if RNAmmer produces in the output at least one HMMER-based prediction for that sequence using the intended rRNA model (e.g., eukaryotic SSU for the Yarza set) and fails if there are zero such predictions.
This comparison is unfair to RNAmmer because it does not use the predicted intervals, which are the most useful part of the output when RNAmmer is used with large contigs as inputs. We compared the performance of ribotyper versus RNAmmer on the Yarza SSU set and the SilvaRef LSU set. We used the ribotyper results obtained from the ribodbmaker tests described above and summarized in Table 11. Among the 8,770 SSU sequences in the Yarza set: 7,999 passed both ribotyper and RNAmmer, 23 failed both ribotyper and RNAmmer, 665 passed RNAmmer and failed ribotyper, and 86 passed ribotyper and failed RNAmmer. Among the set of 665, 557 sequences include “internal transcribed spacer” or “ITS” in the definition line, and 69 of the other 108 have lengths above 5,000 nt, indicating that all but at most 39 of the sequences likely include sequence outside the SSU rRNA sequence (which is rarely more than 3 Kb) and so are expected to fail ribotyper. Of the 86 sequences that passed ribotyper and failed RNAmmer, 46 would pass RNAmmer if the E-value and bit score thresholds for the spotter HMM, which are hard-coded at 1E-5 and 0, were changed to 1 and -100 in the rnammer Perl script, indicating that these sequences do not match well to the spotter profile HMM used for eukaryotic SSU rRNA. Among the 2,993 LSU sequences in the SilvaRef set: 2,481 passed both RNAmmer and ribotyper, 21 failed both RNAmmer and ribotyper, 47 passed RNAmmer and failed ribotyper, and 444 passed ribotyper and failed RNAmmer. Among the 47 sequences that passed RNAmmer and failed ribotyper, 44 can be explained because they have one of three errors that ribotyper looks for and that would not necessarily lead RNAmmer to have no output matches: R_DuplicateRegion (24 sequences), R_BothStrands (7 sequences), R_MultipleFamilies (13 sequences).
Many of the 47 sequences are described on the definition lines as a “shotgun assembly”; accordingly, the R_DuplicateRegion and R_BothStrands errors indicate two different errors that occur commonly in assembling nucleotide sequences into contigs. Some of the sequences have both SSU and LSU in the definition lines and, if accurate, that should lead to an R_MultipleFamilies error. These 13 sequences that match both genes could have been trimmed before inclusion in the SilvaRef LSU set. In principle, one could detect the presence of an SSU match and an LSU match in the same sequence with RNAmmer, but one would have to add error rules to RNAmmer to decide when the occurrence of matches to both SSU and LSU models is an error. That need for error semantics exemplifies how ribotyper differs in functionality from RNAmmer. Of the 444 sequences that passed ribotyper and failed RNAmmer, 433 would pass RNAmmer if, as for the Yarza set, the E-value and bit score thresholds for the spotter HMMs were changed to 1 and -100, respectively, indicating that these sequences do not match well to the eukaryotic LSU rRNA spotter profile HMM. In general, there are a large number of discrepant outcomes and neither RNAmmer nor ribotyper is consistently more restrictive than the other. We infer merely that the two pieces of software solve different problems and there is not a straightforward way to modify RNAmmer to solve the problem of checking rRNA submissions to GenBank. This helps to justify why we developed the new software Ribovore.
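The pass/fail counts reported above for ribotyper and RNAmmer form a 2x2 table for each data set, and the marginal totals can be recomputed for comparison with Table 11. The sketch below uses only the counts as printed in the text; the key names and the summarize helper are illustrative assumptions.

```python
# Counts as printed in the text for the two data sets compared with RNAmmer.
COUNTS = {
    "Yarza SSU": {
        "both_pass": 7999, "both_fail": 23, "rnammer_only": 665, "ribotyper_only": 86,
    },
    "SilvaRef LSU": {
        "both_pass": 2481, "both_fail": 21, "rnammer_only": 47, "ribotyper_only": 444,
    },
}


def summarize(c):
    """Derive totals and the discordant fraction from one 2x2 table of counts."""
    total = sum(c.values())
    discordant = c["rnammer_only"] + c["ribotyper_only"]
    return {
        "total": total,
        "ribotyper_pass": c["both_pass"] + c["ribotyper_only"],
        "rnammer_pass": c["both_pass"] + c["rnammer_only"],
        "discordant_fraction": discordant / total,
    }


summary = {name: summarize(c) for name, c in COUNTS.items()}
for name, s in summary.items():
    print(name, s)
```

For SilvaRef the four categories sum to 2,993 and the ribotyper pass count (2,925) agrees with Table 11. For Yarza the ribotyper pass count (8,085) also agrees with Table 11, while the four counts as printed sum to 8,773, three more than the 8,770 sequences in the set, so one of the printed figures may be slightly off.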
As explained above, Ribovore also has additional modules, such as ribodbmaker, that are even less comparable to RNAmmer and solve other problems in rRNA sequence validation and curation.

Limitations and future directions
Ribovore includes 18 profile models (Table 3), only two of which are used for automated submission checking (bacterial SSU rRNA and archaeal SSU rRNA), and seven of which (the first seven rows in Table 3) have been used in the context of ribodbmaker to generate one or more blastn databases or RefSeq records. Eight of the remaining models were created from alignments of fewer than 10 sequences, and need to be improved by adding more sequences. However, some of the models, especially those based on Rfam alignments such as eukaryotic SSU and LSU rRNA, could in principle be used for submission checking by ribosensor and we plan to investigate those possibilities based on empirical testing in the future. Beyond the existing models, more models are needed for other rRNA genes and taxonomic domains, such as mitochondrial LSU rRNA, Microsporidia LSU rRNA, eukaryotic 5.8S rRNA and 5S rRNA. Rfam includes alignments for some of these (e.g. 5.8S and 5S rRNA) and future versions of Ribovore could include models based on those, but manual curation effort will be required to create others. One limitation of Ribovore is that there are many parameters and the user may need to choose the settings carefully for each distinct purpose. For example, the usage of ribodbmaker should be tuned for each gene and taxonomic domain, as we have reported here for fungal SSU and LSU rRNA to require the commonly targeted regions of those respective genes to be present in the sequences. Another limitation is that we do not model introns, simply expecting any introns to be unaligned in the ribotyper and riboaligner analysis. Additionally, minimum criteria (e.g.
minimum score and coverage values) for passing sequences in the ribotyper, ribosensor and riboaligner tools should be set based on empirical testing, and the default values for those programs are currently tailored to prokaryotic SSU rRNA based on our internal testing. Expansion to other genes and taxonomic domains will require additional testing of those values. For some applications, the running time of Ribovore programs can be a significant limitation. Profile-based CM or profile HMM methods that compare a few profiles (in this case, at most 18) to each input sequence can be more efficient than single-sequence based methods like blastn, which typically compare many database sequences (in this case, more than 1000) to each input sequence, but of course this depends on the relative speed of each profile-to-sequence and sequence-to-sequence comparison. CM methods that score both sequence and secondary structure conservation are computationally complex. On a single CPU, alignment of a single full-length LSU rRNA sequence typically takes several seconds. For this reason, ribotyper and ribosensor, which are intended to handle sequence submissions of up to millions of sequences, do not, by default, compute an alignment using the CM, but rather use only more efficient profile HMM algorithms. The riboaligner and ribodbmaker programs, however, do compute alignments using CMs and so take longer per sequence, although the frequency with which these programs need to be run, at least for GenBank, is less.
The ribosensor program, which runs both ribotyper and rRNA_sensor, which is blastn-based, combines both profile and single-sequence methods. We measured the running time of rRNA_sensor by itself and ribosensor on 1000 16S sequences using 1, 2, 4, 8, and 16 processors. The rRNA_sensor program took 199s, 104s, 54s, 29s, and 17s, respectively; ribosensor took 362s, 196s, 106s, 61s, and 39s, respectively. Thus, assuming that running time scales linearly with the number of sequences, a submission of 1 million sequences, which is on the high end, would take about 11 hours to process with ribosensor given 16 processors on the host computer. The programs ribotyper, ribosensor, riboaligner, and ribodbmaker all include a command-line option -p that enables finer-grained parallelization by splitting the input file into roughly equal sized chunks and processing each independently on nodes on a compute cluster. However, for ribodbmaker, only the ribotyper and riboaligner steps are parallelized in this way. While doing the comparison of curated fungal datasets, we also timed ribotyper, riboaligner, and ribodbmaker on the 8,770 sequence Yarza fungal SSU set and the 2,993 sequence SilvaRef fungal LSU set as described in Implementation. The ribotyper program required 40m 14s (0.275s per sequence) on the Yarza set and 16m 51s (0.338s per sequence) on the SilvaRef set. The riboaligner program took 116m 22s (0.796s per sequence) on the Yarza set and 123m 56s (2.48s per sequence) on the SilvaRef set. The ribodbmaker program took 49m 49s wall-clock time and 1191m 2s cumulative time for all processors on the Yarza set and 60m 24s wall-clock time and 980m 11s cumulative time for the SilvaRef set. In general, we conclude that these analyses are tractable for tens of thousands of sequences at a time.

Conclusions
Our primary contribution described herein is the software package Ribovore for rRNA sequence analysis.
At NCBI since July 2018, Ribovore has been used to check the quality of incoming submissions and to curate datasets of high-quality sequences for RefSeq or to use as blastn databases. In the submission checking context, Ribovore has been used to check nearly 50 million 16S bacterial and archaeal SSU rRNA sequences through May 31, 2020 and millions more after that date. Ribovore has also been used manually by GenBank indexers when blastn analyses gave uncertain results for other rRNAs. A subset of the blastn databases created by Ribovore are selectable by users of the BLAST webpage as target databases, and are used in over 2,000 web blastn runs per day. We are also using Ribovore internally to curate fungal RefSeq records for SSU and LSU rRNA from type material. We showed that this curation effort is complementary to the larger Silva effort, as it selects only the best sequences that pass a larger battery of tests. Furthermore, the RefSeq records are linked within Entrez to other NCBI resources including BioCollections, BioProjects, Taxonomy, and BLAST. With this formal report of how Ribovore is designed and implemented, we hope that both producers and consumers of rRNA sequence data will achieve a new understanding of how rRNA sequences are curated in GenBank, RefSeq, and associated resources.
Availability and requirements
Project name: Ribovore
Project home page: https://github.com/ncbi/ribovore
Operating system(s): Linux, Mac/OSX
Programming language: Perl
Other requirements: BLAST+ v2.11.0, Infernal v1.1.4, Sequip v0.08, VecScreen plus taxonomy v0.17 (see Table 2)
License: public domain
Any restrictions to use by non-academics: none

Abbreviations
NCBI: National Center for Biotechnology Information; rRNA: ribosomal RNA; SSU rRNA: small subunit ribosomal RNA; LSU rRNA: large subunit ribosomal RNA; CM: covariance model; HMM: hidden Markov model; nt: nucleotides; Kb: kilobase (1000 nucleotides).

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analyzed during this study are included in this published article, its supplementary material, or NCBI’s GenBank database. Code is available on GitHub (https://github.com/ncbi/ribovore). BLAST databases are available in the directory https://ftp.ncbi.nlm.nih.gov/blast/db/. The supplementary material includes instructions for reproducing the comparisons reported in the article.

Competing interests
The authors declare that they have no competing interests.

Funding
This research was supported by the Intramural Research of the National Institutes of Health, National Library of Medicine (NLM) and National Cancer Institute.

Authors’ contributions
AAS and EPN conceived of and designed the software and wrote most of the paper. RM, BR, CS assisted by writing passages about Fungi and about practical usages of Ribovore within NCBI. EPN wrote most of the Ribovore code and AAS wrote rRNA_sensor. All authors participated in the design and user interface of at least one Ribovore module. EPN, AAS, RM, BR, AJ, BAU formally tested the software. RM curated the rRNA databases. BR selected and curated the fungal RefSeqs.
CS guided the multiple usages of NCBI taxonomy in Ribovore and corrected taxonomy inconsistencies as they were detected. EPN, RM, AJ, and BAU collected data on Ribovore usage. RM, AJ, and BAU used Ribovore to evaluate submissions to GenBank. IK-M supervised the work of RM, AJ, and BAU. All authors read and edited multiple versions of the manuscript and approved the final version.

Acknowledgements
Thanks to our NCBI colleagues Alex Kotliarov and Sergiy Gotvyanskyy for assistance in integrating Ribovore into GenBank processing pipelines and for collecting data on Ribovore usage. Thanks to our NCBI colleague Richa Agarwala for providing access to an isolated Linux computer on which we could do sole-user measurements of running time.

Author details
1 Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892 USA. 2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894 USA.

References
1. Woese CR, Fox GE. Phylogenetic Structure of the Prokaryotic Domain: the Primary Kingdoms. Proc Natl Acad Sci USA. 1977;74:5088–5090.
2. Pace NR, Stahl DA, Lane DJ, Olsen GJ. Analyzing Natural Microbial Populations by rRNA Sequences. ASM News. 1985;51:4–12.
3. Weller R, Ward DM. Selective Recovery of 16S rRNA Sequences From Natural Microbial Communities in the Form of cDNA. Appl Environ Microbiol. 1989;55:1818–1822.
4. Giovannoni SJ, Britschgi TB, Moyer CL, Field KG. Genetic Diversity in Sargasso Sea Bacterioplankton. Nature. 1990;345:60–63.
5.
Fox GE, Pechman KR, Woese CR. Comparative Cataloging of 16S Ribosomal Ribonucleic Acid: Molecular Approach to Procaryotic Systematics. Int J Syst Evol Microbiol. 1977;27:44–57.
6. Betzl D, Ludwig W, Schleifer KH. Identification of Lactococci and Enterococci by Colony Hybridization with 23S rRNA-targeted Oligonucleotide Probes. Appl Environ Microbiol. 1990;56:2927–2929.
7. Amann RI, Ludwig W, Schleifer KH. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 1995;59:143–169.
8. Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Levesque CA, et al. Nuclear Ribosomal Internal Transcribed Spacer (ITS) Region as a Universal DNA Barcode Marker for Fungi. Proc Natl Acad Sci USA. 2012;109:6241–6246.
9. Peterson SW, Kurtzman CP. Ribosomal RNA sequence divergence among sibling species of yeasts. Syst Appl Microbiol. 1991;14:124–129.
10. Pawlowski J, Audic S, Adl S, Bass D, Belbhari L, Berney C, et al. The Significance of a Confidence Between Evolutionary Landmarks Found in Mating Affinity and a DNA Sequence. PLoS Biol. 2012;10:e1001419.
11. Zimmerman J, Hahn R, Geimenholzer B. Barcoding Diatoms: Evaluation of the V4 Subregion on the 18S rRNA Gene, Including New Primers and Protocols. Organism Diversity Evol. 2011;11:173.
12. Eddy SR. Profile Hidden Markov Models. Bioinformatics. 1998;14:755–763.
13. Karplus K, Barrett C, Hughey R. Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics. 1998;14:846–856.
14. Eddy SR, Durbin R. RNA Sequence Analysis Using Covariance Models. Nucleic Acids Res. 1994;22:2079–2088.
15. Sakakibara Y, Brown M, Underwood RC, Mian IS, Haussler D. Stochastic Context-Free Grammars for Modeling RNA. In: Hunter L, editor. Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences: Biotechnology Computing. vol. V. Los Alamitos, CA: IEEE Computer Society Press; 1994. p. 284–293.
16. Durbin R, Eddy SR, Krogh A, Mitchison GJ.
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge UK: Cambridge University Press; 1998.
17. Freyhult EK, Bollback JP, Gardner PP. Exploring Genomic Dark Matter: A Critical Assessment of the Performance of Homology Search Methods on Noncoding RNA. Genome Res. 2007;17:117–125.
18. Kolbe DL, Eddy SR. Local RNA Structure Alignment With Incomplete Sequence. Bioinformatics. 2009;25:1236–1243.
19. Nawrocki EP. Structural RNA Homology Search and Alignment Using Covariance Models [Ph.D. thesis]. Washington University School of Medicine; 2009.
20. Ludwig W, Strunk O, Westram R, Richter L, Meier H, et al. ARB: a Software Environment for Sequence Data. Nucleic Acids Res. 2004;32:1363–1371.
21. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, et al. The Comparative RNA Web (CRW) Site: an Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BMC Bioinformatics. 2002;3:2.
22. Olsen GJ, Larsen N, Woese CR. The Ribosomal RNA Database Project. Nucleic Acids Res. 1991;19:2017–2021.
23. Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al. Ribosomal Database Project: Data and Tools for High Throughput rRNA Analysis. Nucleic Acids Res. 2014;42:D633–D642.
24. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible With ARB. Appl Environ Microbiol. 2006;72:5069–5072.
25. Pruesse E, Quast C, Knittel K, Fuchs BM, Peplies J, Glöckner FO. SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible With ARB. Nucleic Acids Res. 2007;35:7188–7196.
26. Glöckner FO, Yilmaz P, Quast C, Gerken J, Beccati A, Ciuprina A, et al. 25 Years of Serving the Community with Ribosomal RNA Gene Reference Databases and Tools. J Biotechnol. 2017;261:169–176.
27. Lagesen K, Hallin P, Rødland EA, Staerfeldt H, Rognes T, Ussery DW.
RNAmmer: Consistent and Rapid Annotation of Ribosomal RNA Genes. Nucleic Acids Res. 2007;35:3100–3108.
28. Lee JH, Yi H, Chun J. rRNASelector: a Computer Program for Selecting Ribosomal RNA Encoding Sequences From Metagenomic and Metatranscriptomic Shotgun Libraries. J Microbiol. 2011;49:689–691.
29. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:e1002195.
30. Pruesse E, Peplies J, Glöckner FO. SINA: Accurate High Throughput Multiple Sequence Alignment of Ribosomal RNA. Bioinformatics. 2012;28:1823–1829.
31. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold Faster RNA Homology Searches. Bioinformatics. 2013;29:2933–2935.
32. Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2020 11;Gkaa1047.
33. Vossbrink CR, Maddox JV, Fredman S, Debrunner-Vossbrinck BA, Woese CR. Ribosomal RNA Sequence Suggests Microsporidia are Extremely Ancient Eukaryotes. Nature. 1987;326:411–414.
34. Barandun J, Hunziker M, Vossbrink CR, Klinge S. Evolutionary Compaction and Adaptation Visualized by the Structure of the Dormant Microsporidia Ribosome. Nat Microbiol. 2019;4:1798–1804.
35. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference Sequence (RefSeq) Database at NCBI: Current Status, Taxonomic Expansion, and Functional Annotation. Nucleic Acids Res. 2016;44:D733–D745.
36. Edgar RC. Search and Clustering Orders of Magnitude Faster than BLAST. Bioinformatics. 2010;26:2460–2461.
37. Wheeler TJ, Eddy SR. nhmmer: DNA Homology Search With Profile HMMs. Bioinformatics. 2013;29:2487–2489.
38. Schäffer AA, Hatcher EL, Yankie L, Shonkwiler L, Brister JR, Karsch-Mizrachi I, Nawrocki EP. VADR: Validation and Annotation of Virus Sequence Submissions to GenBank. BMC Bioinformatics. 2020;21:211.
39. Schäffer AA, Nawrocki EP, Choi Y, Kitts PA, Karsch-Mizrachi I, McVeigh R. VecScreen plus taxonomy: Imposing a Tax(onomy) Increase on Vector Contamination Screening. Bioinformatics. 2018;34:755–759.
40. Nawrocki EP. The SSU-ALIGN User’s Guide; 2016. [http://eddylab.org/software/ssu-align/Userguide.pdf].
41. Liu K, Porras-Alfaro A, Kuske CR, Eichorst SA, Xie G. Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes. Appl Environ Microbiol. 2012;78:1523–1533.
42. Hadziavdic K, Lekang K, Lanzen A, Jonassen I, Thompson EM. Characterization of the 18S rRNA Gene for Designing Universal Eukaryotic Specific Primers. PLoS ONE. 2014;9:e87624.
43. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997;25:3389–3402.
44. Yarza P, Yilmaz P, Panzer K, Glöckner FO, Reich M. A Phylogenetic Framework for the Kingdom Fungi Based on 18S rRNA Gene Sequences. Mar Genomics. 2017;36:33–39.
45. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools. Nucleic Acids Res. 2013;41:D590–D596.
46. Kozlov AM, Zhang J, Yilmaz P, Glöckner FO, Stamatakis A. Phylogeny-Aware Identification and Correction of Taxonomically Mislabeled Sequences. Nucleic Acids Res. 2016;44:5022–5033.
47. Sayers E. Entrez Programming Utilities Help [Internet]; 2010-. [https://www.ncbi.nlm.nih.gov/books/NBK25501/].

Additional Files
Additional file 1
We provide ribovore-paper-supplemental-material.tar.gz, a gzipped tar archive with sequence files and instructions for reproducing the tests of Ribovore and RNAmmer described in Results and discussion, that includes a 00README.txt with file descriptions. Unpack with the command ’tar xf ribovore-paper-supplementary-material.tar.gz’.

10_1101-2021_02_11_430806 ---- BIAPSS - BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences Aleksandra E. Badaczewska-Dawid1 and Davit A. Potoyan1,2,3 1 Department of Chemistry, Iowa State University, Ames IA 50011 USA 2 Department of Biochemistry Biophysics and Molecular Biology, Iowa State University, Ames IA 50011 USA 3 Bioinformatics and Computational Biology program, Iowa State University, Ames IA 50011 USA Liquid-liquid phase separation (LLPS) has recently emerged as a foundational mechanism for order and regulation in biology. However, a quantitative molecular grammar of protein sequences underlying LLPS remains unclear. Comprehensive databases and associated computational infrastructure for biophysical and statistical analysis can enable rapid progress in the field.
Therefore, we have created a novel open-source web platform named BIAPSS (BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences), which offers users interactive data-analytic tools that facilitate the discovery of statistically significant sequence signals for proteins with LLPS behavior. Availability: BIAPSS is freely available online at https://biapss.chem.iastate.edu/. The website is implemented within the Python framework using HTML, CSS, and Plotly-Dash graphing libraries, and supports all major browsers, including on mobile devices.
LLPS | BIAPSS | Plotly-Dash
Correspondence: abadacz@iastate.edu, potoyan@iastate.edu
Introduction
In the past few years, LLPS of biomolecules has become a universal language for interpreting intracellular signaling, compartmentalization, and regulation (1–5). The ability to phase separate appears to be encoded primarily in the protein sequences, frequently containing disordered and low complexity domains, which are enriched in charged and multivalent interaction centers (6–8). Nevertheless, the quantitative aspects of how amino acids encode and decode the phase separation remain largely unknown (9–11). This is because many different combinations of relevant interactions seem to contribute to phase separation without any one being universally necessary (12). So far, however, with a few exceptions (13–16), sequences have mostly been studied case by case, leaving the broader context of many findings, including their statistical significance, unknown. To this end, we have developed a web framework, BIAPSS: BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences. The objective of BIAPSS is to enable rapid, on-the-fly, deep statistical analysis of LLPS-driver proteins using the pool of sequences with empirically confirmed phase behavior.
Implementation
The back-end processing pipeline of BIAPSS is implemented in a Python framework, where in-house developed algorithms parse pre-computed data and perform on-the-fly analysis. The basic front-end user interface of the BIAPSS web platform is implemented with HTML5, CSS, JavaScript, and Bootstrap components, which support the responsiveness and mobile accessibility of the website. Specifically, our cross-platform framework is adjusted to run on multiple operating systems and popular browsers. Modern display-layer solutions improve user experience by enabling smooth loading of contents and page transitions and by accompanying an in-depth presentation of the results. For instance, we included a lightbox slideshow with a brief overview of the features, a collapsed menu, modal images of the quick guide within individual applications, side navigation, and more. Interactive graph plotting and data visualization, accessible through web applications in the SingleSEQ and MultiSEQ tabs, were developed with the Plotly-Dash (17) browser-based graphing libraries for Python, which create a user-responsive environment and follow remote, customized instructions. Thanks to the interactive interface, users can go directly from exploratory analytics to the creation of publication-ready, high-quality images.
Results
BIAPSS is designed as a user-friendly web platform intended to serve as a central resource for systematic and standardized statistical analysis of biophysical characteristics of known LLPS sequences. The web service provides users with (i) a database of the superset of experimentally evidenced LLPS-driver protein sequences, (ii) a repository of pre-computed bioinformatics and statistics data, and (iii) two sets of web applications supporting the interactive analysis and visualization of physicochemical and biomolecular characteristics of LLPS proteins.
The initial LLPS sequence set leverages the data from manually curated primary LLPS databases, namely PhaSePro (15) and LLPSDB (16). Given that the number of experimentally confirmed LLPS driver proteins is constantly growing, the BIAPSS pre-computed repository is updated annually and released to the public, which saves users significant time by eliminating the need for exhaustive in-house calculations.
[Badaczewska-Dawid et al. | bioRχiv | February 4, 2021 | 1–3. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. bioRxiv preprint doi: https://doi.org/10.1101/2021.02.11.430806; this version posted February 12, 2021.]
The apps integrate the results from our extensive studies, described in more detail elsewhere. One of the aims of BIAPSS is to gain insight into the overall characteristics of a sufficient non-redundant set of LLPS-driver protein sequences. The comparison to benchmarks of various protein groups enables statistical inference of specific phase-separating affinities. Furthermore, the residue-resolution biophysical regularities inferred from BIAPSS will help not only to accurately identify regions prone to phase separation but also to design sequence modifications targeting various biomedical applications. The extended Cross-References section is designed as a central navigation hub that lets researchers keep track of the corresponding entries in the primary LLPS databases along with other external resources relevant to the phase separation field. Since many users have specific single sequences of interest (natural or designed), our future efforts will be directed towards the creation of an upload section to parse user-defined cases and compare them with the benchmark of known LLPS-driver proteins.
The layout and main functionalities of BIAPSS services are summarized in Figure 1. The general outline of the platform is designed to provide clarity and intuitive navigation by avoiding an excess of permanently visible information. Because of the multitude of analyses available to meet the needs of a diverse audience of scientists, the extensive content of BIAPSS has been divided into 5 main tabs. The Home tab is where the user gets a high-level overview of the features of BIAPSS services. Next comes the SingleSEQ tab, which is dedicated to the exploration of individual LLPS sequence characteristics. Besides a case summary and cross-reference section, there are multiple web applications dedicated to the in-depth analysis of biomolecular features, such as sequence conservation with multiple sequence alignment (MSA) (18), various sequence-based predictions by state-of-the-art methods for secondary structure (18–23), solvent accessibility (22–24), structural disorder (22, 25–29), contact maps (22, 27, 29), and uniquely proposed detection of numerous short linear motifs (SLiMs) (30–34), recently highlighted as key regions for driving LLPS (35). The MultiSEQ tab provides the user with a set of web applications for a broad array of statistics on a superset of LLPS sequences. There, one may investigate the regularities and trends specific only to disordered regions, such as amino acid (AA) composition, including AA diversity or regions rich in a given AA; general physicochemical patterns of polarity and hydrophobicity; the distribution of aromatic or charged residues, including not only the overall net charge but also charge decoration parameters that emerged as a relevant factor for electrostatic interactions of intrinsically disordered proteins (IDPs) (36); and more.
Also, a deeper focus on the general frequency of particular short linear motifs, including LARKS (31), GARs (32), ELMs (30), and steric zippers (34), as well as pioneering identification of specific n-mers, can bring new perspectives to the field. The Download tab facilitates access to the BIAPSS repository. The available data include raw predictions pre-calculated using well-established tools as well as the findings of our deep statistical analysis. For the convenience of users, we have unified and integrated the pre-processed results into a standardized CSV format accompanied by intuitive descriptors to facilitate reuse; specifically, this allows researchers to use the pre-computed data directly or carry out further analysis. Finally, in the Docs tab, the user can follow the detailed data-analytic workflow and learn more about the tools used, with corresponding references to the original literature. The documentation also includes an easy-to-use tutorial dedicated to individual web applications, where all of the features are presented graphically with detailed descriptions (see also the user's manual attached in the Supplementary information).
Fig. 1. The overall layout of BIAPSS web platform (https://biapss.chem.iastate.edu/) for comprehensive sequence-based analysis of LLPS proteins. The core of the implemented web applications and data repository is contained in the SingleSEQ, MultiSEQ, and Download tabs.
Funding
A.E.B-D. acknowledges generous financial support from the Roy J. Carver Charitable Trust through the Iowa State University Bioscience Innovation Postdoctoral Fellowship. This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health [R35GM138243 to D.A.P.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflict of Interest: none declared.
Author Contribution
Conceptualization, A.E.B-D.; Software development, A.E.B-D.; Writing an original draft, A.E.B-D. and D.A.P.
1. Clifford P Brangwynne, Christian R Eckmann, David S Courson, Agata Rybarska, Carsten Hoege, Jöbin Gharakhani, Frank Jülicher, and Anthony A Hyman. Germline P granules are liquid droplets that localize by controlled dissolution/condensation. Science, 324(5935):1729–1732, June 2009.
2. Clifford P Brangwynne, Timothy J Mitchison, and Anthony A Hyman. Active liquid-like behavior of nucleoli determines their size and shape in Xenopus laevis oocytes. Proc. Natl. Acad. Sci. U. S. A., 108(11):4334–4339, March 2011.
3. Iain A Sawyer, Jiri Bartek, and Miroslav Dundr. Phase separated microenvironments inside the cell nucleus are linked to disease and regulate epigenetic state, transcription and RNA processing. Semin. Cell Dev. Biol., July 2018.
4. Sudeep Banjade, Qiong Wu, Anuradha Mittal, William B Peeples, Rohit V Pappu, and Michael K Rosen. Conserved interdomain linker promotes phase separation of the multivalent adaptor protein Nck. Proc. Natl. Acad. Sci. U. S. A., 112(47):E6426–35, November 2015.
5. Sudeep Banjade and Michael K Rosen. Phase transitions of multivalent proteins can promote clustering of membrane receptors. Elife, 3, October 2014.
6. Jeong-Mo Choi, Alex S Holehouse, and Rohit V Pappu. Physical principles underlying the complex biology of intracellular phase transitions. Annu. Rev. Biophys., January 2020.
7.
Jie Wang, Jeong-Mo Choi, Alex S Holehouse, Hyun O Lee, Xiaojie Zhang, Marcus Jahnel, Shovamayee Maharana, Régis Lemaitre, Andrei Pozniakovsky, David Drechsel, Ina Poser, Rohit V Pappu, Simon Alberti, and Anthony A Hyman. A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell, 174(3):688–699.e16, July 2018.
8. Gregory L Dignon, Robert B Best, and Jeetain Mittal. Biomolecular phase separation: From molecular driving forces to macroscopic properties. Annu. Rev. Phys. Chem., 71:53–75, April 2020.
9. Castrense Savojardo, Pier Luigi Martelli, and Rita Casadio. Protein–Protein interaction methods and protein phase separation. Annu. Rev. Biomed. Data Sci., 3(1):89–112, July 2020.
10. Wade Borcherds, Anne Bremer, Madeleine B Borgia, and Tanja Mittag. How do intrinsically disordered protein regions encode a driving force for liquid-liquid phase separation? Curr. Opin. Struct. Biol., 67:41–50, October 2020.
11. Boris Y Zaslavsky, Luisa A Ferreira, and Vladimir N Uversky. Driving forces of Liquid-Liquid phase separation in biological systems. Biomolecules, 9(9), September 2019.
12. Brian Tsang, Iva Pritišanac, Stephen W Scherer, Alan M Moses, and Julie D Forman-Kay. Phase separation as a missing mechanism for interpretation of disease mutations. Cell, 183(7):1742–1756, December 2020.
13. Kadi L Saar, Alexey S Morgunov, Runzhang Qi, William E Arter, Georg Krainer, Alpha Albert Lee, and Tuomas Knowles. Machine learning models for predicting protein condensate formation from sequence determinants and embeddings. October 2020.
14. Kaiqiang You, Qi Huang, Chunyu Yu, Boyan Shen, Cristoffer Sevilla, Minglei Shi, Henning Hermjakob, Yang Chen, and Tingting Li. PhaSepDB: a database of liquid-liquid phase separation related proteins. Nucleic Acids Res., 48(D1):D354–D359, January 2020.
15.
Bálint Mészáros, Gábor Erdős, Beáta Szabó, Éva Schád, Ágnes Tantos, Rawan Abukhairan, Tamás Horváth, Nikoletta Murvai, Orsolya P Kovács, Márton Kovács, Silvio C E Tosatto, Péter Tompa, Zsuzsanna Dosztányi, and Rita Pancsa. PhaSePro: the database of proteins driving liquid-liquid phase separation. Nucleic Acids Res., 48(D1):D360–D367, January 2020.
16. Qian Li, Xiaojun Peng, Yuanqing Li, Wenqin Tang, Jia’an Zhu, Jing Huang, Yifei Qi, and Zhuqing Zhang. LLPSDB: a database of proteins undergoing liquid–liquid phase separation in vitro. Nucleic Acids Res., September 2019.
17. Plotly Technologies Inc. Collaborative data science, 2015.
18. Jaina Mistry, Robert D Finn, Sean R Eddy, Alex Bateman, and Marco Punta. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res., 41(12):e121, July 2013.
19. Damiano Piovesan, Ian Walsh, Giovanni Minervini, and Silvio C E Tosatto. FELLS: fast estimator of latent local structure. Bioinformatics, 33(12):1889–1891, June 2017.
20. Rhys Heffernan, Kuldip Paliwal, James Lyons, Jaswinder Singh, Yuedong Yang, and Yaoqi Zhou. Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. J. Comput. Chem., 39(26):2210–2216, October 2018.
21. Mirko Torrisi, Manaz Kaleel, and Gianluca Pollastri. Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes. October 2018.
22. Zhiyong Wang, Feng Zhao, Jian Peng, and Jinbo Xu. Protein 8-class secondary structure prediction using conditional neural fields. Proteomics, 11(19):3786–3792, October 2011.
23. Daniel W A Buchan and David T Jones. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res., 47(W1):W402–W407, July 2019.
24. Jack Hanson, Kuldip Paliwal, Thomas Litfin, Yuedong Yang, and Yaoqi Zhou.
Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics, 35(14):2403–2410, July 2019.
25. Bin Xue, Roland L Dunbrack, Robert W Williams, A Keith Dunker, and Vladimir N Uversky. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim. Biophys. Acta, 1804(4):996–1010, April 2010.
26. Kang Peng, Predrag Radivojac, Slobodan Vucetic, A Keith Dunker, and Zoran Obradovic. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 7:208, April 2006.
27. Jack Hanson, Kuldip K Paliwal, Thomas Litfin, and Yaoqi Zhou. SPOT-Disorder2: Improved protein intrinsic disorder prediction by ensembled deep learning. Genomics Proteomics Bioinformatics, 17(6):645–656, December 2019.
28. David T Jones and Domenico Cozzetto. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 31(6):857–863, March 2015.
29. Yang Li, Jun Hu, Chengxin Zhang, Dong-Jun Yu, and Yang Zhang. ResPRE: high-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. Bioinformatics, 35(22):4647–4655, November 2019.
30. Manjeet Kumar, Marc Gouw, Sushama Michael, Hugo Sámano-Sánchez, Rita Pancsa, Juliana Glavina, Athina Diakogianni, Jesús Alvarado Valverde, Dayana Bukirova, Jelena Čalyševa, et al. ELM—the eukaryotic linear motif resource in 2020. Nucleic Acids Research, 48(D1):D296–D306, 2020.
31. Michael P Hughes, Michael R Sawaya, David R Boyer, Lukasz Goldschmidt, Jose A Rodriguez, Duilio Cascio, Lisa Chong, Tamir Gonen, and David S Eisenberg. Atomic structures of low-complexity protein segments reveal kinked β sheets that assemble networks. Science, 359(6376):698–701, 2018.
32. P Andrew Chong, Robert M Vernon, and Julie D Forman-Kay. RGG/RG motif regions in RNA binding and phase separation.
Journal of molecular biology, 430(23):4650–4665, 2018.
33. Izzy Owen and Frank Shewmaker. The role of Post-Translational modifications in the phase transitions of intrinsically disordered proteins. Int. J. Mol. Sci., 20(21), November 2019.
34. Roland Riek. The Three-Dimensional structures of amyloids. Cold Spring Harb. Perspect. Biol., 9(2), February 2017.
35. Simon Alberti, Amy Gladfelter, and Tanja Mittag. Considerations and challenges in studying liquid-liquid phase separation and biomolecular condensates. Cell, 176(3):419–434, 2019.
36. Greta Bianchi, Sonia Longhi, Rita Grandori, and Stefania Brocca. Relevance of electrostatic charges in compactness, aggregation, and phase separation of intrinsically disordered proteins. International Journal of Molecular Sciences, 21(17):6208, 2020.
10_1101-2021_02_11_430847 ---- SearcHPV: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer
TITLE: SearcHPV: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer
RUNNING TITLE: SearcHPV: Detecting viral integrations
AUTHORS: Lisa M. Pinatti B.S.1,2*, Wenjin Gu M.S.3*, Yifan Wang PhD4, Ahmed El Hossiny3, Apurva D. Bhangale B.S.2, Collin V. Brummel B.A.2, Thomas E. Carey PhD2,5,6, Ryan E. Mills PhD3,4†, J. Chad Brenner PhD2,5,6†
*L.M. Pinatti and W. Gu should be considered joint first author
†R.E. Mills and J.C.
Brenner should be considered joint senior author
1Cancer Biology Program, Program in the Biomedical Sciences, Rackham Graduate School, University of Michigan, Ann Arbor, MI
2Department of Otolaryngology/Head and Neck Surgery, University of Michigan, Ann Arbor, MI
3Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI
4Department of Human Genetics, University of Michigan, Ann Arbor, MI
5Rogel Cancer Center, Michigan Medicine, Ann Arbor, MI
6Department of Pharmacology, University of Michigan, Ann Arbor, MI
CORRESPONDING AUTHOR: J. Chad Brenner 9301B MSRB3, 1150 W. Medical Center Drive, Ann Arbor, MI 48109 734-763-2761 chadbren@umich.edu
FUNDING STATEMENT: This study was supported by NIH-NCI R01 CA194536 (T.E. Carey and J.C. Brenner), as well as start-up discretionary funds to J.C. Brenner and R.E. Mills from the University of Michigan. L.M. Pinatti was supported by NIH-NCI R01 CA194536.
CONFLICT OF INTEREST: The authors declare that there is no conflict of interest.
AUTHOR CONTRIBUTIONS:
L.M. Pinatti: conceptualization, data curation, formal analysis, investigation, project administration, validation, visualization, writing - original draft, and writing - review and editing.
W. Gu: conceptualization, data curation, formal analysis, investigation, project administration, methodology, resources, software, visualization, writing - original draft, and writing - review and editing.
Y. Wang: data curation, formal analysis, investigation, methodology, resources, and software.
A.D. Bhangale: data curation, formal analysis, investigation, methodology, resources, software, and validation.
C.V. Brummel: data curation, investigation, project administration, and resources.
A. Elhossiny: data curation, investigation and software.
T.E. Carey: conceptualization, funding acquisition, project administration, resources, supervision, and writing - review and editing.
R.E. Mills: conceptualization, funding acquisition, methodology, resources, software, supervision, and writing - review and editing.
J.C. Brenner: conceptualization, funding acquisition, project administration, resources, software, supervision, visualization, and writing - review and editing.
ACKNOWLEDGMENTS: We would like to thank the University of Michigan Advanced Genomics Core for carrying out the targeted capture sequencing and 10X linked read sequencing. We thank Dr. Tom Wilson for discussions of the data.
[The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/). bioRxiv preprint doi: https://doi.org/10.1101/2021.02.11.430847; this version posted February 13, 2021.]
PRECIS: To overcome technical challenges of detecting viral integrations in human papillomavirus-related cancers, we optimized a new pipeline called SearcHPV. Using this tool, we found frequent integration near genes and areas of large structural rearrangements in HPV+ models.
ABSTRACT:
Background: Human papillomavirus (HPV) is a well-established driver of malignant transformation in a number of sites including head and neck, cervical, vulvar, anorectal and penile squamous cell carcinomas; however, the impact of HPV integration into the host human genome on this process remains largely unresolved. This is due to the technical challenge of identifying HPV integration sites, which includes limitations of existing informatics approaches to discover viral-host breakpoints from low read coverage sequencing data.
Methods: To overcome this limitation, we developed a new HPV detection pipeline called SearcHPV based on targeted capture technology and applied the algorithm to targeted capture data. We performed an integrated analysis of SearcHPV-defined breakpoints with genome-wide linked read sequencing to identify potential HPV-related structural variations.
Results: Through analysis of HPV+ models, we show that SearcHPV detects HPV-host integration sites with a higher sensitivity and specificity than two other commonly used HPV detection callers. SearcHPV uncovered HPV integration sites adjacent to known cancer-related genes including TP63 and MYC, as well as near regions of large structural variation.
We further validated the junction contig assembly feature of SearcHPV, which helped to accurately identify viral-host junction breakpoint sequences. We found that viral integration occurred through a variety of DNA repair mechanisms including non-homologous end joining, alternative end joining and microhomology mediated repair.
Conclusions: In summary, we show that SearcHPV is a new optimized tool for the accurate detection of HPV-human integration sites from targeted capture DNA sequencing data.
KEYWORDS: genomics, bioinformatics, papillomavirus infections, virus integration, squamous cell carcinoma, DNA sequence analysis
TOTAL # OF: 1. Text pages: 19 2. Tables: 8 3. Figures: 7 4. Supporting files: 1
INTRODUCTION:
Human papillomavirus (HPV) is a well-established driver of malignant transformation in a number of cancers, including head and neck squamous cell carcinomas (HNSCC). Although HPV genomic integration is not a normal event in the lifecycle of HPV, it is frequently reported in HPV+ cancers [1-4] and it may be a contributor to oncogenesis.
In cervical cancer, HPV integration increases in incidence during progression through cervical intraepithelial neoplasia (CIN) I/II, CIN III and invasive cancer [5]. This process has a variety of impacts on both the HPV and cellular genomes, including disruption of E2, the transcriptional repressor of the HPV oncoproteins, leading to an increase in genetic instability [6]. HPV integration occurs within or near cellular genes more often than expected by chance [7] and has been reported to be associated with structural variations [8]. Recent studies in HNSCCs have also suggested that additional oncogenic mechanisms of HPV integration may exist through direct effects on cancer-related gene expression and generation of hybrid viral-host fusion transcripts [9].
A wide array of methods has previously been used for the detection of HPV integration. Polymerase chain reaction (PCR)-based methods, such as Detection of Integrated Papillomavirus Sequences PCR (DIPS-PCR) [10] and Amplification of Papillomavirus Oncogene Transcripts (APOT) [11], are low-sensitivity assays and are limited in their ability to detect the broad spectrum of genomic changes resulting from this process. Next-generation sequencing (NGS) technologies overcome these limitations. Previous groups have assessed HPV integration within HNSCC tumors in The Cancer Genome Atlas (TCGA) and cell lines by whole-genome sequencing (WGS) [2, 3, 8]. There are a variety of viral integration detection tools developed for WGS data, such as VirusFinder2 [12, 13] and VirusSeq [14]. However, these strategies are designed for a broad range of virus types and require whole genomes to be sequenced at uniform coverage, which can result in a lower sensitivity of detection for specific types of rare viral integration events. To overcome this issue, others have begun to use HPV targeted capture sequencing [5, 15-18]. This strategy allows for better coverage of integration sites than an untargeted approach like WGS but requires sensitive and accurate bioinformatic tools for detecting viral-human fusions, which the field has lacked. In our lab, we have found that previously available viral integration callers have a relatively low validation rate and provide limited structural information surrounding the fusion sites, which impairs mechanistic studies. Therefore, we set out to generate a novel pipeline specifically for targeted capture sequencing data to serve as a new gold standard in the field.
MATERIALS AND METHODS:
Targeted Capture Sequencing: DNA from UM-SCC-47 and PDX-294R was submitted to the University of Michigan Advanced Genomics Core for targeted capture sequencing. Targeted capture was performed using a custom-designed probe panel with high-density coverage of the HPV16 genome, the HPV18/33/35 L2/L1 regions, and over 200 HNSCC-related genes, which are detailed in Heft Neal et al. 2020 [19]. Following library preparation and capture, the samples were sequenced on an Illumina NovaSEQ6000 or HiSEQ4000, respectively, with a 300 nt paired-end run. Data were de-multiplexed and FastQ files were generated.
Novel Integration Caller (SearcHPV):
The SearcHPV pipeline has four main steps, which are detailed below: (1) Alignment; (2) Genome fusion point calling; (3) Assembly; (4) HPV fusion point calling (Figure 1). The package is available on GitHub: https://github.com/mills-lab/SearcHPV.

Alignment
The customized reference genome used for alignment was constructed by catenating the HPV16 genome (from the Papillomavirus Episteme (PaVE) database [20, 21]) and the human reference genome (1000 Genomes Reference Genome Sequence, hs37d5). We aligned paired-end reads from targeted capture sequencing against the customized reference genome using the BWA-MEM aligner [22]. We then performed indel realignment with Picard Tools [23] and GATK [24]. Duplicates were marked with the Picard MarkDuplicates tool [23] for filtering in downstream steps.

Genome Fusion Points Calling
To identify the fusion points, we extracted reads with regions matched to HPV16 and filtered those reads to meet these criteria: (1) not a secondary alignment; (2) mapping quality greater than or equal to 50; (3) not a duplicate. Genome fusion points were called from split reads (reads spanning both the human and HPV genomes) and paired-end reads (reads with one end matched to HPV and the other matched to the human genome) in the surrounding region (+/-300 bp) (Figure 1A). The cut-off criteria for identifying the fusion points were chosen empirically.
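The three read-level criteria above map onto standard SAM fields. The following is a minimal sketch, assuming plain-text SAM records as input; the helper names `passes_filter` and `filter_sam_lines` are hypothetical and are not taken from the SearcHPV code base. The flag bits 0x100 (secondary alignment) and 0x400 (PCR or optical duplicate) are those defined by the SAM format.

```python
# Illustrative sketch only: hypothetical helpers, not the SearcHPV implementation.
from typing import Iterable, Iterator

SECONDARY = 0x100   # SAM flag bit: secondary alignment
DUPLICATE = 0x400   # SAM flag bit: PCR or optical duplicate
MIN_MAPQ = 50       # threshold stated in the text


def passes_filter(flag: int, mapq: int) -> bool:
    """Return True if a read is primary/supplementary, non-duplicate and MAPQ >= 50."""
    if flag & SECONDARY:
        return False
    if flag & DUPLICATE:
        return False
    return mapq >= MIN_MAPQ


def filter_sam_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the names of reads in SAM text lines that pass the filter."""
    for line in lines:
        if line.startswith("@"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])         # column 2: bitwise FLAG
        mapq = int(fields[4])         # column 5: mapping quality
        if passes_filter(flag, mapq):
            yield fields[0]           # column 1: read name
```

In practice the same test would usually be applied through a BAM library or `samtools view` flag filters; the explicit bit checks are shown here only to make the criteria concrete.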
We then clustered the integration sites within 100 bp to avoid double counting of integration events due to the stochastic nature of read mapping and structural variations.

Assembly
To construct longer sequence contigs from individual reads, we extracted supporting split reads and paired-end reads for local assembly from each integration event. Due to the library preparation methods we implemented for the targeted capture approach, some read pairs exhibited an insert size less than 2 x read length, resulting in overlapping read segments. For such events, we first merged these reads using PEAR [25] and then combined them with other individual reads to perform a local assembly with CAP3 [26] (Figure 1).

HPV Fusion Point Calling
For each integration event, the assembly algorithm could report multiple contigs. We developed a procedure to evaluate and select contigs for each integration event to call HPV fusion points more precisely. First, we aligned the contigs against the human genome and the HPV genome separately with BWA mem. If a contig met the following criteria, we marked it as high confidence:
(1) It has at least 10 supportive reads;
(2) $10\% < \frac{\text{matched length of the contig to HPV}}{\text{length of contig}} < 95\%$
We then separated the assembled contigs into two classes: from the left side (Contig A in Fig 1B) and from the right side (Contig B in Fig 1B).
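The two high-confidence criteria above can be written as a single predicate. This is a sketch under stated assumptions: the function name `is_high_confidence` and its parameter names are hypothetical, and the inequalities are taken as strict, as written in criterion (2).

```python
# Illustrative sketch only: hypothetical names, not the SearcHPV implementation.

MIN_SUPPORTING_READS = 10   # criterion (1)
MIN_HPV_FRACTION = 0.10     # lower bound of criterion (2), exclusive
MAX_HPV_FRACTION = 0.95     # upper bound of criterion (2), exclusive


def is_high_confidence(supporting_reads: int,
                       hpv_matched_length: int,
                       contig_length: int) -> bool:
    """Apply criteria (1) and (2) to one assembled contig."""
    if contig_length <= 0:
        return False  # guard against empty contigs (assumption, not in the text)
    hpv_fraction = hpv_matched_length / contig_length
    return (supporting_reads >= MIN_SUPPORTING_READS
            and MIN_HPV_FRACTION < hpv_fraction < MAX_HPV_FRACTION)
```

The upper bound excludes contigs that are almost entirely HPV sequence, and the lower bound excludes contigs with only a marginal HPV match; either extreme would give little evidence of a human-HPV junction.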
For each class, if there were high confidence contigs in the class, we selected the longest among them; otherwise we selected the contig with the most supportive reads. For each insertion event, we reported one contig if it only had contigs from one side and two contigs if it had contigs from both sides (Figure 1C). Finally, we identified the fusion points within HPV based on the alignment results of the selected contigs against the HPV genome. The BAM/SAM file processing in this pipeline was done with Samtools [22] and the analysis was performed with R 3.6.1 [27] and Python [28].

RESULTS:

SearcHPV pipeline:
To overcome the limited ability of WGS to detect rare viral integration events, we performed HPV targeted capture sequencing, which allows for deeper investigation of these events. Currently available bioinformatics pipelines are not designed for this type of data, so we developed a novel HPV integration detection tool for targeted capture sequencing data, which we termed “SearcHPV”. Two HPV16+ HNSCC models, UM-SCC-47 and PDX-294R, were subjected to targeted capture-based Illumina sequencing using a custom panel of probes spanning the entire HPV16 genome. The paired-end reads then went through the four analysis steps of SearcHPV: alignment to the custom reference genome, genome fusion point calling, local assembly and precise fusion point calling (Figure 1).
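The contig selection rule of the final step (longest high confidence contig per side, otherwise the contig with the most supportive reads) can be sketched as follows. This is an illustrative reading of the Methods text, not the SearcHPV source; the `Contig` record and its field names are hypothetical, and the high-confidence flag is assumed to have been computed beforehand.

```python
# Illustrative sketch only: hypothetical data model, not the SearcHPV implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Contig:
    name: str
    side: str               # "left" or "right" (Contig A / Contig B in Fig 1B)
    length: int             # contig length in bp
    supporting_reads: int   # number of supportive reads
    high_confidence: bool   # result of the high-confidence criteria


def select_contigs(contigs: List[Contig]) -> List[Contig]:
    """Pick at most one contig per side for a single integration event."""
    selected = []
    for side in ("left", "right"):
        group = [c for c in contigs if c.side == side]
        if not group:
            continue  # no contig from this side: report only the other side
        confident = [c for c in group if c.high_confidence]
        if confident:
            # Prefer the longest high-confidence contig.
            best = max(confident, key=lambda c: c.length)
        else:
            # Otherwise fall back to the contig with the most supportive reads.
            best = max(group, key=lambda c: c.supporting_reads)
        selected.append(best)
    return selected
```

Ties are broken here by input order, which is an arbitrary choice of this sketch; the text does not specify a tie-breaking rule.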
Analysis of the integration sites in the models using SearcHPV showed a high frequency of HPV16 integration, with a total of six events in UM-SCC-47 and ninety-eight in PDX-294R (Figure 2, Tables S1-S2).

Comparison to other integration callers and confirmation of integration sites:
In addition to SearcHPV, we used two previously developed integration callers, VirusFinder2 and VirusSeq, to independently call integration events in both UM-SCC-47 and PDX-294R (Figure 3, Tables S3-S4). We found that SearcHPV called HPV integration events at a much higher rate than either previous caller. A large number of sites were identified only by SearcHPV (n=76). To assess the accuracy of each caller, we performed PCR on source genomic DNA followed by Sanger sequencing with primers spanning the HPV-human junction sites predicted by the callers (Figure 3C, Figure S1, Table S5). We tested all integration sites with sufficient sequence complexity for primer design (n=46), twenty-five of which were unique to SearcHPV and five of which were unique to VirusSeq. VirusFinder2 does not allow for local assembly of the integration junctions, which rendered us unable to test these sites. Sites unique to SearcHPV had a confirmation rate of 18/25 (72%). The confirmation rate of high confidence SearcHPV sites was higher than that of low confidence sites (23/31 (74%) versus 4/7 (57%)). In contrast, only 2/5 (40%) of the sites unique to VirusSeq could be confirmed.
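The confirmation rates quoted above follow directly from the reported counts. A small sketch that recomputes them is given below; the counts are copied from the text, while the function name `confirmation_rate` and the dictionary labels are only illustrative.

```python
# Illustrative sketch only: recomputes percentages from the counts quoted in the text.

def confirmation_rate(confirmed: int, tested: int) -> float:
    """Percentage of tested sites that were confirmed by PCR and Sanger sequencing."""
    if tested <= 0:
        raise ValueError("number of tested sites must be positive")
    return 100.0 * confirmed / tested


# (confirmed, tested) pairs as reported in the text.
reported_counts = {
    "unique to SearcHPV": (18, 25),
    "SearcHPV high confidence": (23, 31),
    "SearcHPV low confidence": (4, 7),
    "unique to VirusSeq": (2, 5),
}

for label, (confirmed, tested) in reported_counts.items():
    rate = confirmation_rate(confirmed, tested)
    print(f"{label}: {confirmed}/{tested} = {rate:.0f}%")
```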
Localization of integration sites:
We next examined the integration sites detected by SearcHPV. The six integration sites discovered in UM-SCC-47 were clustered on chromosome 3q28 within or near the cellular gene TP63 and involved the HPV16 genes E1, E2 or L1. The integration sites fell within intron 10, intron 12 and exon 14. One additional integration site was 8.6 kb downstream of the TP63 coding region. Within PDX-294R, HPV16 integration sites were identified across 21 different chromosomes, occurring most frequently on chromosome 3. For the 98 integration events of PDX-294R, we identified 142 breakpoints in the HPV genome. The most frequently involved HPV genes were E1 (45/142 (32%)) and L1 (31/142 (22%)). Most of the integration sites mapped within or near (<50 kb) a known cellular gene (89/98 (91%)). Of the sites that fell within a gene, the majority of integrations took place within an intronic region (33/42 (78%)). Although the integration sites were scattered throughout the human genome, we saw examples of closely clustered sites around cancer-relevant genes, including ZNF148 and SNX4 on chromosome 3q21.2, MYC on chromosome 8q24.21 and FOXN2 on chromosome 2p16.3.

Association of integration sites and large-scale duplications
We predicted that the complex integration sites we discovered in UM-SCC-47 and PDX-294R would be associated with large-scale structural alterations of the genome, such as rearrangements, deletions and duplications.
To identify these alterations, we subjected UM-SCC-47 and PDX-294R to 10X linked-read sequencing. We generated over 1 billion reads for each sample (Table S6), with phase blocks (contiguous blocks of DNA from the same allele) of up to 28.9M and 3.8M bases in length for UM-SCC-47 and PDX-294R, respectively (Figure S2). This led to the identification of 444 high confidence large structural events in UM-SCC-47 and 126 events in the PDX-294R model. We then performed an integrated analysis with our SearcHPV results. There was a 130 kb duplication surrounding the integration events in TP63 in UM-SCC-47 (Figure 4A). In PDX-294R, 38/98 (39%) integration sites were within a region that contained a large-scale duplication, while the other 50 integration events fell outside regions of large structural variation. This suggested that in this PDX model, 38/126 (30%) large structural events were potentially induced during HPV integration. For example, the clusters of integration events surrounding ZNF148 and SNX4, MYC, as well as FOXN2 were also associated with large genomic duplications (Figure 4B-C).

Microhomology at junction sites:
Finally, to evaluate possible mechanisms of DNA repair-mediated integration, we examined the degree of sequence overlap between the genomes at each junction site that was covered by contigs. We saw three types of junction points: those with a gap of unmapped sequence between the human and HPV genomes, those that had a clean breakpoint between the genomes, and those with sequence that could be mapped to both genomes (Figure 5A). The majority of junction sites in both samples had at least some degree of microhomology (58%) (Figure 5B-C). Integration
sites with clean breaks (0 bp overlap) and 3 bp of overlap were the most frequently seen junctions in PDX-294R, but a wide range of overlap levels was seen. There was also a large number of junctions with gaps between the human and HPV genomes, ranging from 1 to 54 bp in length.

DISCUSSION:

We developed a novel bioinformatics pipeline that we termed “SearcHPV” and showed that it operated in a more accurate and efficient manner than existing pipelines on targeted capture sequencing data. The software also has the advantage of performing local contig assembly around the junction sites, which simplifies downstream confirmation experiments. We used our new caller to interrogate the integration sites found in two HNSCC models in order to compare the accuracy of our caller to that of the existing pipelines. We then evaluated the genomic effects of these integrations on a larger scale by 10X linked-read sequencing to identify the role of HPV integration in driving structural variation in the tumor genome.

Using SearcHPV, we were able to investigate the HPV-human integration events present in UM-SCC-47 and PDX-294R. Importantly, UM-SCC-47 has been previously assessed for HPV integration by a variety of methods [8, 29-32], which we leveraged as ground truth knowledge to validate our integration caller. All previous studies were in agreement that HPV16 is integrated within the cellular gene TP63, although the exact number of sites and locations within the gene varied by study. In this study, SearcHPV also called HPV integration sites within TP63. We found integrations of E1, E2 and L1 within TP63 intron 10, L1 within intron 12 and E2 within TP63 exon 14. These integration sites were also detected using DIPS-PCR [32] and/or WGS [8] with the exception of E1 into intron 10, which was unique to our caller and confirmed by direct PCR.
It is possible that the integration sites detected in this sample represent multiple fragments of one larger integration site. There were additional sites called by other WGS studies that we did not detect (intron 9 [8] and exon 7 [31]), although it is possible that alternate clonal populations grew out due to different selective pressures in different laboratories. Nonetheless, the analysis clearly demonstrated that SearcHPV was able to detect a well-established HPV insertion site.

In contrast to UM-SCC-47, to our knowledge PDX-294R has not been previously analyzed for viral-host integration sites and therefore represented a true discovery case. We identified widespread HPV integration sites throughout the host genome and also observed that 66% of integration sites were found within or near genes. This aligns with previous reports that integrations are detected in host genes more frequently than expected by chance [2, 3, 7, 33]. One particularly interesting cluster of integration events surrounded the cellular proto-oncogene MYC. Importantly, MYC has been identified as a potential hotspot for HPV integration [7, 34] and the junctions we detected in or near this gene had 2-4 bp of microhomology, potentially driving this observation. Accordingly, an HPV integration-related promoter duplication event, which may be expected to drive expression, would be consistent with a novel genetic mechanism for driving expression of this oncogene.
TP63 has also been reported to be a hotspot for HPV integration, as integration at this locus has been recorded in multiple samples besides UM-SCC-47 [3, 7, 35, 36]. There is a high degree of microhomology between HPV16 and this gene. Given the high frequency of molecular alterations in the epidermal differentiation pathway (e.g. NOTCH1/2, TP63 and ZNF750) in HPV+ HNSCCs, these data support HPV integration as a pivotal mechanism of viral-driven oncogenesis in this model [37].

HPV integration sites have been associated with structural variations in the human genome [3, 8, 37], which supports an additional genetic mechanism as to why HPV integration sites may often be detected adjacent to host cancer-related genes. These structural variation events are thought to be due to the rolling circle amplification that takes place at the integration breakpoint, leading to the formation of amplified segments of genomic sequence flanked by HPV segments [8, 38]. Our data are consistent with these previous reports in that approximately half of the integration events we discovered were associated with a large-scale amplification. It is unclear why only some integration sites were associated with structural variants, but it is possible that an alternative mechanism of integration occurred [38].

Importantly, this observation that HPV integration events tended to be enriched in cellular genes could result from multiple different mechanisms. Integration could occur preferentially in regions of open chromatin during cell replication and keratinocyte differentiation.
Other potential mechanisms are: 1) that HPV integration is directed to specific host genes by homology, or 2) that HPV integration is random, but events that are advantageous for oncogenesis are clonally selected and expanded, implicating non-homology based DNA repair mechanisms. Therefore, to help resolve differences in the mechanism of integration, we assessed microhomology at the HPV-human junction points. The majority of breakpoints had some level of microhomology. The most frequent levels of overlap were 0 and 3 bp, which potentially implicates non-homologous end joining (NHEJ) in repair at these sites, since this pathway most frequently results in 0-5 bp of overlap [39]. There were also a number of junction sites that demonstrated a gap of inserted sequence between the HPV and human genomes. It has been described that during polymerase theta-mediated end joining (TMEJ), stretches of 3-30 bp are frequently inserted at the site of repair, possibly accounting for these sites [40]. However, given the relatively small number of events we examined, we expect that future analysis with our pipeline will help resolve the specific role of each DNA repair pathway in HPV-human fusion breakpoints.

Overall, our new HPV detection pipeline SearcHPV fills a gap in the field of viral-host integration analysis.
While the performance of SearcHPV has only been examined on two models, we expect that future application of this pipeline to large HPV+ cancer tissue cohorts will help advance our understanding of the potential oncogenic mechanisms associated with viral integration. With the emerging set of tools such as SearcHPV, we believe the field is now primed to make major advances in the understanding of HPV-driven pathogenesis, some of which may lead to the development of novel biomarkers and/or treatment paradigms.

REFERENCES:
1. Gao G, Wang J, Kasperbauer JL, et al. Whole genome sequencing reveals complexity in both HPV sequences present and HPV integrations in HPV-positive oropharyngeal squamous cell carcinomas. BMC Cancer. Apr 11 2019;19(1):352. doi:10.1186/s12885-019-5536-1
2. Nulton TJ, Olex AL, Dozmorov M, Morgan IM, Windle B. Analysis of The Cancer Genome Atlas sequencing data reveals novel properties of the human papillomavirus 16 genome in head and neck squamous cell carcinoma. Oncotarget. Mar 14 2017;8(11):17684-17699. doi:10.18632/oncotarget.15179
3. Parfenov M, Pedamallu CS, Gehlenborg N, et al. Characterization of HPV and host genome interactions in primary head and neck cancers. Proc Natl Acad Sci U S A. Oct 28 2014;111(43):15544-9. doi:10.1073/pnas.1416074111
4. Pinatti LM, Sinha HN, Brummel CV, et al. Association of human papillomavirus integration with better patient outcomes in oropharyngeal squamous cell carcinoma. Head Neck. Oct 19 2020. doi:10.1002/hed.26501
5. Tian R, Cui Z, He D, et al. Risk stratification of cervical lesions using capture sequencing and machine learning method based on HPV and human integrated genomic profiles. Carcinogenesis. Oct 16 2019;40(10):1220-1228. doi:10.1093/carcin/bgz094
6. McBride AA, Warburton A. The role of integration in oncogenic progression of HPV-associated cancers. PLoS Pathog. Apr 2017;13(4):e1006211. doi:10.1371/journal.ppat.1006211
7.
Bodelon C, Untereiner ME, Machiela MJ, Vinokurova S, Wentzensen N. Genomic characterization of viral integration sites in HPV-related cancers. Int J Cancer. Nov 01 2016;139(9):2001-11. doi:10.1002/ijc.30243
8. Akagi K, Li J, Broutian TR, et al. Genome-wide analysis of HPV integration in human cancers reveals recurrent, focal genomic instability. Genome Res. Feb 2014;24(2):185-99. doi:10.1101/gr.164806.113
9. Pinatti LM, Walline HM, Carey TE. Human Papillomavirus Genome Integration and Head and Neck Cancer. J Dent Res. Jun 2018;97(6):691-700. doi:10.1177/0022034517744213
10. Luft F, Klaes R, Nees M, et al. Detection of integrated papillomavirus sequences by ligation-mediated PCR (DIPS-PCR) and molecular characterization in cervical cancer cells. Int J Cancer. Apr 01 2001;92(1):9-17.
11. Klaes R, Woerner SM, Ridder R, et al. Detection of high-risk cervical intraepithelial neoplasia and cervical cancer by amplification of transcripts derived from integrated papillomavirus oncogenes. Cancer Res. Dec 15 1999;59(24):6132-6.
12. Wang Q, Jia P, Zhao Z. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS One. 2013;8(5):e64465. doi:10.1371/journal.pone.0064465
13. Wang Q, Jia P, Zhao Z. VERSE: a novel approach to detect virus integration in host genomes through reference genome customization. Genome Med. 2015;7(1):2. doi:10.1186/s13073-015-0126-6
14. Chen Y, Yao H, Thompson EJ, Tannir NM, Weinstein JN, Su X.
VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics. Jan 15 2013;29(2):266-7. doi:10.1093/bioinformatics/bts665
15. Holmes A, Lameiras S, Jeannot E, et al. Mechanistic signatures of HPV insertions in cervical carcinomas. NPJ Genom Med. 2016;1:16004. doi:10.1038/npjgenmed.2016.4
16. Montgomery ND, Parker JS, Eberhard DA, et al. Identification of Human Papillomavirus Infection in Cancer Tissue by Targeted Next-generation Sequencing. Appl Immunohistochem Mol Morphol. Aug 2016;24(7):490-5. doi:10.1097/PAI.0000000000000215
17. Morel A, Neuzillet C, Wack M, et al. Mechanistic Signatures of Human Papillomavirus Insertions in Anal Squamous Cell Carcinomas. Cancers (Basel). Nov 22 2019;11(12). doi:10.3390/cancers11121846
18. Nkili-Meyong AA, Moussavou-Boundzanga P, Labouba I, et al. Genome-wide profiling of human papillomavirus DNA integration in liquid-based cytology specimens from a Gabonese female population using HPV capture technology. Sci Rep. Feb 6 2019;9(1):1504. doi:10.1038/s41598-018-37871-2
19. Heft Neal ME, Bhangale AD, Birkeland AC, et al. Prognostic Significance of Oxidation Pathway Mutations in Recurrent Laryngeal Squamous Cell Carcinoma. Cancers (Basel). Oct 22 2020;12(11). doi:10.3390/cancers12113081
20. NIAID. Papillomavirus Episteme. Bioinformatics and Computational Biosciences Branch. 2020. https://pave.niaid.nih.gov/
21. Van Doorslaer K, Li Z, Xirasagar S, et al. The Papillomavirus Episteme: a major update to the papillomavirus sequence database. Nucleic Acids Res. Jan 4 2017;45(D1):D499-D506. doi:10.1093/nar/gkw879
22. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. Jul 15 2009;25(14):1754-60. doi:10.1093/bioinformatics/btp324
23. Broad Institute. Picard toolkit. Broad Institute GitHub Repository. 2019.
24. McKenna A, Hanna M, Banks E, et al.
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. Sep 2010;20(9):1297-303. doi:10.1101/gr.107524.110
25. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. Mar 1 2014;30(5):614-20. doi:10.1093/bioinformatics/btt593
26. Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. Sep 1999;9(9):868-77. doi:10.1101/gr.9.9.868
27. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. 2019.
28. Van Rossum G, Drake FL. Python 3 Reference Manual: Python Documentation Manual Part 2. CreateSpace Independent Publishing Platform. 2009.
29. Khanal S, Shumway BS, Zahin M, et al. Viral DNA integration and methylation of human papillomavirus type 16 in high-grade oral epithelial dysplasia and head and neck squamous cell carcinoma. Oncotarget. Jul 13 2018;9(54):30419-30433. doi:10.18632/oncotarget.25754
30. Myers JE, Guidry JT, Scott ML, et al. Detecting episomal or integrated human papillomavirus 16 DNA using an exonuclease V-qPCR-based assay. Virology. Nov 2019;537:149-156. doi:10.1016/j.virol.2019.08.021
31. Olthof NC, Huebbers CU, Kolligs J, et al. Viral load, gene expression and mapping of viral integration sites in HPV16-associated HNSCC cell lines. Int J Cancer. Mar 1 2015;136(5):E207-18. doi:10.1002/ijc.29112
32. Walline HM, Goudsmit CM, McHugh JB, et al.
Integration of high-risk human papillomavirus into cellular cancer-related genes in head and neck cancer cell lines. Head Neck. May 2017;39(5):840-852. doi:10.1002/hed.24729
33. Hu Z, Zhu D, Wang W, et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet. Feb 2015;47(2):158-63. doi:10.1038/ng.3178
34. Ferber MJ, Thorland EC, Brink AA, et al. Preferential integration of human papillomavirus type 18 near the c-myc locus in cervical carcinoma. Oncogene. Oct 16 2003;22(46):7233-42. doi:10.1038/sj.onc.1207006
35. Schmitz M, Driesch C, Jansen L, Runnebaum IB, Durst M. Non-random integration of the HPV genome in cervical cancer. PLoS One. 2012;7(6):e39632. doi:10.1371/journal.pone.0039632
36. Walline HM, Komarck CM, McHugh JB, et al. Genomic Integration of High-Risk HPV Alters Gene Expression in Oropharyngeal Squamous Cell Carcinoma. Mol Cancer Res. Oct 2016;14(10):941-952. doi:10.1158/1541-7786.MCR-16-0105
37. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. Jan 29 2015;517(7536):576-82. doi:10.1038/nature14129
38. Groves IJ, Coleman N. Human papillomavirus genome integration in squamous carcinogenesis: what have next-generation sequencing studies taught us? J Pathol. May 2018;245(1):9-18. doi:10.1002/path.5058
39. Pannunzio NR, Li S, Watanabe G, Lieber MR. Non-homologous end joining often uses microhomology: implications for alternative end joining. DNA Repair (Amst). May 2014;17:74-80. doi:10.1016/j.dnarep.2014.02.006
40. Carvajal-Garcia J, Cho JE, Carvajal-Garcia P, et al. Mechanistic basis for microhomology identification and genome scarring by polymerase theta. Proc Natl Acad Sci U S A. Apr 14 2020;117(15):8476-8485.
doi:10.1073/pnas.1921791117

FIGURE LEGENDS:

Figure 1: Workflow of SearcHPV. (A) Paired-end reads from targeted capture sequencing were aligned to a catenated Human-HPV reference genome. After duplicate removal and filtering, fusion points were identified from split reads and paired-end reads. Informative reads were extracted for local assembly. Read pairs that overlapped were merged before assembly. Assembled contigs were aligned to the HPV genome to identify the breakpoints on HPV. (B) Contigs were divided into two classes. The blue solid triangle indicates the matched region of the contig. The grey dashed triangle indicates the clipped region of the contig. Contig A would be assigned to the left group and Contig B to the right group. Contig C would be randomly assigned to the left or right group. (C) Workflow of the contig selection procedure for fusion points with multiple candidate contigs. For each fusion point, we report at least one contig and at most two contigs representing the two directions.

Figure 2: Distribution of breakpoints in the human and HPV genomes called by SearcHPV. (A) Distribution of integration sites in the human genome for PDX-294R. Each bar denotes the count of breakpoints within the region. (B) Links of breakpoints in the human and HPV16 genomes for PDX-294R. (C) Links of breakpoints in the human and HPV16 genomes for UM-SCC-47. (D) Quantification of breakpoint calls in human genes for PDX-294R. (E) Quantification of breakpoint calls in the HPV16 genes for PDX-294R.
(F) Quantification of breakpoint calls in the HPV16 genes for UM-SCC-47.

Figure 3: Comparison of integration sites called by SearcHPV, VirusSeq and VirusFinder2 in both models. (A) Each bar denotes an integration site. The colormap shows the count of the integration sites. (B) Number of integration sites called by each program. (C) PCR confirmation rate of sites called by each program.

Figure 4: Genomic duplications associated with HPV integration in UM-SCC-47 (A) and PDX-294R (B-D). Red arrows indicate integration sites. Each plot shows the number of overlapping barcodes observed in sequencing reads of that region.

Figure 5: Microhomology at junction points. (A) The three types of junction points. (B) Level of microhomology (in bp) in UM-SCC-47. (C) Level of microhomology (in bp) in PDX-294R. Junctions with a gap are shown as negative numbers.

Figure S1: PCR validation gel electrophoresis. Top band of each row shows GAPDH (535 bp), bottom bands represent predicted HPV-human junctions (ranging from 70-250 bp). Red boxes indicate bands that appeared at the correct molecular weight and were validated by Sanger sequencing.

Figure S2: Linked read SNP phase plots for UM-SCC-47 (A) and PDX-294R (B) genomes. Alternating colors represent different phase blocks, which are contiguous blocks of DNA from the same allele based on differential SNP phasing performed by LongRanger software.
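The three junction types of Figure 5, together with the convention that gaps are plotted as negative numbers, can be expressed as a small classifier. This is a sketch under that sign convention only; the function names and the example overlap values are hypothetical and are not data from the study.

```python
# Illustrative sketch only: hypothetical names and made-up example values.
from collections import Counter
from typing import Iterable


def classify_junction(overlap_bp: int) -> str:
    """Classify a junction by signed overlap between the human and HPV alignments.

    Negative values: unmapped gap between the two genomes.
    Zero: clean break.
    Positive values: sequence mappable to both genomes (microhomology).
    """
    if overlap_bp < 0:
        return "gap"
    if overlap_bp == 0:
        return "clean break"
    return "microhomology"


def summarize_junctions(overlaps: Iterable[int]) -> Counter:
    """Count how many junctions fall into each of the three types."""
    return Counter(classify_junction(v) for v in overlaps)


# Hypothetical overlaps (bp) for a handful of junctions.
example_overlaps = [-12, 0, 0, 3, 3, 2, -1]
print(summarize_junctions(example_overlaps))
```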
[Figure 1: workflow diagram, panels A-C. Panel A steps: targeted capture sequencing; BWA mem alignment to HG19 and HPV; duplicate removal and filtering; genome fusion point calling from paired-end and split reads; assembly (merging of read pairs by read length and insertion size, contig assembly); HPV fusion point calling by BWA mem alignment of contigs to HPV type 16.]
[Figure 2: image only.]
[Figure 3: panel A, integration calls (SRCH, VF2, VS; color scale of site counts); panel B, VirusSeq n=16, VirusFinder2 n=41, SearcHPV n=104; panel C, integration confirmation rates, values shown: 72% (18/25), NT, 40% (2/5), 57% (4/7), 40% (2/5), 100% (4/4).]
[Figure 4: panels A-D; annotated duplications of 62 kb, 130 kb, 93 kb and 93 kb.]
[Figure 5: panel A, junction types: gap (- bp overlap), clean break (0 bp overlap), microhomology (+ bp overlap); panel B, microhomology at junction, UM-SCC-47; panel C, microhomology at junction, PDX294.]
[Figure S1: gel images, lanes 1-46, each row labeled GAPDH and HPV-human junction.]
[Figure S2: panels A (UM-SCC-47) and B (PDX-294R).]

10_1101-2021_02_11_430871 ----

ParticleChromo3D: A Particle Swarm Optimization Algorithm for Chromosome and Genome 3D Structure Prediction from Hi-C Data

David Vadnais1, Michael Middleton1, and Oluwatosin Oluwadare1*
1Department of Computer Science, University of Colorado, Colorado Springs, CO, USA.
* Corresponding author
Email: ooluwada@uccs.edu (OO)

Abstract

The three-dimensional (3D) structure of chromatin has a massive effect on its function. Because of this, it is desirable to understand the 3D structural organization of chromatin. To gain greater insight into the spatial organization of chromosomes and genomes and the functions they perform, chromosome conformation capture techniques, particularly Hi-C, have been developed. The Hi-C technology is widely used and well known because of its ability to profile interactions for all read pairs in an entire genome. The advent of Hi-C has greatly expanded our understanding of the 3D genome, genome folding, and gene regulation, and has enabled the development of many 3D chromosome structure reconstruction methods. Here, we propose a novel approach for 3D chromosome and genome structure reconstruction from Hi-C data using a Particle Swarm Optimization approach, called ParticleChromo3D. This algorithm begins with a group of candidate solution locations for each chromosome bin, according to the particle swarm algorithm, and then iteratively moves their positions towards a global best candidate solution. While moving towards the optimal global solution, each candidate solution, or particle, uses its own local best information and a randomizer to choose its path. Using several metrics to validate our results, we show that ParticleChromo3D produces a robust and rigorous representation of the 3D structure for input Hi-C data.
We evaluated our algorithm on simulated and real Hi-C data in this work. Our results show that ParticleChromo3D is more accurate than most of the existing algorithms for 3D structure reconstruction. Our results also show that the structures constructed by ParticleChromo3D are very consistent, hence indicating that it will always arrive at the global solution at every iteration. The source code for ParticleChromo3D, the simulated and real Hi-C datasets, and the models generated for these datasets are available here: https://github.com/OluwadareLab/ParticleChromo3D

Introduction

Chromosome Conformation Capture (3C) and its subsequent derivative technologies are invaluable for describing chromatin's three-dimensional (3D) structure [1]. 3C's biochemical approach to studying DNA's topography within chromatin has outperformed traditional microscopy approaches such as fluorescence in situ hybridization (FISH) due to 3C's systematic nature [2]. As a side note, microscopy is still used in conjunction with 3C for verifying the actual 3D structure of chromatin against the predicted outcome [1]. 3C was first described by Dekker et al. (2002) [3]. Since then, more technologies have been developed [4], such as Chromosome Conformation Capture-on-Chip (4C) [5], Chromosome Conformation Capture Carbon Copy (5C) [6], Hi-C [7], TCC [8], and Chromatin Interaction Analysis by Paired-End Tag sequencing (ChIA-PET) [2,9]. These derivative technologies were designed to augment 3C in the following areas: measuring spatial data within chromatin, increasing measuring throughput, and analyzing proteins and RNA within chromatin instead of just DNA. Lieberman-Aiden et al., 2009 [7] designed Hi-C as a minimally biased "all vs. all" approach. Hi-C works by injecting biotin-labeled nucleotides during the ligation step [4]. Hi-C provides a method for finding genome-wide chromatin interaction frequency (IF) data in the form of a contact matrix [1].
Hi-C analysis has brought great benefit to 3D genome research: it helps explain a series of events such as genome folding, gene regulation, genome stability, and the relationship between regulatory elements and structural features in the cell nucleus [2,7,10]. Importantly, it is possible to glean insight into chromatin's 3D structure using the Hi-C data. However, to use Hi-C data for 3D structure modeling, some pre-processing is necessary to extract the interaction frequencies (IF) between the chromosome or genome's interacting loci [11]. This process involves quality control and mapping of the data [12]. Once these steps are completed, an IF matrix, also called a contact matrix or contact map, is generated. An IF matrix is a symmetric matrix that records a one-to-one interaction frequency for all the intersecting loci [7,10].

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/). This version posted February 13, 2021. bioRxiv preprint doi: https://doi.org/10.1101/2021.02.11.430871

The IF matrix is represented as either a square contact matrix or as a three-column sparse matrix. Each cell in these matrices corresponds to a pair of genomic bins whose length equals the data's resolution [12]. Hence, the higher the resolution (e.g., 5 kb), the larger the contact matrix. Similarly, the lower the resolution (e.g., 1 Mb), the smaller the contact matrix. Next, this Hi-C data is normalized to remove biases that next-generation sequencing can create [12,13]. An example of this type of bias would be copy number variation [13].
Other systematic biases are introduced during the Hi-C experiment by external factors, such as DNA shearing and cutting [10]. Today, several computational algorithms have been developed to remove these biases from the Hi-C IF data [13-20]. Once the Hi-C IF matrix data is normalized, it is most suitable for 3D chromosome or genome modeling. Some tools have been developed to automate these Hi-C pre-processing steps; they include GenomeFlow [21], Hi-Cpipe [22], Juicer [23], HiC-Pro [24], and HiCUP [25].

To create 3D chromosome and genome structures from IF data, many techniques can be used. Oluwadare, O., et al. (2019) [10] pooled the various developed analysis techniques into three buckets: Distance-based, Contact-based, and Probability-based methods. A Distance-based method maps IF data to distance data and then uses an optimizer to solve for the 3D coordinates [12]. The final output of this type of analysis is a set of (x, y, z) coordinates [12]. However, the difficulty lies in choosing how to convert the IF data and which optimization algorithm to use [10]. The distance between two genomic bins is often represented as $Distance_{i,j} = 1/(IF_{i,j}^{\alpha})$ [10,11]. In this approach, $IF_{i,j}$ is the number of times two genomic bins had contact and $\alpha$ is a factor used for modeling, called the conversion factor. These distances can then be optimized jointly across all pairs of genomic bins to create a 3D model. Several methods [10] belong in this category, including ChromSDE [26], AutoChrom3D [27], Chromosome3D [28], 3DMax [29], ShRec3D [30], LorDG [31], InfMod3DGen [32], HSA [33], and ShNeigh [34].

The second classification for 3D genome structure modeling algorithms from IF data is Contact-based methods. This technique uses the IF data directly instead of first converting it to distances [10]. One way to model this data is with a gradient descent/ascent algorithm [10].
This approach was explored by Trieu and Cheng, 2015 through the algorithm titled MOGEN [35]. MOGEN works by optimizing a scoring function that scores how well the chromosomal contact rules have been satisfied [35]. Another contact-based method takes the interaction frequency and uses it for spatial restraints [36]. Gen3D [37], Chrom3D [38], and GEM [39] are other examples in this category.

The third classification is Probability-based. The advantages of probability-based approaches are that they easily account for uncertainties in experimental data and can perform statistical calculations of noise sources or specific structural properties [10]. Unfortunately, probability techniques can be very time-consuming compared to Contact and Distance methods. Rousseau et al., 2011 created the first model in this category using a Markov chain Monte Carlo approach called MCMC5C [40]. Markov chain Monte Carlo was used because it is well suited to estimating the distribution of properties [10]. Varoquaux et al., 2014 [41] extended this probability-based approach to modeling the 3D structure of DNA. They used a Poisson model and maximized a log-likelihood function [41]. Many other statistical models can still be explored.

This paper presents ParticleChromo3D, a new distance-based algorithm for chromosome 3D structure reconstruction from Hi-C data. ParticleChromo3D uses Particle Swarm Optimization (PSO) to generate 3D structures of chromosomes from Hi-C data. Here, we show that ParticleChromo3D can generate candidate structures for chromosomes from Hi-C data. Additionally, we analyze the effects of parameters such as the confidence coefficients and swarm size on the structural accuracy of our algorithm. Finally, we compared ParticleChromo3D to a set of commonly used chromosome 3D reconstruction methods, and it performed better than most of these methods.
We showed that ParticleChromo3D effectively generates 3D structures from Hi-C data and is highly consistent in its modeling performance.

Materials and Methods

The Particle Swarm Optimization Algorithm

Kennedy J., and Eberhart R. (1995) [42] developed Particle Swarm Optimization (PSO) as an algorithm that attempts to solve optimization problems by mimicking the behavior of a flock of birds. PSO has been used in the following fields: antennas, biomedical, city design/civil engineering, communication networks, combinatorial optimization, control, intrusion detection/cybersecurity, distribution networks, electronics and electromagnetics, engines and motors, entertainment, diagnosis of faults, the financial industry, fuzzy logic, computer graphics/visualization, metallurgy, neural networks, prediction and forecasting, power plants, robotics, scheduling, security and military, sensor networks, and signal processing [43-47]. Since PSO has been used in so many disparate fields, it appears to be robust and flexible, which gives credence to the idea that it could be used in this bioinformatics use case and many others [48]. PSO falls into the optimization taxonomy of swarm intelligence [49]. PSO works by creating a set of particles or actors that explore a topology and look for the global minimum of that topology [49]. At each iteration, the swarm stores each particle's best (minimum) result found so far, as well as the global minimum found by the whole swarm.
The particles explore the space with both a position and a velocity, and they change their velocity based on three parameters: the current velocity, the distance to the personal best, and the distance to the global best [49]. Position changes are made based on the calculated velocity during each iteration. The velocity function is as follows [50]:

$$V_{n+1} = w \, V_n + c_1 R_1 (P_n - X_n) + c_2 R_2 (G_n - X_n) \qquad (1)$$

The position is then updated as follows:

$$X_{n+1} = X_n + V_{n+1} \qquad (2)$$

Where:
• $V_n$ is the current velocity at iteration $n$.
• $c_1$ and $c_2$ are two real numbers that stand for the local and global weights; they scale the pull towards the personal best of the specific particle and towards the global best, respectively, at iteration $n$ [50].
• $R_1$ and $R_2$ are randomized values used to increase the explored terrain [50].
• $w$ is the inertia weight parameter, and it determines how much the previous velocity contributes to the new velocity [42].
• $G_n$ represents the best position of the swarm at iteration $n$.
• $P_n$ represents the best position found so far by an individual particle.
• $X_n$ is the current position of an individual particle at iteration $n$.

Why PSO

This project's rationale is that PSO could be a very efficient method for optimizing Hi-C data due to its inherent ability to hold local minima within its particles. This inherent property allows sub-structures to be analyzed for optimality independently of the entire structure. In Fig 1, particle one is at the global best minimum found so far. However, particle two has a better structure in its top half, and it is potentially independent of the bottom half. Because particle one has a better solution so far, particle two will traverse towards the structure in particle one in iteration $n + 1$. While particle two is traversing, it will go along a path that maintains its superior 3D model sections. Thus, it has a higher chance of finding the absolute minimum distance value.
The more particles there are, the greater the time complexity of PSO and the higher the chance of finding the absolute minimum. The inherent breaking up of the problem could lend itself to powerful 3D structure creation results. More abstractly relative to Hi-C data, but in the traditional PSO sense, the same problem as above might look as follows (Fig 2) when presented as a topological map.

Fig 1. PSO potential advantage for structure holding. The figure summarizes the PSO algorithm performance expectation on the 3D genome structure reconstruction problem.

From Fig 2, in the $n$th iteration, particle one found a local minimum within this step. Since this is the lowest point found by any particle, particle two will search towards particle one with a random chance amount added to its velocity [49, 51, 52]. The random chance keeps particle two from going straight to the optimal solution [49,53]. In this case, particle two found the absolute minimum, and from here on, all the particles will begin to migrate towards particle two. We will test this hypothesis by analyzing its output with the evaluation metrics defined in the "Results" section. In summary, we believe the particle-based structure of PSO may lend itself well to the problem of converting Hi-C IF data into 3D models. We will test this hypothesis and compare our results to the existing modeling methods.

Fig 2.
PSO particle iteration description. This figure explains the PSO algorithm's search mechanism for determining the best 3D structure following the individual particles' modified velocity and position in the swarm.

Fig 3. PSO for chromosome and genome 3D structure prediction. We present a step-by-step illustration of the significant steps taken by ParticleChromo3D for 3D chromosome and genome structure reconstruction from an input normalized IF matrix.

PSO for 3D Structure Reconstruction from Hi-C Data

Here we describe how we implemented the PSO algorithm as a distance-based approach for 3D genome reconstruction from Hi-C data. This algorithm is called ParticleChromo3D. In this context, the input IF data is converted to the distance equivalent using the conversion factor, $\alpha$, for 3D structure reconstruction. First, we initialize the particles' 3D (x, y, z) coordinates for each genomic bin or region randomly in the range [-1, 1]. We used the sum of squared errors as the loss function to compute chromosome structures from a contact map. Finally, we used PSO to iteratively improve the loss until it converged on either an absolute or a local minimum. The full ParticleChromo3D algorithm is presented in Fig 3. Some parameters are needed to use the PSO algorithm for 3D structure reconstruction. This work provides the parameter values that produced our algorithm's optimal results. Users can also provide their own settings to fit their data where necessary.
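The reconstruction procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ParticleChromo3D implementation: the function names, the inertia weight `w=0.7`, the handling of zero-IF entries (skipped in the loss), and the exact form of the early-stopping rule are all hypothetical choices made for this sketch. Only the conversion $D = 1/IF^{\alpha}$, the [-1, 1] initialization, the sum-of-squared-errors loss, and the velocity and position updates of Eqs. (1) and (2) come from the text.

```python
import numpy as np

def if_to_distance(if_matrix, alpha=1.0):
    """Convert interaction frequencies to target distances, D = 1 / IF**alpha.
    Zero-IF entries and the diagonal are masked out (an assumption of this sketch)."""
    if_matrix = np.asarray(if_matrix, dtype=float)
    mask = if_matrix > 0
    np.fill_diagonal(mask, False)
    target = np.zeros_like(if_matrix)
    target[mask] = 1.0 / np.power(if_matrix[mask], alpha)
    return target, mask

def pairwise_distances(coords):
    """Euclidean distance between every pair of beads in an (n, 3) structure."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sse_loss(coords, target, mask):
    """Sum of squared errors between structure distances and target distances."""
    return float(((pairwise_distances(coords) - target)[mask] ** 2).sum())

def pso_reconstruct(if_matrix, alpha=1.0, swarm_size=15, w=0.7, c1=0.3, c2=2.5,
                    max_iter=500, threshold=1e-6, seed=0):
    """Run PSO over candidate structures; returns (best_structure, best_loss)."""
    rng = np.random.default_rng(seed)
    target, mask = if_to_distance(if_matrix, alpha)
    n = target.shape[0]
    # Each particle is a full candidate structure: n beads with (x, y, z) in [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=(swarm_size, n, 3))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_loss = np.array([sse_loss(p, target, mask) for p in x])
    g = pbest[pbest_loss.argmin()].copy()
    g_loss = float(pbest_loss.min())
    for _ in range(max_iter):
        r1 = rng.uniform(size=x.shape)
        r2 = rng.uniform(size=x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # Eq. (1)
        x = x + v                                              # Eq. (2)
        loss = np.array([sse_loss(p, target, mask) for p in x])
        improved = loss < pbest_loss
        pbest[improved] = x[improved]
        pbest_loss[improved] = loss[improved]
        prev = g_loss
        best = int(pbest_loss.argmin())
        if pbest_loss[best] < g_loss:
            g, g_loss = pbest[best].copy(), float(pbest_loss[best])
        # Hypothetical stopping rule: the global best improved, but by less than threshold.
        if 0 < prev - g_loss < threshold:
            break
    return g, g_loss
```

Calling `pso_reconstruct(if_matrix)` on a square, symmetric IF matrix returns an `(n, 3)` coordinate array and its loss. Because personal and global bests are only replaced by strictly better candidates, the returned loss can never exceed the best loss in the initial random swarm.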
The results of the series of tests and validation performed to determine the default parameters are described in the "Parameters Estimation" section of the Results section.

Model Representation

A particle is a candidate solution. A list of XYZ coordinates represents each particle in the solution. The candidate solution's length is the number of regions in the input Hi-C data. Each point of a particle is the individual XYZ coordinate of one bead. A swarm consists of N candidate solutions; N is also called the swarm size, which the user provides as program input. We provide more explanation in the "Parameters Estimation" section below for how to determine the swarm size.

Data

Our study used the yeast synthetic (simulated) dataset from Adhikari et al., 2016 [28] to perform parameter tuning and validation. The simulated dataset was created from a yeast structure for chromosome 4 at 50 kb resolution [54]. The number of genomic loci in the synthetic dataset is 610. To analyze a real dataset, we used the GM12878 cell Hi-C dataset, GEO Accession number GSE63525 [55]. The normalized contact matrix was downloaded from the GSDB database with GSDB ID: OO7429SF [56].

Results

Metrics Used for Evaluation

To evaluate the structure's consistency with the input Hi-C matrix, we used the following metrics:

Pearson Correlation Coefficient (PCC)

The Pearson correlation coefficient is as follows [10],

$$PCC = \frac{\sum_i (d_i - \bar{d})(D_i - \bar{D})}{\sqrt{\sum_i (d_i - \bar{d})^2 \, \sum_i (D_i - \bar{D})^2}}$$

Where:
• $D_i$ and $d_i$ are instances of a distance value between two bins.
• $\bar{D}$ and $\bar{d}$ are the means of the distances within the data set.
• It measures the relationship between variables. Values range from -1 to +1.
• A higher value is better.
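As a small illustration of the PCC definition above, the sketch below computes it between the pairwise distances of two structures. The helper names `pairwise_distance_vector` and `pcc` are hypothetical and chosen for this example; they are not taken from the ParticleChromo3D code base.

```python
import numpy as np

def pairwise_distance_vector(coords):
    """Flatten the upper triangle of the pairwise distance matrix of an (n, 3) structure."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    rows, cols = np.triu_indices(len(coords), k=1)
    return dist[rows, cols]

def pcc(d, D):
    """Pearson correlation between two equally long vectors of distances."""
    d = np.asarray(d, dtype=float)
    D = np.asarray(D, dtype=float)
    dc = d - d.mean()
    Dc = D - D.mean()
    return float((dc * Dc).sum() / np.sqrt((dc ** 2).sum() * (Dc ** 2).sum()))
```

Because the PCC is invariant to a positive rescaling of either distance vector, a reconstructed structure that matches the reference up to a uniform scale factor still scores 1, which suits distance-based reconstructions whose absolute scale is arbitrary.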
Spearman Correlation Coefficient (SCC)

Spearman's correlation coefficient is defined below [10],

$$SCC = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

Where:
• $x_i$ and $y_i$ are the ranks of the distances $D_i$ and $d_i$ defined in the PCC equation above.
• $\bar{x}$ and $\bar{y}$ are the sample mean ranks of x and y, respectively.
• Values range from -1 to +1. A higher value is better.

Root Mean Squared Error (RMSE)

Root mean squared error follows the equation below [10],

$$RMSE = \sqrt{\frac{1}{n} \sum_i (d_i - D_i)^2}$$

Where:
• $D_i$ and $d_i$ are instances of distance values from the data and another data source.
• The value $n$ is the size of the data set.

TM-Score

TM-Score is defined as follows [57,58],

$$TM\text{-}score = \max \left[ \frac{1}{L_{Target}} \sum_{i}^{L_{ali}} \frac{1}{1 + \left( \frac{d_i}{d_0 * L_{Target}} \right)^2} \right]$$

Where:
• $L_{Target}$ is the length of the chromosome.
• $d_i$ is an instance of a distance value between two bins.
• $L_{ali}$ represents the count of all aligned residues.
• $d_0$ is a normalizing parameter.

The TM-score is a metric to measure the structural similarity of two proteins or models [57,58]. A TM-score value lies in (0,1], where 1 indicates two identical structures [57]. A score of 0.17 indicates pure randomness, and a score above 0.5 indicates the two structures have mostly the same folds [58]. Hence, the higher, the better.

Parameters Estimation

We used the yeast synthetic dataset to decide on ParticleChromo3D's best parameters. We used this data set to investigate the mechanism for choosing the best alpha conversion factor for input Hi-C data.
We also used it to determine the optimal swarm size, the best threshold value for the algorithm, the inertia value ($w$), and the best coefficients for our PSO velocity ($c_1$ and $c_2$). We evaluated our reconstructed structures by comparing them with the synthetic dataset's true distance structure provided by Adhikari et al., 2016 [28]. We evaluated our algorithm with the PCC, SCC, RMSE, and TM-score metrics. Based on the results from the evaluation, the default values for the ParticleChromo3D parameters are set as presented below:

Conversion Factor Test ($\alpha$)

The synthetic interaction frequency data set was generated from a yeast structure for chromosome 4 at 50 kb [69] with an $\alpha$ value of 1 using the formula $IF = 1/D^{\alpha}$. Hence, this test data lets us check whether our algorithm can recover the alpha value used to produce the synthetic dataset. For both PCC and SCC, our algorithm performed best at a conversion factor (alpha) of 1.0 (Fig 4). By default, our algorithm searches for the best alpha value in the range [0.1, 1.5]. A side-by-side comparison of the true simulated data (yeast) structure and the structure reconstructed by ParticleChromo3D shows that they are highly similar (Fig 5).

Fig 4. A plot of the evaluation metric versus the conversion factors. (A) A plot of SCC vs. conversion factor. (B) A plot of PCC vs. conversion factor. Here, we show the performance of ParticleChromo3D on the SCC and PCC metrics for the simulated dataset at $\alpha$ values in the range 0.1 to 1.9. The best result is recorded at $\alpha = 1$.
The SCC and PCC metric values were obtained by comparing the ParticleChromo3D algorithm's output structure at each $\alpha$ value with the true structure. In Fig 4A and 4B, the Y-axis denotes the SCC and PCC scores, respectively, in the range [-1,1], and the X-axis denotes the conversion factor values. A higher SCC and PCC value is better.

Fig 5. A comparison of the simulated data true structure and the structure reconstructed by ParticleChromo3D. (A) True structure from Duan et al. [54]. (B) Reconstructed structure for the simulated data using ParticleChromo3D.

Swarm Size

The swarm size defines the number of particles in the PSO algorithm. We evaluated the performance of ParticleChromo3D with changes in swarm size (Fig 6A, Fig 6B, Fig 6C). We also evaluated the effect of an increase in swarm size on computation time (Fig 6D). Our results show that computational time increases with increased swarm size. Given the computational implication and the algorithm's performance at various swarm sizes, we defined a swarm size of 15 as our default value for this parameter. According to our experiments, a swarm size of 10 is most suitable if the user's priority is saving computational time, and a swarm size of 20 is suitable when the user prefers algorithm performance over time. Hence, setting the default swarm size to 15 gives us the best of both worlds. The structures generated by ParticleChromo3D also show that the results at swarm size 15 (Fig 7C) and 20 (Fig 7D) are most similar to the simulated data true structure represented in Fig 5A (Fig 7).

Fig 6.
A plot of the evaluation metric versus the Swarm Size parameter. (A) A plot of the SCC vs. the Swarm Size. (B) A plot of PCC vs. the Swarm Size. (C) A plot of RMSE vs. the Swarm Size. (D) A plot of the runtime, in seconds, vs. the Swarm Size. The SCC, PCC, and RMSE values were obtained by comparing the ParticleChromo3D algorithm's output structure with the simulated data true structure. In Fig 6A and Fig 6B, the Y-axis denotes the SCC and PCC scores in the range [-1,1], and the X-axis denotes the Swarm Size values considered. A higher SCC and PCC value is better. In Fig 6C, the Y-axis denotes the RMSE score, and the X-axis denotes the Swarm Size values. A lower RMSE value is better. In Fig 6D, the Y-axis denotes the running time in seconds, and the X-axis denotes the Swarm Size values.

Fig 7. Structures generated by ParticleChromo3D at different swarm size values. Here, we show the structure generated at swarm size = (A) 5 (blue), (B) 10 (magenta), (C) 15 (red), and (D) 20 (green). As shown, the structure generated at swarm size 5 is not smooth; it has a couple of rough edges (Fig 7A). This agrees with the SCC, PCC, and RMSE recorded, as accuracy is lowest at swarm size 5. Next, at swarm size 10 (Fig 7B), we observe a smoother representation, but still with some rough edges. The results at swarm sizes 15 and 20 were very similar (Fig 7C, Fig 7D).
Threshold

The threshold parameter is designed to serve as an early stopping criterion if the algorithm converges before the maximum number of iterations is reached. Hence, we evaluated the effect of varying threshold levels using the evaluation metrics (Fig 8). The output structures generated at each threshold also allow a visual examination of a threshold value (Fig 9). We observed that the lower the threshold, the more accurate the structure is (Fig 8) and the more similar it is to the true simulated data structure in Fig 5A (Fig 9F). It is worth noting that this has a running time implication: reducing the threshold led to a longer running time. However, since this was a trade-off between a superior result with a longer running time and a fairly good result with a short running time, we chose the former for ParticleChromo3D. The default threshold for our algorithm is 0.000001.

Fig 8. A plot of the evaluation metric versus the Threshold parameter. (A) A plot of the SCC versus different threshold levels. (B) A plot of PCC versus the different threshold levels. (C) A plot of RMSE versus different threshold levels. The results show the performance of our algorithm at threshold values 0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001. The SCC, PCC, and RMSE values reported were obtained by comparing the ParticleChromo3D algorithm's output structure to the simulated dataset's true structure. In Fig 8A and 8B, the Y-axis denotes the SCC and PCC scores in the range [-1,1], and the X-axis denotes the Threshold values. A higher SCC and PCC value is better.
In Fig 8C, the Y-axis denotes the RMSE score, and the X-axis denotes the Threshold values. A lower RMSE value is better.

Fig 9. Structures at thresholds of 0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001, respectively. (A) The structure produced using a threshold of 0.1. (B) The structure produced using a threshold of 0.01. (C) The structure produced using a threshold of 0.001. (D) The structure produced using a threshold of 0.0001. (E) The structure produced using a threshold of 0.00001. (F) The structure produced using a threshold of 0.000001. The results showed that the threshold value of 0.000001 (Fig 9F) produced the best result.

Confidence Coefficient (𝑐1 and 𝑐2)

The 𝑐1 and 𝑐2 parameters are the local and global swarm confidence coefficients, respectively. Kennedy and Eberhart, 1995 [42] proposed that 𝑐1 = 𝑐2 = 2. We tested how changes in these values affected our algorithm's accuracy for local confidence coefficient (𝑐1) values of 0.3 to 0.9 and global confidence coefficient (𝑐2) values of 0.1 to 2.8 (S1 and S2 Fig). We found that a local confidence coefficient (𝑐1) of 0.3 with a global confidence coefficient (𝑐2) of 2.5 performed best (Fig 10). Hence, these values were set as ParticleChromo3D's confidence coefficient values. The accuracy results for all local confidence coefficient (𝑐1) values at varying global confidence values are compiled in Fig 11.

Fig 10. Confidence Coefficient Test. (A) A plot of the SCC by Global Confidence at Local Confidence 0.3.
(B) A plot of the PCC by Global Confidence at Local Confidence 0.3. The plots show the results for the local confidence coefficient (𝑐1) = 0.3 against global confidence coefficient (𝑐2) values from 0.1 to 2.8. The best result was obtained at 𝑐2 = 2.5. The SCC and PCC values reported were obtained by comparing the ParticleChromo3D algorithm's output structure with the simulated dataset's true structure. In Fig 10A and 10B, the Y-axis denotes the SCC and PCC scores in the range [-1,1], the X-axis denotes the global confidence values, and the colored plot denotes the local confidence values. Higher SCC and PCC values are better.

Fig 11. A combined plot of different local confidences versus global confidences. The combined plot was obtained by comparing the ParticleChromo3D algorithm's output structure with the simulated dataset's true structure for local confidence values of 0.3 to 0.9 and global confidence values of 0.1 to 2.8. This plot shows the SCC accuracy of the structures generated. The Y-axis denotes the SCC score in the range [-1,1], the X-axis denotes the global confidence values, and the colored plot denotes the local confidence values. A higher SCC value is better.

Random Numbers (𝑅1 and 𝑅2)

𝑅1 and 𝑅2 are uniform random numbers between 0 and 1 [59].

Assessment on simulated data

We evaluated how noise levels affect ParticleChromo3D's ability to predict chromosome 3D structures, using the yeast synthetic dataset from Adhikari et al., 2016 [28].
The data were simulated with varying noise levels. Adhikari et al. introduced noise into the yeast IF matrix to make 12 additional datasets with noise levels of 3%, 5%, 7%, 10%, 13%, 15%, 17%, 20%, 25%, 30%, 35%, and 40%. As reported by the authors, converting these IFs to their distance equivalents produced distorted distances that did not match the true distances, thereby simulating the inconsistent constraints that can sometimes be observed in un-normalized Hi-C data. As shown, our algorithm performed best with no noise in the data (noise level 0) (Fig 12). Furthermore, comparing the ParticleChromo3D output structures from the noisy input datasets with the simulated dataset's true structure shows that it can achieve a competitive result when dealing with un-normalized or noisy Hi-C datasets (Fig 13). The results show that our algorithm can achieve, even at increased noise, results comparable to those obtained at lower noise levels, as indicated by noise levels of 7% (Fig 13B) and 20% (Fig 13C) (Fig 12). Also, the difference in performance between the best structure and the worst structure is ~0.01. Hence, our algorithm is not substantially affected by the presence of noise in the input Hi-C data.

Fig 12. Assessment of the structures generated by ParticleChromo3D for the simulated dataset on varying noise levels. (A) A plot of the SCC versus the Noise level. (B) A plot of the PCC versus the Noise level. This plot shows the SCC and PCC accuracy of the structures generated by ParticleChromo3D at the different noise levels introduced. In Fig 12A and Fig 12B, the Y-axis denotes the metric score in the range [-1,1]. The X-axis denotes the Noise level. Higher SCC and PCC values are better.
Fig 13. Structures generated by ParticleChromo3D at different Noise levels. Here, we show the structures generated by ParticleChromo3D at Noise level = (A) 0, that is, no noise, (B) 7% (70), (C) 20% (200), and (D) 40% (400).

Assessment on Real Hi-C data

For evaluation on real Hi-C data, we used the GM12878 B-lymphoblastoid cell line data by Rao et al., 2014 [55]. The normalized 1MB and 500KB resolution interaction frequency matrices of the GM12878 cell line were downloaded from the GSDB repository under the GSDB ID OO7429SF [56]. The datasets were normalized using the Knight-Ruiz normalization technique [16]. The performance of ParticleChromo3D was determined by computing the SCC value between the distance matrix of the normalized frequency input matrix and the Euclidean distance calculated from the predicted 3D structures. Fig 14 shows the assessment of ParticleChromo3D on the GM12878 cell line dataset. The structure reconstructed by ParticleChromo3D is compared against the expected distances derived from the input IF using the PCC, SCC, and RMSE metrics for the 1MB and 500KB resolution Hi-C data. When ParticleChromo3D's performance is evaluated using both the 1MB and 500KB resolution Hi-C data of the GM12878 cell line, we observe consistency in the algorithm's performance across both datasets. Chromosome 18 had the lowest SCC values of 0.932 and 0.916 at 1MB and 500KB resolutions, respectively, while chromosome 5 had the highest SCC values of 0.975 and 0.966 at 1MB and 500KB resolutions, respectively.
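The evaluation described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' evaluation script. It assumes the common conversion distance = 1 / IF^α for nonzero interaction frequencies, where α is the conversion factor, and it compares only bin pairs with a nonzero IF. The helper names `if_to_distance`, `structure_distances`, `spearman`, and `scc_against_input` are hypothetical.

```python
import numpy as np


def if_to_distance(if_matrix, alpha):
    """Expected distances from interaction frequencies (assumed: d = 1 / IF**alpha)."""
    if_matrix = np.asarray(if_matrix, dtype=float)
    dist = np.zeros_like(if_matrix)
    nonzero = if_matrix > 0
    dist[nonzero] = 1.0 / np.power(if_matrix[nonzero], alpha)
    return dist


def structure_distances(coords):
    """Pairwise Euclidean distances between the (n, 3) coordinates of a structure."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def _average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(_average_ranks(a), _average_ranks(b))[0, 1])


def scc_against_input(if_matrix, coords, alpha=1.0):
    """SCC between IF-derived expected distances and the structure's distances."""
    if_matrix = np.asarray(if_matrix, dtype=float)
    expected = if_to_distance(if_matrix, alpha)
    predicted = structure_distances(coords)
    upper = np.triu_indices(if_matrix.shape[0], k=1)  # each bin pair once
    observed = if_matrix[upper] > 0                   # skip pairs with no contacts
    return spearman(expected[upper][observed], predicted[upper][observed])
```

Because the Spearman correlation depends only on rank order, the value returned by `scc_against_input` is the same for any positive α under this assumed conversion. The choice of α matters for scale-sensitive metrics such as the RMSE and for the objective that the optimizer minimizes.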
Fig 14. Performance evaluation of ParticleChromo3D using SCC values for 1MB and 500KB resolution GM12878 cell Hi-C data. (A) A plot of ParticleChromo3D SCC performance on 1MB GM12878 cell Hi-C data for chromosomes 1 to 23. (B) A plot of ParticleChromo3D SCC performance on 500KB GM12878 cell Hi-C data for chromosomes 1 to 23.

Model Consistency

Next, we assessed the consistency of our generated structures. We created 30 structures for the chromosomes and then evaluated the structures' similarity using the SCC, PCC, RMSE, and TM-Score (Fig 15). We assessed consistency for both the 1MB and 500KB resolution Hi-C data of the GM12878 cell line. For the TM-score, a score of 0.17 indicates pure randomness, and a score above 0.5 indicates that the two structures have mostly the same folds. Hence, the higher, the better. Our results for the selected chromosomes show that the structures generated by ParticleChromo3D are highly consistent for both the 1MB (Fig 15) and 500KB (Fig 16) datasets. As shown in Fig 15 for the 1MB Hi-C datasets, the average SCC and PCC values recorded between the models for the selected chromosomes are >= 0.985 and >= 0.988, respectively, indicating that the chromosomal models generated by ParticleChromo3D are highly similar. The TM-scores also indicate that the algorithm converges to essentially the same 3D model on each run (Fig 15C and Fig 15D). Similarly, as shown in Fig 16, for the 500KB Hi-C datasets, the average SCC and PCC values recorded between the models for the selected chromosomes are >= 0.992.
Fig 15. The model consistency check for 1MB resolution structures generated by ParticleChromo3D using different evaluation metrics. (A) The average SCC between 30 structures per chromosome at 1MB resolution for the GM12878 datasets. (B) The average PCC between 30 structures per chromosome at 1MB resolution for the GM12878 datasets. (C) The average TM-Score between 30 structures per chromosome at 1MB resolution for the GM12878 datasets. (D) The boxplot shows the distribution of the 30 structures' TM-scores by chromosome for the GM12878 datasets. The Y-axis denotes the SCC and PCC metric scores in the range [-1,1], and the TM-Score in the range [0,1]. The X-axis denotes the chromosome. Higher SCC, PCC, and TM-Score values are better.

Fig 16. The model consistency check for 500KB resolution structures generated by ParticleChromo3D using different evaluation metrics. (A) The average SCC between 30 structures per chromosome at 500KB resolution for the GM12878 datasets. (B) The average PCC between 30 structures per chromosome at 500KB resolution for the GM12878 datasets. (C) The average TM-Score between 30 structures per chromosome at 500KB resolution for the GM12878 datasets. (D) The boxplot shows the distribution of the 30 structures' TM-scores by chromosome for the GM12878 datasets.
The Y-axis denotes the SCC and PCC metric scores in the range [-1,1], and the TM-Score in the range [0,1]. The X-axis denotes the chromosome. Higher SCC, PCC, and TM-Score values are better.

Comparison with existing chromosome 3D structure reconstruction methods

Here, we compared the performance of ParticleChromo3D side by side with nine existing high-performing chromosome 3D structure reconstruction algorithms on the GM12878 data set at both the 1MB and 500KB resolutions. The reconstruction algorithms are ChromSDE [26], Chromosome3D [28], 3DMax [29], ShRec3D [30], LorDG [31], GEM [39], HSA [33], MOGEN [35], and PASTIS [41] (Fig 17). According to the SCC values reported, ParticleChromo3D outperformed most of the existing methods on many of the chromosomes evaluated at 1MB and 500KB resolution. At a minimum, ParticleChromo3D ranked among the top two in overall performance among the ten algorithms compared. These results show the robustness and suitability of the PSO algorithm for solving the 3D chromosome and genome structure reconstruction problem.

Fig 17. A comparison of the accuracy of nine existing methods and ParticleChromo3D for 3D structure reconstruction on the 1MB and 500KB real Hi-C dataset. (A) An SCC comparison of 3D structure reconstruction methods on the GM12878 Hi-C dataset at 1MB resolution for chromosomes 1 to 23. (B) An SCC comparison of 3D structure reconstruction methods on the GM12878 Hi-C dataset at 500KB resolution for chromosomes 1 to 23.
The Y-axis denotes the SCC metric score in the range [-1,1], and the X-axis denotes the chromosome. A higher SCC value is better.

Discussion

We discussed the relevance of the Swarm Size value in the Parameters Estimation section. We showed on the synthetic dataset that a Swarm Size (SS) value of 5 did not produce satisfactory performance, although it was the fastest of the swarm sizes considered. At SS = 10, the performance was significantly better than at SS = 5, but at the cost of increased computation time. SS values of 15 and 20 similarly achieved better performance, again at the cost of an increase in the program's running time. We settled on SS = 15 because it achieved one of the best performances at a computational cost that can be considered manageable. To investigate the implications of our choice, we carried out the two tests discussed below.

ParticleChromo3D performance on different Swarm Size values

First, we evaluated the performance of the ParticleChromo3D algorithm on the GM12878 data set at both the 1MB and 500KB resolutions at Swarm Sizes 5, 10, and 15 to check that the performance at SS = 15 observed on the synthetic dataset carries over to the real dataset (Fig 18). The 1MB and 500KB results show that SS = 15 achieved the best SCC value across most chromosomes (Fig 18). However, we observed that the results generated at SS = 10 were also competitive and matched the performance of SS = 15 a few times.
This shows that choosing SS = 10 does not necessarily reduce the performance of ParticleChromo3D, with the additional gain of saving computational time.

Fig 18. ParticleChromo3D SCC performance on Swarm Size values 5, 10, and 15 for 1MB and 500KB GM12878 cell Hi-C data. (A) Comparing the performance of ParticleChromo3D on the 1MB GM12878 cell Hi-C data at Swarm Size values 5, 10, and 15. (B) Comparing the performance of ParticleChromo3D on the 500KB GM12878 cell Hi-C data at Swarm Size values 5, 10, and 15. The Y-axis denotes the SCC metric score in the range [-1,1], and the X-axis denotes the chromosome. A higher SCC value is better.

Computational Time

Second, we evaluated the time it took our algorithm to perform the 3D reconstruction for select chromosomes of the 1MB and 500KB GM12878 cell Hi-C data set. The modeling of the structures generated by ParticleChromo3D for the synthetic and real datasets was done on an AMD Ryzen 7 3800X 8-core processor at 3.89 GHz with 31.9 GB of installed RAM. ParticleChromo3D is multithreaded. It uses each core present on the user's computer to run a specific task, speeding up the modeling process and significantly reducing computational time. Accordingly, the more processor cores a user has, the faster ParticleChromo3D will generate output 3D structures. As mentioned earlier in the Parameters Estimation section, one of the default settings for ParticleChromo3D is to automatically determine the best conversion factor that fits the data in the range [0.1, 1.5].
Even though this is one of ParticleChromo3D's strengths, this process increases the algorithm's computational time. Based on the real Hi-C dataset analysis, our results show that Swarm Size 10 consistently has a lower computational time than SS = 15, as expected, for the 500KB and 1MB Hi-C datasets (Fig 19). These results highlight an additional strength of ParticleChromo3D: it can achieve a competitive result in less time (Fig 19) without a substantial loss of performance (Fig 18). We recommend that users set the Swarm Size to their preferred value depending on their objective. In this manuscript, we favored high accuracy over speed, and we compensated by making our algorithm multi-threaded, which reduces the running time significantly.

Fig 19. ParticleChromo3D Computational Time at Swarm Size (SS) 10 and 15 for 1MB and 500KB GM12878 cell Hi-C data. (A) A comparison of the running time for ParticleChromo3D for select chromosomes for 1MB GM12878 cell Hi-C data. (B) A comparison of the running time for ParticleChromo3D for select chromosomes for 500KB GM12878 cell Hi-C data. The Y-axis denotes the running time for ParticleChromo3D in minutes, and the X-axis denotes the chromosome.

Availability of data and Materials

The models generated, all the datasets used for all analyses performed, and the source code for ParticleChromo3D are available at https://github.com/OluwadareLab/ParticleChromo3D.
Conclusions

We developed a new algorithm for 3D genome reconstruction called ParticleChromo3D. ParticleChromo3D uses the Particle Swarm Optimization algorithm as the foundation of its solution approach for 3D chromosome reconstruction from Hi-C data. The results of ParticleChromo3D on simulated data show that, with fine-tuned parameters, it can achieve high accuracy in the presence of noise. We compared the accuracy of ParticleChromo3D with nine (9) existing high-performing methods or algorithms for chromosome 3D structure reconstruction on the real dataset. The results show that ParticleChromo3D is an effective, high-performing method: it achieved more accurate results than the other methods on many chromosomes and ranked among the top two overall in our comparative analysis. Our experiments also show that ParticleChromo3D can achieve a faster computational run time without losing accuracy significantly. ParticleChromo3D's parameters have been tuned to perform well on any input Hi-C data: it automatically searches for the best conversion factor (𝛼) and uses the tuned PSO hyperparameters for any given input. This algorithm was implemented in Python and can be run as an executable or as a Jupyter Notebook found at https://github.com/OluwadareLab/ParticleChromo3D.

Acknowledgments

Not applicable.

References

1. Sati S, Cavalli G. Chromosome conformation capture technologies and their impact in understanding genome function. Chromosoma. 2017 Feb;126(1):33-44.
2. De Wit E, De Laat W. A decade of 3C technologies: insights into nuclear organization. Genes & development. 2012 Jan 1;26(1):11-24.
3. Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002 Feb 15;295(5558):1306-11.
4. Han J, Zhang Z, Wang K. 3C and 3C-based techniques: the powerful tools for spatial genome organization deciphering. Molecular Cytogenetics. 2018 Dec;11(1):1-0.
5.
Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, De Wit E, Van Steensel B, De Laat W. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4C). Nature genetics. 2006;38(11):1348-54.
6. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, Green RD. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome research. 2006 Oct 1;16(10):1299-309.
7. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009 Oct 9;326(5950):289-93.
8. Kalhor R, Tjong H, Jayathilaka N, Alber F, Chen L. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nature biotechnology. 2012 Jan;30(1):90-8.
9. Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, Ariyaratne PN, Mohamed YB, Ooi HS, Tennakoon C, Wei CL. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome biology. 2010 Feb;11(2):1-3.
10. Oluwadare O, Highsmith M, Cheng J. An overview of methods for reconstructing 3-D chromosome and genome structures from Hi-C data. Biological procedures online. 2019 Dec;21(1):1-20.
11. Pal K, Forcato M, Ferrari F. Hi-C analysis: from data generation to integration. Biophysical reviews. 2019 Feb;11(1):67-78.
12. MacKay K, Kusalik A. Computational methods for predicting 3D genomic organization from high-resolution chromosome conformation capture data. Briefings in functional genomics. 2020 Jul;19(4):292-308.
13. Cournac A, Marie-Nelly H, Marbouty M, Koszul R, Mozziconacci J. Normalization of a chromosomal contact map. BMC genomics. 2012 Dec;13(1):1-3.
14. Servant N, Varoquaux N, Heard E, Barillot E, Vert JP. Effective normalization for copy number variation in Hi-C data. BMC bioinformatics. 2018 Dec;19(1):1-6.
15. Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, Dekker J, Mirny LA. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature methods. 2012 Oct;9(10):999-1003.
16. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis. 2013 Jul 1;33(3):1029-47.
17. Yaffe E, Tanay A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nature genetics. 2011 Nov;43(11):1059.
18. Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, Dekker J, Mirny LA. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature methods. 2012 Oct;9(10):999-1003.
19. Hu M, Deng K, Selvaraj S, Qin Z, Ren B, Liu JS. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics. 2012 Dec 1;28(23):3131-3.
20. Lyu H, Liu E, Wu Z. Comparison of normalization methods for Hi-C data. BioTechniques. 2020 Feb;68(2):56-64.
21. Trieu T, Oluwadare O, Wopata J, Cheng J. GenomeFlow: a comprehensive graphical tool for modeling and analyzing 3D genome structure. Bioinformatics. 2019 Apr 15;35(8):1416-8.
22. Castellano G, Le Dily F, Beato M, Roma G. Hi-Cpipe: a pipeline for high-throughput chromosome capture.
23.
Durand NC, Shamim MS, Machol I, Rao SS, Huntley MH, Lander ES, Aiden EL. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems. 2016 Jul 27;3(1):95-8.
24. Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome biology. 2015 Dec;16(1):1-1.
25. Wingett S, Ewels P, Furlan-Magaril M, Nagano T, Schoenfelder S, Fraser P, Andrews S. HiCUP: pipeline for mapping and processing Hi-C data. F1000Research. 2015;4.
26. Zhang Z, Li G, Toh KC, Sung WK. Inference of spatial organizations of chromosomes using semi-definite embedding approach and Hi-C data. In Annual international conference on research in computational molecular biology 2013 Apr 7 (pp. 317-332). Springer, Berlin, Heidelberg.
27. Peng C, Fu LY, Dong PF, Deng ZL, Li JX, Wang XT, Zhang HY. The sequencing bias relaxed characteristics of Hi-C derived data and implications for chromatin 3D modeling. Nucleic acids research. 2013 Oct 1;41(19):e183.
28. Adhikari B, Trieu T, Cheng J. Chromosome3D: reconstructing three-dimensional chromosomal structures from Hi-C interaction frequency data using distance geometry simulated annealing. BMC genomics. 2016 Dec;17(1):1-9.
29. Oluwadare O, Zhang Y, Cheng J. A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC genomics. 2018 Dec;19(1):1-7.
30. Lesne A, Riposo J, Roger P, Cournac A, Mozziconacci J.
3D genome reconstruction from chromosomal contacts. Nature methods. 2014 Nov;11(11):1141.
31. Trieu T, Cheng J. 3D genome structure modeling by Lorentzian objective function. Nucleic acids research. 2017 Feb 17;45(3):1049-58.
32. Wang S, Xu J, Zeng J. Inferential modeling of 3D chromatin structure. Nucleic acids research. 2015 Apr 30;43(8):e54.
33. Zou C, Zhang Y, Ouyang Z. HSA: integrating multi-track Hi-C data for genome-scale reconstruction of 3D chromatin structure. Genome biology. 2016 Dec;17(1):1-4.
34. Li FZ, Liu ZE, Li XY, Bu LM, Bu HX, Liu H, Zhang CM. Chromatin 3D structure reconstruction with consideration of adjacency relationship among genomic loci. BMC bioinformatics. 2020 Dec;21(1):1-7.
35. Trieu T, Cheng J. MOGEN: a tool for reconstructing 3D models of genomes from chromosomal conformation capturing data. Bioinformatics. 2016 May 1;32(9):1286-92.
36. Kalhor R, Tjong H, Jayathilaka N, Alber F, Chen L. Solid-phase chromosome conformation capture for structural characterization of genome architectures. Nature biotechnology. 2012;30(1):90.
37. Nowotny J, Ahmed S, Xu L, Oluwadare O, Chen H, Hensley N, Trieu T, Cao R, Cheng J. Iterative reconstruction of three-dimensional models of human chromosomes from chromosomal contact data. BMC bioinformatics. 2015 Dec;16(1):1-9.
38. Paulsen J, Sekelja M, Oldenburg AR, Barateau A, Briand N, Delbarre E, Shah A, Sørensen AL, Vigouroux C, Buendia B, Collas P. Chrom3D: three-dimensional genome modeling from Hi-C and nuclear lamin-genome contacts. Genome biology. 2017 Dec;18(1):1-5.
39. Zhu G, Deng W, Hu H, Ma R, Zhang S, Yang J, Peng J, Kaplan T, Zeng J. Reconstructing spatial organizations of chromosomes through manifold learning. Nucleic acids research. 2018 May 4;46(8):e50.
40. Rousseau M, Fraser J, Ferraiuolo MA, Dostie J, Blanchette M. Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC bioinformatics. 2011 Dec;12(1):1-6.
41.
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014 Jun 15;30(12):i26-33.
42. Kennedy J, Eberhart R. Particle swarm optimization. In Proceedings of ICNN'95-international conference on neural networks 1995 Nov 27 (Vol. 4, pp. 1942-1948). IEEE.
43. Garcia-Gonzalo E, Fernandez-Martinez JL. A brief historical review of particle swarm optimization (PSO). Journal of Bioinformatics and Intelligent Control. 2012 Jun 1;1(1):3-16.
44. Li MW, Hong WC, Kang HG. Urban traffic flow forecasting using Gauss–SVR with cat mapping, cloud model and PSO hybrid algorithm. Neurocomputing. 2013 Jan 1;99:230-40.
45. Wang J, Hong X, Ren RR, Li TH. A real-time intrusion detection system based on PSO-SVM. In Proceedings. The 2009 International Workshop on Information Security and Application (IWISA 2009) 2009 (p. 319). Academy Publisher.
46. Mohamed MA, Eltamaly AM, Alolah AI. PSO-based smart grid application for sizing and optimization of hybrid renewable energy systems. PloS one. 2016 Aug 11;11(8):e0159702.
47. Zhang Y, Wang S, Ji G. A comprehensive survey on particle swarm optimization algorithm and its applications. Mathematical Problems in Engineering. 2015 Feb;2015.
48. Mansour N, Kanj F, Khachfe H. Particle swarm optimization approach for protein structure prediction in the 3D HP model. Interdisciplinary Sciences: Computational Life Sciences. 2012 Sep;4(3):190-200.
49. Mohapatra R, Saha S, Dhavala SS. Adaswarm: A novel pso optimization method for the mathematical equivalence of error gradients. arXiv preprint arXiv:2006.09875. 2020 May 19.
50. Bonyadi MR, Michalewicz Z. Particle swarm optimization for single objective continuous space problems: a review. Evolutionary computation. 2017 Mar;25(1):1-54.
51. Wang G, Guo J, Chen Y, Li Y, Xu Q. A PSO and BFO-based learning strategy applied to faster R-CNN for object detection in autonomous driving. IEEE Access. 2019 Feb 4;7:18840-59.
52.
Tu C, Chuang L, Chang J, Yang C. Feature selection using PSO-SVM. International Journal of Computer Science. 2007.
53. Mohapatra R, Saha S, Dhavala SS. Adaswarm: A novel pso optimization method for the mathematical equivalence of error gradients. arXiv preprint arXiv:2006.09875. 2020 May 19.
54. Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, Shendure J, Fields S, Blau CA, Noble WS. A three-dimensional model of the yeast genome. Nature. 2010 May;465(7296):363-7.
55. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014 Dec 18;159(7):1665-80.
56. Oluwadare O, Highsmith M, Turner D, Lieberman-Aiden E, Cheng J. GSDB: a database of 3D chromosome and genome structures reconstructed from Hi-C data. BMC molecular and cell biology. 2020 Dec;21(1):1-0.
57. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics. 2004 Dec 1;57(4):702-10.
58. Xu J, Zhang Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics. 2010 Apr 1;26(7):889-95.
59. Wilke DN. Analysis of the particle swarm optimization algorithm (Doctoral dissertation, University of Pretoria).
Supporting information

S1 Fig. A Plot of local confidence values 0.3 and 0.5 versus global confidence. (A) SCC by Global Confidence at Local Confidence 0.3. (B) PCC by Global Confidence at Local Confidence 0.3. (C) SCC by Global Confidence at Local Confidence 0.5. (D) PCC by Global Confidence at Local Confidence 0.5. Each of the plots shows the SCC and PCC results obtained by comparing the ParticleChromo3D algorithm's output structure with the simulated dataset's true structure for local confidence values 0.3 and 0.5 and global confidence values 0.1 to 2.8. The Y-axis denotes the SCC or PCC score, as labelled in the title, in the range [-1,1]; the X-axis denotes the global confidence values; and the colored plot denotes the local confidence values. Higher SCC and PCC values are better.

S2 Fig. A Plot of local confidence values 0.7 and 0.9 versus global confidence. (A) SCC by Global Confidence at Local Confidence 0.7. (B) PCC by Global Confidence at Local Confidence 0.7. (C) SCC by Global Confidence at Local Confidence 0.9. (D) PCC by Global Confidence at Local Confidence 0.9. Each of the plots shows the SCC and PCC results obtained by comparing the ParticleChromo3D algorithm’s output structure with the simulated dataset’s true structure for local confidence values 0.7 and 0.9 and global confidence values 0.1 to 2.8. The Y-axis denotes the SCC or PCC score, as labelled in the title, in the range [-1,1]; the X-axis denotes the global confidence values; and the colored plot denotes the local confidence values. Higher SCC and PCC values are better.
Abstract Introduction Materials and Methods The Particle Swarm Optimization Algorithm Why PSO PSO for 3D Structure Reconstruction from Hi-C Data Model Representation Data Results Metrics Used for Evaluation Pearson Correlation Coefficient (PCC) Spearman Correlation Coefficient (SCC) Root Mean Squared Error (RMSE) TM-Score Parameters Estimation Conversion Factor Test (α) Swarm Size Threshold Confidence Coefficient (c1 and c2) Random Numbers (R1 and R2) Assessment on simulated data Assessment on Real Hi-C data Model Consistency Comparison with existing chromosome 3D structure reconstruction methods Discussion ParticleChromo3D performance on different Swarm Size values Computational Time Availability of data and Materials Conclusions Acknowledgments References Supporting information

10_1101-2021_02_12_430739 ---- Mutations in bdcA and valS correlate with quinolone resistance in wastewater Escherichia coli

Malekian et al. RESEARCH

Negin Malekian1, Ali Al-Fatlawi1, Thomas U. Berendonk2 and Michael Schroeder1*

Abstract

Single mutations can confer resistance to antibiotics. Identifying such mutations can help to develop and improve drugs. Here, we systematically screen for candidate quinolone resistance-conferring mutations. We sequenced highly diverse wastewater E. coli and performed a genome-wide association study (GWAS) correlating over 200,000 mutations against quinolone resistance phenotypes.
We uncovered 13 statistically significant mutations including one located at the active site of the biofilm dispersal gene bdcA and six silent mutations in the aminoacyl-tRNA synthetase valS. The study also recovered the known mutations in the topoisomerases gyrA and parC. In summary, we demonstrate that GWAS effectively and comprehensively identifies resistance mutations without a priori knowledge of targets and mode of action. The results suggest that bdcA and valS may be novel resistance genes with biofilm dispersal and translation as novel resistance mechanisms.

Keywords: E. coli; Quinolone; Antibiotic Resistance; Genome-Wide Association Study (GWAS)

1 Background

In the sixties, an impurity during the synthesis of the anti-malarial chloroquine led to the discovery of nalidixic acid [1, 2]. Two years after its introduction to the market, resistance was observed, but it took another ten years before the drug’s target and mechanism of action were understood [3]. Subsequently, improved derivatives of nalidixic acid were found, such as norfloxacin and ciprofloxacin and then levofloxacin. Today, there are over 20 fluoroquinolones on the market. Generally, fluoroquinolones act by converting their targets, gyrase (gyrA) and topoisomerase IV (parC), into toxic enzymes that fragment the bacterial chromosome [4]. With the wide use of quinolones, however, bacteria developed resistance through several routes such as increased expression of efflux pumps, which transport drugs outside the bacterial cell, or horizontal gene transfer of resistance genes, whose gene products bind to the quinolone targets [4]. However, the most direct route to resistance is mutations in the drug targets gyrA and parC.
Specifically, changes in the amino acids Ser83 and Asp87 of gyrA and Ser80 of parC confer resistance [4, 5] to quinolones.

* Correspondence: michael.schroeder@tu-dresden.de. 1 Biotechnology Center (BIOTEC), Technische Universität Dresden, Tatzberg 47-49, 01307 Dresden, Germany. Full list of author information is available at the end of the article.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.12.430739; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).

The discovery of these mutations was driven by a deep understanding of the mechanism of action of quinolones. Already over 50 years ago, Crumplin et al. suggested that “a comparative study of [...] mutants and otherwise isogenic bacteria should facilitate identification of the hitherto unknown [...] target” [3], which was at the time not possible on a genome-wide scale. This changed with the advent of deep sequencing technology. Thus, we want to complement the original hypothesis-driven approach to understand resistance [3] with a hypothesis-free, high-throughput approach, in which we systematically evaluate the mutational landscape of resistant and susceptible bacteria.

Instead of investigating the quinolone targets in depth for resistance-conferring mutations, we screen entire bacterial genomes of many isolates and correlate them to patterns of the isolates’ susceptibility and resistance. This approach, termed genome-wide association study (GWAS), rose with the advent of deep sequencing and was initially applied to human genomes and disease phenotypes [6]. Recently, the success of human GWAS sparked interest in microbial GWAS [7, 8].

Genome-wide associations in bacteria are challenging, as clonal reproduction in bacteria leads to population
stratification and a non-random association of alleles at different loci (linkage disequilibrium or LD) [8, 9].

E. coli’s population structure is predominantly clonal, allowing the delineation of major phylogenetic groups, the largest being A (40%), B2 (25%), and B1 and D (both 17%) [10]. Therefore, any model of a genome-wide association study in E. coli should accommodate these groups. Interestingly, the groups also relate to pathogenicity: commensal E. coli, as found e.g. in human intestines, are more likely to belong to A and B1 and pathogenic E. coli to B2 and D.

Generally, E. coli genomes vary in size between 4000 and 5500 genes, of which only half are shared by all E. coli [11]. These genes, which are common to all E. coli, define the core-genome. It can be approximated as the intersection of genes present in a set of genomes. In contrast to the core-genome, the pan-genome is defined as the union of genes in a population. The E. coli pan-genome exceeds 13000 genes and possibly has no limit due to the ability of E. coli to absorb genetic material [11].

Parallel to the core and pan-genome, we coin the core and pan-variome. The former is defined as the intersection and the latter as the union of all mutations across all genomes. Mutations correlating with resistance will, by definition, not be part of the core-variome. Hence, it is important for a genome-wide association study that there is a significant gap in size between core and pan-variome.

A second major challenge besides population stratification is the dependencies of loci (linkage disequilibrium). The mutations in gyrA and parC correlate with each other, as they belong to the same resistance mechanism.
However, following terminology from cancer biology, all of them are driver mutations, which cause clonal expansion, in contrast to passenger mutations, which do not influence the fitness of a clone [12].

Driver mutations may impact clonal expansion directly by changing the amino acid sequence (non-synonymous mutations) and thus protein structure or function. As an example, the gyrA and parC mutations are located at the drug’s binding site and therefore influence binding. Driver mutations may also act indirectly as synonymous mutations without changes to the amino acid sequence. Synonymous mutations may have an effect on splicing, RNA stability, RNA folding, translation, or co-translational protein folding [13]. As an example, Kimchi et al. showed that a synonymous mutation in the multi-drug resistance gene MDR1 altered drug and inhibitor interactions. The authors argue that the reason may be a changed timing of co-translational folding and insertion into the membrane [14]. Thus, a genome-wide association study aiming to uncover novel resistance mechanisms should consider both non-synonymous and synonymous mutations, which are independent of already known mechanisms.

To date, it is not fully understood how antibiotic resistance develops. It is ancient and inherent to bacteria [15] and can therefore be found in the natural environment. But with the wide use of antibiotics, major sources of resistant bacteria are clinics and wastewater [16]. In particular, the latter plays an important role, since treatment plants act as melting pots for bacteria of human, clinical, animal, and environmental origin [16]. The high genetic diversity of a clinical E. coli population was substantially exceeded by a wastewater population [17], which makes wastewater E. coli a suitable source for a GWAS analysis.
In summary, we aim to show that a bacterial genome-wide association study can effectively and comprehensively identify targets relevant to antibiotic resistance. We aim to recover the known mutations in gyrA and parC together with novel candidate mutations. To maximise genomic diversity, we investigate wastewater E. coli. We employ a computational approach and implement variant calling on these genomes and then correlate the identified mutations against resistance levels of four quinolones covering first to third generation (nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin). We apply stringent filtering and cater for missing and rare data, population effects, and dependencies among mutations. Building on gyrA and parC mutations as controls, we expect to characterise the quantity and quality of the mutational resistance landscape. We will answer the question of whether there are resistance mutations beyond gyrA and parC and whether they may open new avenues for future drug discovery.

2 Methods

Sequencing and Phenotyping. Mahfouz et al. collected 1178 E. coli isolates from the inflow and outflow of the municipal wastewater treatment plant in Dresden, Germany. Based on representative resistance phenotypes, the authors selected 103 isolates for sequencing with Illumina MiSeq, 92 of which are available from NCBI’s assembly database (PRJNA380388: https://www.ncbi.nlm.nih.gov/assembly/?term=PRJNA380388) and the rest from the authors. Phage and virus sequences were removed [17].

The unbiased sampling and selection of representative phenotypes were important for the subsequent GWAS analysis, which requires both resistant and susceptible isolates. The isolates were phenotyped using the agar diffusion method, measuring the diameters of the inhibition zones for 20 commonly prescribed antibiotics, including the four quinolones nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin [17].
Variant Calling, Quality Control, and Functional Annotation. Reads were mapped onto E. coli K12 MG1655 with the Burrows-Wheeler Aligner (BWA) v0.7.12 and sorted with Picard v1.105. Variants were called using the genomic analysis toolkit GATK 4.1.1.0 [18] with E. coli K12 MG1655 as reference. We combined them into a single VCF file and re-genotyped them. Next, we filtered variants following standard protocols [19] and settings according to the GATK 4.1.1.0 website (removing SNPs with QD < 2.0, QUAL < 30.0, or FS > 60.0 and INDELs with QD < 2.0, QUAL < 30.0, or FS > 200.0). Variants with low genotype quality (GQ < 20) and variants with > 15% of missing data were removed. After normalisation with BCFtools 1.7 [20], rare variants with minor allele frequency (MAF) < 5% were excluded with Pyseer 1.3.0. Finally, variants were functionally annotated using SnpEff 4.3T [21].

Genome-Wide Association Study (GWAS). We performed a GWAS with Pyseer 1.3.0 [22], using a generalized linear model for each variant. We built a phylogenetic tree from the VCF file with VCF-kit 0.1.6 [23]. Using multidimensional scaling (MDS) on the distances in the phylogenetic tree, we identified and removed four outlier isolates. For the remaining 99 isolates, we drew a scree plot for the eigenvalues of the MDS model and picked four components, which we used as covariates for the regression model to control for population structure. Finally, we calculated a Bonferroni-corrected significance threshold for our GWAS analysis with Pyseer.

Meta-analysis.
We visualized GWAS results with quantile-quantile (QQ) and Manhattan plots using the R package qqman. ROC curves and the area under the curve (AUC) were calculated using the matplotlib and scikit-learn Python packages. We calculated the linkage disequilibrium (LD) between the loci of significant variants using PLINK v1.90b6.10 [24]. The R package LDheatmap [25] was used to visualize LD results. We applied and visualized MDS on the phylogenetic distances between the samples using the cmdscale and scatter3d functions from the stats and plot3d R packages, respectively. We drew a heatmap with dendrogram on the binary matrix of presence/absence of variants for different samples using the heatmap function from the R package stats.

3D structures. The 3D structure of bdcA was retrieved from the Protein Data Bank (PDB; 4PCV). The 3D structure of valS was retrieved from Swiss-model (based on PDB structure 1IVS). The 3D structures were visualized using PyMOL 2.2.0.

Conservation across other bacterial genomes. We retrieved the multiple sequence alignment ENOG501RQ0S for bdcA across all gammaproteobacteria from Eggnog 5.0 [26]. Residue 135 in the ungapped bdcA sequence corresponds to position 207 in the gapped multiple sequence alignment.

Frequency across other E. coli genomes. To check the frequency of bdcA G135S in other E. coli genomes, we downloaded 1340 E. coli genomes from NCBI (https://www.ncbi.nlm.nih.gov/) (accessed on 27th of October 2020) and identified the locus in each genome by searching for an exact match of the ten-nucleotide sequence ATTCACGGAG, which follows the locus of the bdcA mutation and which is conserved across all the retrieved genomes.

3 Results

We aimed to identify mutations that correlate with quinolone resistance. After extracting raw variants from 99 wastewater E. coli genomes, we proceeded in two steps: First, we reduced raw to high-quality and then high-quality to highly significant variants.
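The first of these two steps rests on the filter thresholds given in Methods. As a minimal sketch, assuming each variant has already been summarised into the annotation values named there, the hard filters, the missingness cut-off, and the MAF cut-off could be combined as below. The `Variant` record and its field names are hypothetical and are not GATK or Pyseer data structures; the per-genotype GQ < 20 filter and the normalisation step are omitted.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant summary; field names are illustrative only."""
    kind: str       # "SNP" or "INDEL"
    qd: float       # QD annotation
    qual: float     # QUAL value
    fs: float       # FS annotation
    missing: float  # fraction of isolates without a genotype call
    maf: float      # minor allele frequency across isolates


# FS thresholds from Methods: 60.0 for SNPs, 200.0 for INDELs.
FS_MAX = {"SNP": 60.0, "INDEL": 200.0}


def passes_hard_filter(v: Variant) -> bool:
    """A variant is removed if QD < 2.0, QUAL < 30.0, or FS exceeds its limit."""
    return not (v.qd < 2.0 or v.qual < 30.0 or v.fs > FS_MAX[v.kind])


def is_high_quality(v: Variant) -> bool:
    """Hard filters, then at most 15% missing data, then MAF of at least 5%."""
    return passes_hard_filter(v) and v.missing <= 0.15 and v.maf >= 0.05


if __name__ == "__main__":
    candidates = [
        Variant("SNP", qd=12.0, qual=250.0, fs=3.1, missing=0.02, maf=0.21),
        Variant("SNP", qd=12.0, qual=250.0, fs=75.0, missing=0.02, maf=0.21),
        Variant("INDEL", qd=12.0, qual=250.0, fs=75.0, missing=0.02, maf=0.03),
    ]
    kept = [v for v in candidates if is_high_quality(v)]
    print(len(kept), "of", len(candidates), "variants kept")
```

Expressing the filters as "keep" predicates rather than "remove" conditions makes it easy to count how many variants each step discards.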
From raw to high-quality variants. From the genomes, we extracted 457,554 raw variants, which we subjected to five quality control steps resulting in 206,633 high-quality variants. Removing rare variants, which appear in fewer than 5% of isolates, led to the greatest reduction, of nearly 50% (Table 1).

The pan- and core-variome. For a genome-wide association study, it is vital that the mutations spread across the isolates. To characterise the distribution and diversity of the high-quality mutations, we computed the core and the pan-variome (see Figure 2). The core-variome reflects the number of variants shared by a given number of genomes. In contrast, the pan-variome consists of the union of all variants, thus reflecting the total diversity of variants present in all genomes. As expected, the pan-variome grows fast and the core-variome tails off fast. For 20 genomes, the pan-variome already consists of some 256,000 variants, while the core-variome is reduced to some 600 variants. This means that there are only very few variants that are shared across many or even all of the genomes. Similarly, the graph for the pan-variome continually grows. Each added genome contributes new variants until the pan-variome reaches 413,283 variants (206,633 high-quality plus 206,650 rare variants) in total. Overall, the distribution of variants is thus suitable for GWAS.
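The pan- and core-variome curves described above can be sketched as running unions and intersections over per-genome variant sets. This is an illustrative sketch only: the function name `pan_core_curves` and the toy variant IDs are hypothetical, and the order in which genomes are added is an assumption, since the text does not state how it was chosen.

```python
def pan_core_curves(variant_sets):
    """Return (pan_sizes, core_sizes) as genomes are added one at a time.

    variant_sets: list of sets of variant identifiers, one set per genome,
    given in the order in which the genomes are added.
    """
    pan = set()    # union of all variants seen so far
    core = None    # intersection of all variants seen so far
    pan_sizes, core_sizes = [], []
    for variants in variant_sets:
        pan |= variants
        core = set(variants) if core is None else core & variants
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


if __name__ == "__main__":
    # Toy example with three genomes and five hypothetical variants.
    genomes = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
    pan_sizes, core_sizes = pan_core_curves(genomes)
    print(pan_sizes)   # [3, 4, 5]
    print(core_sizes)  # [3, 2, 1]
```

In the toy example the union grows with every genome while the intersection shrinks, which is the same qualitative behaviour the text reports for the real data.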
From high-quality to highly significant variants. Next, we carried out a GWAS correlating the high-quality variants against resistance levels of the four quinolones investigated (nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin). Two aspects were important: We wanted to control for population structure and ensure the independence of the novel mutations from the known resistance-conferring mutations.

To assess the control of the study over the population structure, we plotted p-values expected under randomness against observed p-values (see QQ plots in Figure 3). The plots confirm that the correction for population structure was satisfactory, as a deviation from the null hypothesis (the identity line) is only evident at the tail of the plots.

Next, we visualized the results of the GWAS using Manhattan plots, which reveal that there are some highly significant variants passing the rigorous Bonferroni-corrected p-value threshold (the horizontal line). To confirm the level of significance, we evaluated how well these variants predict resistance. To this end, we plotted a receiver operating characteristic (ROC) curve and calculated the area under the curve (AUC) as a measure of predictive performance. The AUC for most of the significant variants was above 90% (see Figure 3), reflecting that the identified variants predict resistance very accurately.

Summary statistics of the GWAS analysis. In total, we obtained 13 highly significant variants, three in gyrA and parC and ten novel candidate variants in the five genes bdcA, valS, lptG, lptF, and ivy. The variant in bdcA leads to an amino acid change, while the remaining nine do not. Across all four quinolones, the mutations in gyrA and parC ranked highest, thus confirming the validity of the approach taken (Table 2). As shown in the table, the frequency and effect sizes of the novel candidate variants are on a par with the positive controls.
This means that the existence of an effect (p-value) and the size of the effect (beta) are both given. While all variants pass the Bonferroni-corrected p-value threshold (5.21E-07), the positive controls exceed it very substantially (Table 3).

Novel candidate variants are independent of controls. To check the independence of the significant variants from one another, we measured the linkage disequilibrium (LD) for the loci of these variants (see Figure 4). The known quinolone resistance-conferring variants, gyrA S83L, gyrA D87N, and parC S80I, are in LD. They are located at the drugs’ binding sites to gyrA and parC and ensure the correct function of the gene products despite treatment.

The known resistance-conferring variants are not in LD with the ten novel loci, which suggests that the novel loci confer resistance by a different mechanism from gyrA and parC. Among the novel loci, there are dependencies. In particular, the non-synonymous variant in bdcA is in LD with synonymous mutations in valS. This may mean that these novel variants act in a shared mechanism, which raises the question of whether the biological functions of the novel loci can be linked to antibiotic resistance.

Biological function of bdcA. The bdcA gene plays a role in biofilm dispersal [27, 28] and generally, biofilm formation increases antimicrobial resistance [29, 30]. It could be hypothesised that a variant in this gene disrupts biofilm dispersal and leads to biofilm formation and resistance. However, while this may happen in nature, it is unclear whether this effect is also present in the disk diffusion assay underlying the present data. This gene is present in nearly all isolates (85-90% in our data and NCBI data), which means that it is close to being a core gene, but that it is not essential for survival.

Biological function of valS. The valS gene product is an aminoacyl-tRNA synthetase (aaRS), which charges the tRNA for valine with the amino acid valine.
The aaRS enzymes are promising targets for antimicrobial development [31, 32] as targeting them can inhibit the translation process, cell growth, and finally cell viability. Although aaRS enzymes are not known as direct quinolone targets, there is evidence that non-synonymous mutations in aaRS enzymes increase ciprofloxacin resistance by upregulating the expression of efflux pumps [33]. In our data, the synonymous valS mutations fell just short of the significance threshold for ciprofloxacin; for levofloxacin and norfloxacin, they passed it. valS provides a very basic function and is a core gene present in all isolates.

Biological function of ivy. The gene product of ivy is a strong inhibitor of lysozyme C. Expression of ivy protects porous cell-wall E. coli mutants from the lytic effect of lysozyme, suggesting that it is a response against the permeabilizing effects of the innate vertebrate immune system. As such, ivy acts as a virulence factor for a number of gram-negative bacteria infecting vertebrates [34].

Biological function of lptG and lptF. The gene products of lptG and lptF are part of the ABC transporter complex LptBFG involved in the translocation of lipopolysaccharide from the inner membrane to the outer membrane. Thus, there is no direct connection to antibiotic resistance; however, the link to transport is in line with other resistance mechanisms such as increased expression of efflux pumps [35].

Structural Analysis of bdcA and valS.
To shed more light on the possible causality of the GWAS candidate variants, we explored their protein structures (Figure 5). The variant Gly135Ser in bdcA is in the vicinity of the active site residues Ser132 and Tyr146 [27]. Serine is bigger than glycine and it may influence a loop formed by the residues 136-144 and thus regulate the active site, which may influence biofilm dispersal.

In valS, the identified variants are synonymous and thus have no direct impact on the structure of the protein. However, for some loci, there were non-synonymous variants such as D452E. Therefore, we wanted to understand where the valS mutations are located in the 3D structure. Figure 5 shows the structure of a model for valS in E. coli, which was generated by Swiss-model based on a template from Thermus thermophilus. The model reveals that the valS mutations are on the surface of the protein.

Variant bdcA G135S with respect to other antibiotics, other E. coli, and other bacterial sequences. For the non-synonymous variant bdcA G135S, we wanted to understand whether its role in antibiotic resistance is limited to quinolones or not. For 16 other antibiotics [17], there are variants that correlate significantly with resistance (data not shown). For all of these antibiotics but tobramycin, the bdcA mutation is not significant. This suggests that bdcA G135S may act independently of fluoroquinolones, which would be consistent with biofilm formation being a general mechanism independent of fluoroquinolones.

Next, we wanted to know whether the prevalence of bdcA G135S in our data is representative of other E. coli genomes. In 1340 complete E. coli genomes available from the NCBI, we could find the bdcA gene in 1209 genomes and bdcA G135S in 24. Thus, about 2% of genomes carry this mutation, which is slightly less than, but comparable to, the 5% present in our data.

BdcA is present in other bacteria. We investigated gammaproteobacteria, which comprise pseudomonadaceae besides enterobacteria.
We analysed 152 bdcA sequences retrieved from Eggnog 5.0 and found alanine most frequently (65%) and glycine less frequently (24%). Serine appeared in 2% of the species, which may mean that the resistance mechanism is not limited to E. coli.

Phylogenetic groups. A key ingredient of the GWAS model is the population structure. We applied dimension reduction and hierarchical clustering to isolates represented as high-dimensional binary vectors, where each dimension corresponds to one of the 206,633 mutations. We identified four clusters (Figure 6), which broadly correspond to phylogenetic groups A, B1, B2, and D. Thus, our GWAS model correctly caters for the main E. coli lineages.

4 Discussion and Conclusion

It took over a decade to move from the discovery of nalidixic acid to the discovery of its target and mechanism of action. Here, we have shown that sequencing and phenotyping data of a small number of genomes from a single site are sufficient for a GWAS model to reveal the quinolone targets with a very high statistical significance. Furthermore, the GWAS model revealed ten new mutations, which correlate significantly with quinolone resistance. A key to the success of the GWAS model was an unbiased sampling of isolates, which contained resistant and susceptible isolates.

The most promising mutation is G135S in the biofilm dispersal gene bdcA, which is present in nearly all isolates, but which is not essential for E. coli survival [36]. Mapping the bdcA mutation onto a protein structure of bdcA revealed its location on the surface of the protein and close to the active site. Hence, this suggests an impact on enzymatic activity, which may influence biofilm dispersion and hence indirectly relate to antibiotic resistance. In fact, Ma et al. could show that E. coli bdcA controls biofilm dispersal in Pseudomonas aeruginosa [37], which was the most abundant gammaproteobacterium containing bdcA in our analysis. This indicates that mutations in E.
coli bdcA may act indirectly on antibiotic resistance. If, consequently, bdcA emerges as a novel drug target, then the next steps in drug development could target the active site with residues S132 and Y146, which are in direct proximity to the mutation bdcA G135S. Importantly, bdcA G135S is a novel candidate resistance mutation as it is not in LD with the known mutations in gyrA and parC.

We found bdcA G135S in 5% of the analysed genomes, which appears in line with a prevalence of 2% in 1209 other E. coli genomes obtained from the NCBI. We also checked the presence of these mutations in other gammaproteobacteria and revealed that bdcA is present and well conserved, but that the mutation appears specific to E. coli. Furthermore, we also checked whether bdcA G135S correlates with resistance to non-quinolone antibiotics. This was the case for tobramycin, an aminoglycoside, but not for all other examined antibiotics. Isolates with the bdcA G135S mutation belonged to the phylogenetic group A, which is less likely to contain pathogenic isolates. Phylogroup A is equally abundant in human faeces and wastewater [38], which may point to an origin of the mutation in a human rather than a natural environment.

Besides bdcA G135S, we found nine mutations, which are synonymous, whose mechanism of action is likely to be indirect.
Most interesting are the abundant mutations in the aminoacyl-tRNA synthetase valS, which has an essential role in protein synthesis and which is part of the core-genome and is therefore present in all isolates. Furthermore, it is classified as an essential gene [36]. As an aaRS, it may be a suitable drug target [39] due to the evolutionary divergence between prokaryotic and eukaryotic enzymes, high conservation across different bacterial pathogens, as well as solubility, stability, and ease of purification. However, since the mutations in valS were synonymous, they will not exert a direct structural or functional effect on their gene product but may act indirectly.

In summary, bdcA G135S and the discovered silent mutations correlate statistically significantly with quinolone resistance in wastewater E. coli. They appear to be mostly specific to E. coli and to quinolones and independent of known resistance-conferring mutations. Further research is needed to corroborate the correlation between these mutations and quinolone resistance and to shed light on the molecular mechanism leading to resistance.

Figure 1: Wastewater E. coli were phenotyped and sequenced. Variants were called and correlated to quinolone resistance in a GWAS study resulting in novel candidate resistance mutations. (Diagram labels: E. coli wastewater, resistance phenotyping, sequencing, variant calling, GWAS, bdcA mutation(s), valS mutation(s).)
Table 1: Quality control (QC): Reduction of some 457,000 raw variants to 206,633 high-quality variants. The minor allele frequency (MAF) filter, which removes rare variants, is the main filter.

Step                               Change   Mutations
1. Variant calling                          457,554
2. Hard filters                    -2%      449,017
3. GQ filter and missingness       -15%     382,922
4. Normalisation by allele         +8%      413,283
5. Minor allele frequency (MAF)    -50%     206,633

Figure 2: Pan-variome (union of variants) and core-variome (intersection of variants) of 206,633 high-quality and 206,650 rare variants (413,283 in total). Most variants appear only in a few of the isolates. (Panels: a) Pan-variome, b) Core-variome; x-axis: number of genomes; y-axis: number of variants.)
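The MAF step in Table 1 can be illustrated with a minimal Python sketch. It assumes haploid genotypes coded as 0 (reference) and 1 (alternative) per isolate; the function names, the toy variants, and the 5% cutoff are illustrative assumptions and are not taken from the study.

```python
def minor_allele_frequency(genotypes):
    """Frequency of the less common allele among called genotypes.

    genotypes: list of 0/1 calls per isolate; None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / len(called)
    return min(alt_freq, 1.0 - alt_freq)


def maf_filter(variants, min_maf=0.05):
    """Keep variants whose MAF is at least min_maf (the cutoff is an assumption)."""
    return {name: calls for name, calls in variants.items()
            if minor_allele_frequency(calls) >= min_maf}


# Toy data: 40 isolates and three hypothetical variants.
variants = {
    "common": [1] * 16 + [0] * 24,      # MAF = 0.40, kept
    "mostly_alt": [1] * 36 + [0] * 4,   # MAF = 0.10, kept
    "singleton": [1] + [0] * 39,        # MAF = 0.025, removed as rare
}

kept = maf_filter(variants)
print(sorted(kept))
```

Taking the minimum of the alternative and reference frequencies means a variant is treated as rare whichever allele is the uncommon one.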
Figure 3: GWAS analysis for a) levofloxacin, b) norfloxacin, c) ciprofloxacin, and d) nalidixic acid. Left: QQ plots of observed vs. expected p-values show a few highly significant p-values. Middle: Manhattan plots of chromosomal position vs. p-value show mutations passing the Bonferroni-corrected threshold as dots above the red line. Right: Areas under the ROC curves show that the significant mutations predict resistance well (most AUC > 90%). (Axes: expected vs. observed -log10(p-value); position vs. observed -log10(p-value); false positive rate vs. true positive rate. Labelled mutations: a) gyrA D87N, parC S80I, gyrA S83L, valS R733, valS N815, valS E874, bdcA G135S, lptG V106, lptF Q197, other valS mutations; b) as in a) plus ivy T123; c) gyrA D87N, parC S80I, gyrA S83L; d) gyrA S83L.)

Table 2: Mutations significantly correlating with quinolone resistance. Freq. is the relative frequency among isolates and Beta the effect size.
Effect size is similar for all, p-values differ.

Quinolone       Position  Allele  Gene  Effect  Freq.  Beta   SE    Call rate  P-value   AUC
Levofloxacin    3165735   A       parC  S80I    0.08   -1.56  0.20  100%       2.43E-12  97%
                2339162   T       gyrA  D87N    0.08   -1.56  0.20  100%       2.43E-12  97%
                2339173   A       gyrA  S83L    0.15   -1.20  0.16  99%        4.47E-12  91%
                4473651   T       bdcA  G135S   0.05   -1.58  0.29  90%        1.35E-07  91%
                4481639   A       valS  R733    0.07   -1.15  0.24  100%       4.09E-09  93%
                4481393   A       valS  N815    0.12   -1.11  0.20  100%       6.79E-08  84%
                4481216   T       valS  E874    0.16   -1.61  0.29  100%       7.09E-08  59%
                4482482   A       valS  D452    0.05   -1.58  0.29  100%       1.35E-07  91%
                4482443   A       valS  V465    0.05   -1.58  0.29  100%       1.35E-07  91%
                4482440   T       valS  L466    0.05   -1.58  0.29  100%       1.35E-07  91%
                4486808   A       lptF  Q197    0.05   -1.58  0.29  100%       1.35E-07  91%
                4487635   A       lptG  V106    0.05   -1.58  0.29  100%       1.35E-07  91%
Norfloxacin     3165735   A       parC  S80I    0.08   -2.29  0.22  100%       1.10E-18  98%
                2339162   T       gyrA  D87N    0.08   -2.29  0.22  100%       1.10E-18  98%
                2339173   A       gyrA  S83L    0.15   -1.59  0.19  99%        9.25E-14  93%
                4473651   T       bdcA  G135S   0.05   -2.01  0.36  90%        7.56E-08  91%
                4481639   A       valS  R733    0.07   -1.85  0.30  100%       5.24E-09  92%
                4481216   T       valS  E874    0.16   -2.03  0.35  100%       4.36E-08  55%
                4481393   A       valS  N815    0.12   -1.39  0.25  100%       5.40E-08  82%
                4482482   A       valS  D452    0.05   -2.01  0.36  100%       7.56E-08  91%
                4482443   A       valS  V465    0.05   -2.01  0.36  100%       7.56E-08  91%
                4482440   T       valS  L466    0.05   -2.01  0.36  100%       7.56E-08  91%
                4486808   A       lptF  Q197    0.05   -2.01  0.36  100%       7.56E-08  91%
                4487635   A       lptG  V106    0.05   -2.01  0.36  100%       7.56E-08  91%
                240711    T       ivy   T123    0.05   -2.00  0.36  100%       1.04E-07  91%
Ciprofloxacin   3165735   A       parC  S80I    0.08   -1.90  0.25  100%       7.37E-12  97%
                2339162   T       gyrA  D87N    0.08   -1.90  0.25  100%       7.37E-12  97%
                2339173   A       gyrA  S83L    0.15   -1.22  0.22  99%        7.13E-08  87%
Nalidixic acid  2339173   A       gyrA  S83L    0.15   -1.57  0.24  99%        1.32E-09  90%

Table 3: Ranking of mutations significantly correlating with quinolone resistance.
                                  Levofloxacin    Norfloxacin     Ciprofloxacin   Nalidixic acid
Position  Allele  Gene  Effect    Rank / P-value  Rank / P-value  Rank / P-value  Rank / P-value
3165735   A       parC  S80I      1 / 2.43E-12    1 / 1.10E-18    1 / 7.37E-12    7 / 1.82E-05
2339162   T       gyrA  D87N      1 / 2.43E-12    1 / 1.10E-18    1 / 7.37E-12    7 / 1.82E-05
2339173   A       gyrA  S83L      3 / 4.47E-12    3 / 9.25E-14    3 / 7.13E-08    1 / 1.32E-09
4473651   T       bdcA  G135S     7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
4481639   A       valS  R733      4 / 4.09E-09    4 / 5.24E-09    4 / 8.50E-07    182 / 8.00E-04
4481393   A       valS  N815      5 / 6.79E-08    6 / 5.40E-08    6 / 3.90E-06    405 / 2.14E-03
4481216   T       valS  E874      6 / 7.09E-08    5 / 4.36E-08    8 / 5.75E-06    162 / 6.67E-04
4482482   A       valS  D452      7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
4482443   A       valS  V465      7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
4482440   T       valS  L466      7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
4486808   A       lptF  Q197      7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
4487635   A       lptG  V106      7 / 1.35E-07    7 / 7.56E-08    10 / 1.00E-05   152 / 6.44E-04
240711    T       ivy   T123      68 / 4.69E-05   13 / 1.04E-07   9 / 9.07E-06    395 / 2.08E-03
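The Bonferroni-corrected threshold mentioned in the Figure 3 caption divides the significance level by the number of tests. The sketch below assumes a significance level of 0.05 and one test per high-quality variant from Table 1 (206,633). Neither value is stated in this section as the basis of the correction, so the resulting threshold is illustrative only. The p-values are the ciprofloxacin entries of Table 3.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


N_VARIANTS = 206_633  # high-quality variants in Table 1; using this as the test count is an assumption
ALPHA = 0.05          # assumed family-wise significance level

threshold = bonferroni_threshold(ALPHA, N_VARIANTS)

# Ciprofloxacin p-values copied from Table 3 (a subset of rows).
ciprofloxacin_p = {
    "parC S80I": 7.37e-12,
    "gyrA D87N": 7.37e-12,
    "gyrA S83L": 7.13e-08,
    "valS R733": 8.50e-07,
    "bdcA G135S": 1.00e-05,
}

significant = [name for name, p in ciprofloxacin_p.items() if p < threshold]
print(f"threshold = {threshold:.3g}")
print(significant)
```

Under these assumptions only the parC and gyrA mutations pass for ciprofloxacin, which is consistent with the three ciprofloxacin rows listed in Table 2.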
Figure 4: Linkage disequilibrium. High values (red) indicate a dependence of the loci. As expected, the loci in gyrA and parC are in linkage disequilibrium. Importantly, they are not in LD with the remaining novel candidate loci. Interestingly, there is some dependence within the novel loci; in particular, bdcA is in LD with valS. (Loci shown: 2339162 (gyrA D87N), 2339173 (gyrA S83L), 3165735 (parC S80I), 4473651 (bdcA G135S), 4482482 (valS D452), 4486808 (lptF Q197), 4487635 (lptG V106), 4481216 (valS E874), 4481393 (valS N815), 4481639 (valS R733), 4482440 (valS L466), 4482443 (valS V465), 240711 (ivy T123).)

Figure 5: 3D structures of a) bdcA and b) valS. Significant mutations (red) are at the surface and bdcA G135S is near the active site (green).

Figure 6: a) MDS plot: dimension reduction of isolates represented as high-dimensional vectors of all mutations. Four clusters are found, which reflect the population structure in the GWAS model and which broadly coincide with phylogroups A, B1, B2, and D. b) Same as a) but with hierarchical clustering. Here, the presence of a mutation is shown in black and its absence in gray.
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
NM, TB, and MS conceived the idea; TB contributed data; NM, AA, and MS analysed data; NM and MS wrote the article.

Acknowledgements
We would like to thank Norhan Mahfouz, Eric Achatz, and Serena Caucci for an initial analysis of the data and valuable input and Magali De La Cruz Barron, Uli Klümper, Amay Ajaykumar Agrawal, Aldo Acevedo, Claudio Duran, and Mahmood Nazari for feedback. Funding of the ACRAS-R project is kindly acknowledged.

Author details
1 Biotechnology Center (BIOTEC), Technische Universität Dresden, Tatzberg 47-49, 01307 Dresden, Germany. 2 Institute of Hydrobiology, Technische Universität Dresden, Germany.

References
1. Emmerson, A., Jones, A.: The quinolones: decades of development and use. Journal of Antimicrobial Chemotherapy 51(suppl 1), 13–20 (2003) 2. Bisacchi, G.S.: Origins of the quinolone class of antibacterials: an expanded “discovery story” miniperspective. Journal of medicinal chemistry 58(12), 4874–4882 (2015) 3. Crumplin, G., Smith, J.: Nalidixic acid and bacterial chromosome replication. Nature 260(5552), 643–645 (1976) 4. Aldred, K.J., Kerns, R.J., Osheroff, N.: Mechanism of quinolone action and resistance. Biochemistry 53(10), 1565–1574 (2014) 5. Conrad, S., Saunders, J.R., Oethinger, M., Kaifel, K., Klotz, G., Marre, R., Kern, W.: gyra mutations in high-level fluoroquinolone-resistant clinical isolates of escherichia coli. Journal of Antimicrobial Chemotherapy 38(3), 443–456 (1996) 6.
Hirschhorn, J.N., Daly, M.J.: Genome-wide association studies for common diseases and complex traits. Nature reviews genetics 6(2), 95 (2005) 7. Power, R.A., Parkhill, J., de Oliveira, T.: Microbial genome-wide association studies: lessons from human gwas. Nature reviews genetics 18(1), 41 (2017) 8. Chen, P.E., Shapiro, B.J.: The advent of genome-wide association studies for bacteria. Current opinion in microbiology 25, 17–24 (2015) 9. Lees, J.A., Vehkala, M., Välimäki, N., Harris, S.R., Chewapreecha, C., Croucher, N.J., Marttinen, P., Davies, M.R., Steer, A.C., Tong, S.Y., et al.: Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nature communications 7, 12797 (2016) 10. Tenaillon, O., Skurnik, D., Picard, B., Denamur, E.: The population genetics of commensal escherichia coli. Nature Reviews Microbiology 8(3), 207–217 (2010) 11. Rasko, D.A., Rosovitz, M., Myers, G.S., Mongodin, E.F., Fricke, W.F., Gajer, P., Crabtree, J., Sebaihia, M., Thomson, N.R., Chaudhuri, R., et al.: The pangenome structure of escherichia coli: comparative genomic analysis of e. coli commensal and pathogenic isolates. Journal of bacteriology 190(20), 6881–6893 (2008) 12. Greenman, C., Stephens, P., Smith, R., Dalgliesh, G.L., Hunter, C., Bignell, G., Davies, H., Teague, J., Butler, A., Stevens, C., et al.: Patterns of somatic mutation in human cancer genomes. Nature 446(7132), 153–158 (2007) 13. Sharma, Y., Miladi, M., Dukare, S., Boulay, K., Caudron-Herger, M., Groß, M., Backofen, R., Diederichs, S.: A pan-cancer analysis of synonymous mutations. Nature communications 10(1), 1–14 (2019) 14. Kimchi-Sarfaty, C., Oh, J.M., Kim, I.-W., Sauna, Z.E., Calcagno, A.M., Ambudkar, S.V., Gottesman, M.M.: A” silent” polymorphism in the mdr1 gene changes substrate specificity. Science 315(5811), 525–528 (2007) 15. 
D’Costa, V.M., King, C.E., Kalan, L., Morar, M., Sung, W.W., Schwarz, C., Froese, D., Zazula, G., Calmels, F., Debruyne, R., et al.: Antibiotic resistance is ancient. Nature 477(7365), 457–461 (2011) 16. Berendonk, T.U., Manaia, C.M., Merlin, C., Fatta-Kassinos, D., Cytryn, E., Walsh, F., Bürgmann, H., Sørum, H., Norström, M., Pons, M.-N., et al.: Tackling antibiotic resistance: the environmental framework. Nature Reviews Microbiology 13(5), 310–317 (2015) 17. Mahfouz, N., Caucci, S., Achatz, E., Semmler, T., Guenther, S., Berendonk, T.U., Schroeder, M.: High genomic diversity of multi-drug resistant wastewater escherichia coli. Scientific reports 8(1), 8928 (2018) 18. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al.: The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome research 20(9), 1297–1303 (2010) 19. Van der Auwera, G.A., Carneiro, M.O., Hartl, C., Poplin, R., Del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., et al.: From fastq data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in bioinformatics 43(1), 11–10 (2013) 20. Narasimhan, V., Danecek, P., Scally, A., Xue, Y., Tyler-Smith, C., Durbin, R.: Bcftools/roh: a hidden markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32(11), 1749–1751 (2016) 21. Cingolani, P., Platts, A., Wang, L.L., Coon, M., Nguyen, T., Wang, L., Land, S.J., Lu, X., Ruden, D.M.: A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2), 80–92 (2012) 22. Lees, J.A., Galardini, M., Bentley, S.D., Weiser, J.N., Corander, J.: pyseer: a comprehensive tool for microbial pangenome-wide association studies. 
Bioinformatics 34(24), 4310–4312 (2018) 23. Cook, D.E., Andersen, E.C.: Vcf-kit: assorted utilities for the variant call format. Bioinformatics 33(10), 1581–1582 (2017) 24. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., De Bakker, P.I., Daly, M.J., et al.: Plink: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics 81(3), 559–575 (2007) 25. Shin, J.-H., Blay, S., McNeney, B., Graham, J., et al.: Ldheatmap: an r function for graphical display of pairwise linkage disequilibria between single nucleotide polymorphisms. Journal of Statistical Software 16(3), 1–10 (2006) 26. Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S.K., Cook, H., Mende, D.R., Letunic, I., Rattei, T., Jensen, L.J., et al.: eggnog 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research 47(D1), 309–314 (2019) 27. Lord, D.M., Baran, A.U., Wood, T.K., Peti, W., Page, R.: Bdca, a protein important for escherichia coli biofilm dispersal, is a short-chain dehydrogenase/reductase that binds specifically to nadph. PloS one 9(9), 105751 (2014) 28. Ma, Q., Yang, Z., Pu, M., Peti, W., Wood, T.K.: Engineering a novel c-di-gmp-binding protein for biofilm dispersal. Environmental microbiology 13(3), 631–642 (2011) 29. Evans, D., Allison, D., Brown, M., Gilbert, P.: Susceptibility of pseudomonas aeruginosa and escherichia coli biofilms towards ciprofloxacin: effect of specific growth rate. Journal of Antimicrobial Chemotherapy 27(2), 177–184 (1991) 30. Høiby, N., Bjarnsholt, T., Givskov, M., Molin, S., Ciofu, O.: Antibiotic resistance of bacterial biofilms. International journal of antimicrobial agents 35(4), 322–332 (2010) 31. 
Manickam, Y., Chaturvedi, R., Babbar, P., Malhotra, N., Jain, V., Sharma, A.: Drug targeting of one or more aminoacyl-trna synthetase in the malaria parasite plasmodium falciparum. Drug discovery today 23(6), 1233–1240 (2018) 32. Agarwal, V., Nair, S.K.: Aminoacyl trna synthetases as targets for antibiotic development. MedChemComm 3(8), 887–898 (2012) 33. Garoff, L., Huseby, D.L., Praski Alzrigat, L., Hughes, D.: Effect of aminoacyl-trna synthetase mutations on susceptibility to ciprofloxacin in escherichia coli. Journal of Antimicrobial Chemotherapy 73(12), 3285–3292 (2018) 34. Abergel, C., Monchois, V., Byrne, D., Chenivesse, S., Lembo, F., Lazzaroni, J.-C., Claverie, J.-M.: Structure and evolution of the ivy protein family, unexpected lysozyme inhibitors in gram-negative bacteria. Proceedings of the National Academy of Sciences 104(15), 6394–6399 (2007) 35. Ruiz, N., Gronenberg, L.S., Kahne, D., Silhavy, T.J.: Identification of two inner-membrane proteins required for the transport of lipopolysaccharide to the outer membrane of escherichia coli. Proceedings of the National Academy of Sciences 105(14), 5537–5542 (2008) 36. Luo, H., Lin, Y., Liu, T., Lai, F.-L., Zhang, C.-T., Gao, F., Zhang, R.: Deg 15, an update of the database of essential genes that includes built-in analysis tools. Nucleic Acids Research (2020) 37. Ma, Q., Zhang, G., Wood, T.K.: Escherichia coli bdca controls biofilm dispersal in pseudomonas aeruginosa and rhizobium meliloti. BMC Research Notes 4(1), 447 (2011) 38.
Stoppe, N.d.C., Silva, J.S., Carlos, C., Sato, M.I., Saraiva, A.M., Ottoboni, L.M., Torres, T.T.: Worldwide phylogenetic group patterns of escherichia coli from commensal human and wastewater treatment plant isolates. Frontiers in microbiology 8, 2512 (2017) 39. Hurdle, J.G., O’Neill, A.J., Chopra, I.: Prospects for aminoacyl-trna synthetase inhibitors as new antimicrobial agents. Antimicrobial agents and chemotherapy 49(12), 4821–4833 (2005)

10_1101-2021_02_12_430764 ---- Triku: a feature selection method based on nearest neighbors for single-cell data

SOFTWARE

Alex M. Ascensión1,2†, Olga Ibañez-Solé1,2†, Inaki Inza3, Ander Izeta2 and Marcos J. Araúzo-Bravo1,4*

*Correspondence: mararabra@yahoo.co.uk
1 Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, 20014, Donostia-San Sebastian, Spain
Full list of author information is available at the end of the article
†Equal contributor

Abstract
Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Triku is a feature selection method that favours genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph.
Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on mutual information and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms, and contain fewer ribosomal and mitochondrial genes. Triku is available at https://gitlab.com/alexmascension/triku.

Keywords: scRNAseq; feature selection; bioinformatics; python

1 Background
Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the biological heterogeneity of tissues at the individual cell level, allowing the characterization of new cell populations and cell states (i.e. cell types responding to different environmental stimuli) previously undetected due to their low frequency within the tissue and the lack of individual resolution of bulk methods [1, 2].

scRNA-seq datasets are multidimensional, i.e. the expression profile per cell consists of multiple genes. Two common characteristics of multidimensional datasets are their high dimensionality and their sparsity, which are worsened in single-cell datasets due to the high proportion of zeros from low signal recovery [3]. This sparsity affects downstream methods such as cell type detection or differential gene expression [4].

A common task when working with multidimensional datasets is feature selection (FS). FS, alongside feature extraction (FE), addresses the need to obtain a reduced dataset with a smaller dimensionality [5]. While FE methods like Principal Component Analysis (PCA) extract new features based on combinations of the original features, FS methods aim to select a subset of the features that best explains the original dataset. There are three main types of FS methods: filter, wrapper and embedded methods [5]. Current FS methods in scRNA-seq analysis are filter methods because common downstream analysis steps do not embed the FS within the pipeline [6].
FS methods represent a key step in processing pipelines of bioinformatic datasets and provide several advantages [5]: they reduce model overfitting risk, improve clustering quality, and favour a deeper insight into the underlying processes that generated the data (features, i.e. genes, that contain random noise do not contribute to the biology of the dataset and are removed). Specifically, in scRNA-seq, removing non-informative features can improve results in downstream analyses such as differential gene expression.

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.12.430764; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Early methods for FS in scRNA-seq data were based on the idea that genes whose expression shows a greater dispersion across the dataset are the ones that best capture the biological structure of the dataset. Conversely, genes that are evenly expressed across cells are unlikely to define cell types or cell functions in a heterogeneous dataset. The most straightforward way of selecting genes that are not evenly expressed is to look at a measure of dispersion of the counts of each gene and to select those genes that have a dispersion over a threshold. However, the correlation between mean expression and dispersion introduces a bias whereby genes with higher expression are more likely to be selected by FS methods. Yet biological gene markers that define minor cell types are usually expressed in a medium to small subset of cells.
Therefore, new FS methods based on dispersion are designed to correct for this dispersion/expression correlation to select genes with a broader expression spectrum. Brennecke et al. [7] developed a FS method that introduces a correction over the dispersion that accounts for differences in the mean expression of genes. It does so by setting a threshold to the correlation between the average gene expression and its coefficient of variation across cells. Newer FS methods have arisen after different corrections, like the one originally described by Satija et al. [8] implemented in Seurat, later adapted to scanpy [9], or the one implemented in scry [10].

A new generation of FS methods emerged when Svensson discovered that the proportion of zeros in droplet-based scRNA-seq data, originally assumed to be dropouts, was tightly related to the mean expression of genes, following a negative binomial (NB) curve [11]. Genes with a lower than expected percentage of zeros tend to have an even expression across the entire set of cells. Conversely, genes with a higher than expected percentage of zeros might possess biological relevance because they are expressed in fewer cells than expected, and these cells might be associated with a specific cell type or state.

This finding opened the path for new FS methods that rely on genes that show a greater than expected proportion of zeros, according to their mean expression. These methods are based on a null distribution of some property of the dataset, and genes whose behavior differs from the expected one are selected. The FS method nbumi, a negative binomial method based on M3Drop [12], works under this premise. Nbumi fits the NB zero-count probability distribution to the dataset, and selects genes of interest by calculating p-values of observed dropout rates. M3Drop works similarly by fitting a Michaelis-Menten model instead of the NB from nbumi.
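The comparison of observed and expected zero fractions described above can be sketched in a few lines of Python. Under a negative binomial with mean mu and dispersion phi (variance mu + phi * mu^2), the probability of a zero count is (1 + phi * mu)^(-1/phi). The dispersion value, the toy counts, and the function names are illustrative assumptions; this is not the fitting procedure of nbumi or M3Drop.

```python
import math


def nb_zero_probability(mean, dispersion):
    """P(count = 0) for a negative binomial with the given mean and dispersion."""
    if dispersion == 0:
        return math.exp(-mean)  # Poisson limit as dispersion goes to 0
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)


def zero_excess(counts, dispersion):
    """Observed minus expected fraction of zeros for one gene."""
    mean = sum(counts) / len(counts)
    observed = counts.count(0) / len(counts)
    return observed - nb_zero_probability(mean, dispersion)


PHI = 0.5  # assumed dispersion shared by both toy genes

# Two hypothetical genes with the same mean expression (1.2 counts per cell).
evenly_expressed = [1, 2, 1, 0, 2, 1, 1, 2, 1, 1]
localized = [0, 0, 0, 0, 0, 0, 0, 0, 6, 6]

print(zero_excess(evenly_expressed, PHI))  # negative: fewer zeros than expected
print(zero_excess(localized, PHI))         # positive: more zeros than expected
```

A gene whose counts are concentrated in a few cells has more zeros than its mean expression predicts, which is the signal these methods look for.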
In summary, existing FS methods assume that an unexpected distribution of counts for a particular gene in a dataset is explained by cells belonging to different cell types. However, we observe that there are three main patterns of expression according to the distribution of zeros of a particular gene and overall transcriptional similarity (expression of all genes), as explained in detail in Figure 1: a) a gene evenly expressed across cells, or a gene expressed by a subset of cells, which can be b1) transcriptionally separate or b2) transcriptionally similar. Thus, in some cases a particular gene shows an unexpected distribution of counts because a subset of cells is expressing it, but those cells might not be transcriptionally similar.

Here we present triku, a FS method that selects genes that show an unexpected distribution of zero counts and whose expression is localized in cells that are transcriptomically similar. Figure 2 summarizes the feature selection process. Triku identifies genes that are locally overexpressed in groups of neighboring cells by inferring the distribution of counts in the vicinity of a cell and computing the expected distribution of counts. Then, the Wasserstein distance between the observed and the expected distributions is computed and genes are ranked according to that distance. Higher distances imply that the gene is locally expressed in a subset of transcriptionally similar cells. Finally, a subset of relevant features is selected using a cutoff value for the distance.
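The distance step can be illustrated with a short sketch of the one-dimensional Wasserstein distance between two count distributions, computed as the summed absolute difference of their cumulative distributions over integer counts. The probability vectors below are made-up examples; how triku derives the observed and expected distributions is described in its Methods and is not reproduced here.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions over counts 0, 1, 2, ...

    p[k] and q[k] are the probabilities of observing a count of k.
    For unit-spaced support the distance equals sum_k |CDF_p(k) - CDF_q(k)|.
    """
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))  # pad the shorter vector with zeros
    q = list(q) + [0.0] * (n - len(q))
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical distributions of a gene's counts among neighboring cells.
expected = [0.5, 0.3, 0.2]  # what a random neighborhood would give
observed = [0.2, 0.3, 0.5]  # shifted toward higher counts: locally expressed

distance = wasserstein_1d(observed, expected)
print(distance)
```

Ranking genes by such a distance and keeping those above a cutoff mirrors the last two steps of the paragraph above, with the cutoff itself left unspecified here.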
Triku outperforms other feature selection methods on benchmarking and artificial datasets, using unbiased evaluation metrics such as Normalized Mutual Information (NMI) or Silhouette. Of note, features selected by triku are more biologically meaningful.

2 Results
The objective of FS methods is to select the features that are the most relevant in order to understand and explain the structure of the dataset. In the context of single-cell data, this means finding the subset of genes that, when given as input to a clustering method, will yield a clustering solution where each cluster can be annotated as a putative cell type.

Initially, we generated artificial datasets with the splatter package [13], so that cells belonging to the same cluster have a similar gene expression. All datasets contained the same number of genes, cells and populations, but differed in the de.prob parameter value. This parameter was set so that higher values indicate a higher probability of genes being differentially expressed, resulting in more resolved populations. Eight de.prob values, from 0.0065 to 0.3, were used (see Methods).

In addition, we tested triku on two biological benchmarking datasets by Ding et al. [14] and Mereu et al. [15] that have been expert-labeled using a semi-supervised procedure. Both benchmarking datasets are composed of individual subsets of data with different library preparation methods (10X, SMART-seq2, etc.) in human Peripheral Blood Mononuclear Cells (PBMCs) (Mereu and Ding) and mouse colon (Mereu) and cortex (Ding) cells.

We have evaluated the relevance of the features selected by triku by comparing them to the ones selected using other feature selection methods. The relevance of the features was first measured using metrics associated with the efficacy of clustering, and then using metrics to evaluate the quality of the genes selected.
We made six types of comparisons between the subsets of genes selected by each feature selection method: 1) the ability to recover basic dataset structure (main cell types) in artificial and biological datasets, 2) the ability to obtain transcriptomically distinct cell clusters, 3) the overlap of features between different FS methods, 4) the localized pattern of expression of the features selected, 5) the ability to avoid the overrepresentation of mitochondrial and ribosomal genes and 6) the biological relevance of the genes by studying the composition and quality of the gene ontology (GO) terms obtained.

2.1 triku efficiently recovers cell populations present in sc-RNAseq datasets
The first set of metrics evaluates the ability to recover the original cell types based on the NMI index, and the cluster separation and cohesion using the Silhouette coefficient.

2.1.1 NMI
NMI measures the correspondence between a labelling considered as the ground truth and the clustering solution that we obtained using the genes selected by triku and other FS methods (scanpy, std, scry, brennecke, m3drop, nbumi).

First, we evaluated how well the clustering using the genes selected by the FS methods was able to recover the same populations that were defined when generating the artificial datasets. Figure 3 shows that triku is among the best three feature selection methods for a wide range of de.prob values.
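As a reminder of what NMI quantifies, the following stdlib-only sketch computes it for two labelings of the same cells. It normalizes the mutual information by the arithmetic mean of the two label entropies, which is one common convention; the normalization used in the study is not stated in this section, so that choice, like the toy labels, is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())


def nmi(a, b):
    """Mutual information normalized by the mean entropy (assumed convention)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings are a single group: identical partitions
    return mutual_information(a, b) / ((h_a + h_b) / 2)


# Hypothetical ground-truth cell types and two clustering solutions.
cell_types = ["T", "T", "B", "B", "NK", "NK"]
perfect = [0, 0, 1, 1, 2, 2]       # same partition, different names
unrelated = [0, 1, 0, 1, 0, 1]     # independent of the cell types

print(nmi(cell_types, perfect))
print(nmi(cell_types, unrelated))
```

NMI depends only on how the cells are partitioned, not on the label names, so a clustering that reproduces the cell types under different cluster identifiers still scores 1.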
For low values of de.prob –below 0.05–, where the selection of genes that lead to a correct recovery of cell populations is more challenging, triku notably outperforms the rest of the FS methods. NMI values obtained with triku are 0.1 to 0.2 higher than the second and third best FS methods. In addition, the results obtained when using the first 250 selected genes were comparable to those obtained when selecting 500 genes. We also studied how well the genes selected led to a clustering solution that was similar to the manually-assigned cell labels in the biological benchmarking datasets, as shown in Figure 4. For each dataset, the variability between NMI scores was quite low, meaning that features selected with the different methods yielded clustering so- lutions that were quite similar to the manually-labeled cell types, although there are some exceptions to this rule–e.g. Brennecke in Ding datasets, which showed notably reduced NMI values–. In some datasets, for instance, 10X human, QUARTZseq hu- man and SMARTseq2 human from Mereu’s benchmarking set, features selected by FS methods did not lead to increased NMI values as compared with randomly selected genes. Despite the differences in NMI between methods being small for each particular dataset, post-hoc analysis revealed that triku is significantly the best ranked method across all datasets. To do the post-hoc analysis, we ranked for each dataset the NMI of each FS method. Figure 4 (left) shows the mean rank of each FS method across datasets. Triku is the best-ranked FS method in both Mereu’s and Ding’s benchmarking datasets, with a mean rank of 2.7 and 2.8, respectively. M3drop is the second best-ranked FS method and triku is in both cases statistically significantly better (Quade test, p < 0.05). 2.1.2 Silhouette coefficient Another important aspect of the genes selected by FS methods in scRNA-seq data analysis is their ability to cluster data into well-separated groups that are transcrip- tomically similar. 
We used the Silhouette coefficient to measure the compactness and degree of separation of the cell communities obtained with a clustering method. When the same clustering algorithm is applied to a dataset using two different FS methods, the differences in the resulting Silhouette coefficients can be entirely attributed to the features selected by those methods. We assume that FS methods that increase the separation between clusters and the compactness within clusters are better at recovering the cell types present in the dataset.
Figure 5 shows the Silhouette coefficients obtained with the different FS methods. For the Mereu and Ding datasets, we observed that triku was the best-ranked method (mean rank of 1.8 and 1.1, respectively), and the second best-ranked methods were m3drop and scanpy, with a mean rank of 3.8 and 2.2, respectively. In both cases, the difference between triku and the second-ranked method was statistically significant (Quade test, p < 0.05).
We performed an additional analysis using the labels obtained with leiden clustering instead of the manually curated cell types (Figure S1). Again, triku outperformed the rest of the FS methods, showing a statistically significant best mean rank.

2.2 Genes selected by different FS methods show limited overlap
Next, we studied the characteristics of the genes selected by triku and compared them to the genes selected by other methods.
Initially, we studied the level of consistency between the results obtained using different FS methods by studying their degree of overlap, as shown in Figure 6. In order to compare equally sized gene lists, we ranked the genes based on the p-value or score from each FS method and used the number of genes selected by triku as the cutoff for the top-ranked genes. Although the genes selected by the different methods yielded clustering solutions that are highly consistent, as shown in the previous section, we did not see any clear gene overlap pattern between pairs of FS methods. In fact, there is no correlation between the degree of overlap between the genes selected by the different methods and the clustering solutions that are obtained when using those genes as input. For instance, we found an overlap of 11% between the genes selected by scanpy and std for the 10x mouse dataset, yet the NMI between the clustering solutions obtained with each of them and the expert-labeled cell types was 0.7. On the other hand, the overlap between scanpy and Brennecke is one of the highest across datasets (ranging from 26 to 67%), yet their corresponding NMI scores differ by 0.45.

2.3 triku selects genes that are biologically relevant
Based on these results, we studied the biological relevance of the genes selected by different FS methods in three alternative ways. Genes whose expression, or lack thereof, is limited to a single population are more likely to be cell-type specific and thus might be better candidates as positive or negative cell population markers. Therefore, we studied which FS methods are best at selecting genes showing a localized expression pattern. Mitochondrial and ribosomal genes are usually highly expressed, and many FS methods tend to overselect them even though they are not particularly relevant in most single-cell studies and are commonly excluded from downstream analysis [16, 17, 18].
Assuming that in these benchmarking datasets ribosomal and mitochondrial genes are not as relevant to the biology of the dataset, we measured the percentage of these genes in the subset of genes selected by triku and compared it to other FS methods. Lastly, we analyzed the biological pertinence of the selected genes by performing Gene Ontology Enrichment Analysis (GOEA) on a dataset of immune cell populations whose underlying biology is well understood, as a robust indicator of FS quality.

2.3.1 Selection of locally-expressed genes
We first studied the expression pattern of genes selected by triku and other methods, as shown in Figure S2. We observed that, out of the 9 populations of the artificial dataset, when a gene is selected by triku (exclusively or together with other FS methods), one of the populations had a markedly higher or lower expression compared to the rest. On the other hand, when a gene is selected by other FS methods and not by triku, we do not observe any population-specific expression pattern. For instance, genes exclusively selected by scanpy had a wide expression variation across clusters, but they were not exclusive to one or two clusters. Features selected by std and scry showed some variation, but it was overshadowed by the high expression of the gene, and therefore not relevant under the previous premise.
To evaluate the cluster expression of selected genes in benchmarking datasets, for each gene we scaled its expression to the 0-1 range, and sorted the clusters so that the first one had the greatest expression. Figure S3 shows the expression patterns for several benchmarking datasets. We see that, in most datasets, triku showed more biased expression patterns, that is, genes selected by triku were expressed, on average, in fewer clusters than the genes selected by other FS methods. The second and third best methods were scanpy and brennecke, with similar or slightly less biased expression patterns as compared to triku. With these methods, up to 80% of the expression of the gene was usually restricted to the 2 to 3 clusters that most express it. m3drop and nbumi performed similarly, and showed an expression distribution across clusters similar to a random selection of genes, which was slightly biased towards 3 to 5 clusters accumulating up to 80% of the expression of the gene. Lastly, the std and scry methods were the least biased, and showed an almost linear decrease of expression percentage across clusters, with 4 to 6 clusters accumulating up to 80% of the expression of the gene.

2.3.2 Avoidance of mitochondrial and ribosomal genes
Table 1 shows the percentage of genes that code for ribosomal and mitochondrial proteins within the genes selected by different FS methods in the two sets of benchmarking datasets. We observed that std and scry were the only methods that tended to overselect mitochondrial and ribosomal genes. Among the remaining methods, triku showed comparable percentages, and slightly lower ones for the Ding datasets.

2.3.3 Selection of genes based on gene ontologies
We assessed the quality of the GO output by studying its term composition. We selected two PBMC datasets from the Ding datasets: the 10X human and the Dropseq human.
We used PBMC datasets for this analysis because their cell-to-cell variability has been extensively studied using single-cell technologies such as Fluorescence Activated Cell Sorting (FACS) and scRNA-seq [19, 20, 21, 22, 23]. Using these datasets, we measured the proportion of GO terms obtained in the output that were tightly related to the biological system under study.
Figures 7 and S4 show the first 25 GO terms obtained with the genes selected by each FS method on the two PBMC datasets, where the terms tightly related to immune processes (chosen by three independent assessors) have been highlighted. We observed that triku was the FS method that yielded the most terms directly related to immune processes, with 23/25 and 19/25 related terms in the Ding Dropseq and 10X datasets, respectively. Examples of terms that we considered to be tightly related to immune processes included B cell receptor signalling pathway, neutrophil degranulation and T cell proliferation. The next methods were scanpy and m3drop, whose performances were comparable to that of triku for the 10X dataset (23/25) but less robust for the Dropseq dataset (10/25 and 9/25 related terms). The rest of the FS methods mainly selected genes that were related to general cell functions such as RNA processing, protein processing and cell-cycle regulation.

3 Discussion
FS methods are a key step in any scRNA-seq analysis pipeline, as they help us obtain a dimensionally reduced version of the dataset that captures the most relevant information and eases the interpretation and understanding of its underlying biology. However, every FS method relies on a set of assumptions regarding what characteristics make a gene relevant. FS methods that sort genes according to their dispersion assume that gene expression variability is indicative of biological relevance. FS methods like nbumi and m3drop assume that genes showing a proportion of zero-counts that is greater than expected (according to a null distribution) are more likely to be informative. Triku assumes that genes that have a localized expression in a subset of cells that share an overall transcriptomic similarity are more likely to define cell types. A general trend in FS method design has been to refine the requirements that a gene must meet in order to be selected, from the more general dispersion-based formulations to more sophisticated ones. It is noteworthy that the requirements in triku are consistent with the previous dispersion-based and zero-count-based formulations, but involve a new aspect that we consider essential for an accurate gene selection: a localized expression in neighboring cells. Another important advantage of triku over FS methods that consider the zero-count distribution is that, unlike m3drop and nbumi, triku does not assume gene counts to follow any particular distribution, since it estimates the null distribution from the dataset, thus extending the range of single-cell technologies it can be used with beyond droplet-based technologies.
We verified the locality of the genes selected by triku in different artificial and real scRNA-seq datasets and concluded that, on average, the expression of triku-selected genes is restricted to fewer, well-defined clusters.
In addition, the clusters obtained when using triku-selected genes as input for unsupervised clustering in both artificially generated and biological datasets have a better resolved pattern structure, as shown by their increased Silhouette coefficients. In the case of artificial datasets, where the degree of mixture between clusters can be predefined, triku proved to be able to recover the originally defined cell populations. In fact, we found that the higher the degree of mixture between clusters, the more obvious the advantage of triku over the rest of the FS methods tested.
An important difficulty in the interpretation of single-cell data is that cell-to-cell variability has both technical and biological components. That is, it is difficult to know whether a set of genes is differentially expressed between cell clusters due to technical reasons (differences in the efficiency of mRNA capture, amplification and sequencing) or whether it constitutes a biological signal. Moreover, there is a wide range of sources of biological variability within a dataset, some of which might not be of interest depending on the experimental context. For instance, fluctuations in genes that regulate the cell cycle constitute a source of biological variability that is often disregarded. This has been extensively studied and addressed in a number of ways: normalization, regression of unwanted sources of variation, etc. [24, 25, 26, 27].
Genes whose variability is associated with technical reasons tend to have a high dispersion, but their expression is usually not restricted to a few clusters. Good examples are the ribosomal and mitochondrial genes, which are expressed across all cell types at different levels. Our results show that these genes are in fact selected by the majority of compared FS methods due to their high expression and cell-to-cell variability, but are less likely to be selected by triku, since they do not usually meet the locality requirement. Additionally, when performing GOEA, we observed that the list of genes obtained with triku was more enriched for terms that are specifically related to a biological process of the system under study.
In our work, we have observed that the genes selected by different FS methods might show little overlap between them. This phenomenon has been described elsewhere [28]. In fact, gene covariation and redundancy is a well-characterized effect that has been observed in omics studies. The effect of redundancy arises from the fact that different cell types must have a large common set of active pathways. The difference between cell type and cell state is that two cell types might have large sets of pathways that differ from each other, whereas two cell states will only differ in a few pathways. Since pathways are composed of many genes, choosing only a reduced set of genes from the pathways of cell types A and B might be enough to differentiate them, and we might not need to select all genes from all pathways. This “paradigm” explains several effects. Qiu et al. described that scRNA-seq datasets could preserve basic structure after gene expression binarization [29] or by conducting very shallow sequencing experiments [4].
This can be explained by the fact that only a few genes are necessary to describe the main cell populations in a single-cell dataset, and the presence/absence of a certain marker is often more informative than its expression level. This is related to the notion that, despite the high dimensionality of omics studies, most biological systems can be explained in a reduced number of dimensions. Moreover, some authors have claimed this low dimensionality to be a natural and fundamental property of gene expression data [4]. This highlights the importance of designing accurate FS methods that extract the fundamental information from single-cell datasets.
The triku Python package is available at https://gitlab.com/alexmascension/triku and can be installed from PyPI. Triku has been designed to be compatible with scanpy syntax, so that scanpy users can easily include triku in their pipelines.

4 Methods
The triku workflow is further described in Supplementary Methods.

4.1 Artificial and benchmarking datasets
In order to evaluate the FS methods, we used a set of artificial and biological benchmarking datasets.
Artificial datasets were constructed using the splatter R package (v 1.10.1). Each dataset contains 10,000 cells and 15,000 genes, and consists of 9 populations with abundances in the dataset of {25%, 20%, 15%, 10%, 10%, 7%, 5.5%, 4%, 3.5%} of the cells.
Each dataset contains a parameter, de.prob, that controls the probability that a gene is differentially expressed. Lower de.prob values (< 0.05) imply that different populations have fewer differentially expressed genes between them and, therefore, are more difficult to differentiate. Selected values of de.prob are {0.0065, 0.008, 0.01, 0.016, 0.025, 0.05, 0.1, 0.3}. Populations in datasets with de.prob values above 0.05 are completely separated in the low-dimensionality representation with UMAP, even without feature selection (Figure S5).
Regarding biological datasets, two benchmarking datasets have been recently published by Mereu et al. [15] and Ding et al. [14]. The aim of these two works is to analyze the diversity of library preparation methods, e.g. 10X, SMART-seq2, CEL-seq2, single nucleus or inDrop. Mereu et al. use mouse colon cells and human PBMCs to perform the benchmarking, whereas Ding et al. use mouse cortex and human PBMCs. There are a total of 14 datasets in Mereu et al. and 9 in Ding et al. An additional characteristic of these datasets is that they have been manually annotated, and this annotation is useful as a semi ground truth.
Ding dataset files were downloaded from Single Cell Portal (accession numbers SCP424 and SCP425), and cell type metadata is located within the downloaded files. Mereu datasets were downloaded from the GEO database (accession GSE133549), and cell type metadata was obtained upon personal request.

4.2 FS methods
Triku is compared to the following FS methods:
• Standard deviation (std). Computed directly using Numpy (v 1.18.3).
• brennecke [7]: fits a curve based on the square of the coefficient of variation (CV²) versus the mean expression (µ) of each gene and selects the features with higher CV² and µ. The features are selected with the BrenneckeGetVariableGenes function from M3Drop R package (v 1.12.0).
• scry [10]: computes a deviance statistic for counts based on a multinomial model that assumes each feature has a constant rate. The features are selected with the devianceFeatureSelection function from scry R package (v 0.99.0).
• scanpy [9]: selects features based on a z-scored deviation, adapted from Seurat's method. The features are selected with the sc.pp.highly_variable_genes function from scanpy (v 1.6.0).
• M3Drop [12]: fits a Michaelis-Menten equation to the percentage of zeros versus µ, and selects features with higher percentages of zeros than expected. The features are selected with the M3DropFeatureSelection function from M3Drop R package.
• nbumi: acts in the same manner as M3Drop, but fits a negative binomial equation instead of a Michaelis-Menten equation. The features are selected with the NBumiFeatureSelectionCombinedDrop function.

4.2.1 FS and dataset preprocessing
To compare FS methods, each feature is ranked based on the score provided by each FS method. Calculating the ranking instead of just selecting the features allows us to select different numbers of features when needed. By default, the number of features is the one automatically selected by triku. Additionally, in some contexts, analyses are performed with all features or with a random selection of features.
After the ranking of genes is computed, dataset processing is performed equally for all methods, in artificial and benchmarking datasets.
Datasets are first log transformed (if required by the method), and PCA with 30 components is calculated. Then, the k-Nearest Neighbors (kNN) matrix is computed, setting k to the square root of the number of cells (k = √n_cells). Uniform Manifold Approximation and Projection (UMAP) (v 0.3.10) is then applied to reduce the dimensionality for plotting. If community detection is required, leiden (v 0.7.0) is applied, selecting the resolution that matches the number of cell types manually annotated in the dataset.
This procedure is repeated with 10 different seeds. The seed conditions the output of triku, random FS, PCA projection, neighbor graph, leiden community detection, and UMAP.

4.2.2 NMI calculation in artificial and benchmarking datasets
In order to compare the leiden community detection results with the ground-truth labels from artificial and biological datasets, we used the Normalized Mutual Information (NMI) score [30].
If T and L are the labels of the cell types (true populations) and leiden communities respectively, the NMI between T and L is:

$$\mathrm{NMI}(T, L) = \frac{2\, I(T; L)}{H(T) + H(L)}$$

where H(X) is the entropy of the labels X, and I(T; L) is the mutual information between the two sets of labels. This value is further described in [31]. We used scikit-learn (v 0.23.1) implementation of NMI, sklearn.metrics.adjusted_mutual_info_score.
One of the advantages of NMI over other mutual-information-based measures is that it performs better with label sets with class imbalance, which are common in single-cell datasets, where there are differences in the abundance of cell types.
On artificial datasets, leiden was applied using the first 250 and 500 selected features, and the resulting community labels were compared with the population labels from the dataset. On benchmarking datasets, the leiden community labels were compared with the manually curated cell types.
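As an illustration of the NMI formula in Section 4.2.2, the following minimal sketch computes H, I and NMI directly from two label lists using only the Python standard library. It is not the scikit-learn routine named in the text: the function names `entropy`, `mutual_information` and `nmi`, the toy labels, and the convention of returning 1.0 when both labelings consist of a single group are assumptions made for this example.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(X) of a label list (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(true_labels, found_labels):
    """Mutual information I(T; L) between two labelings of the same cells."""
    n = len(true_labels)
    count_t = Counter(true_labels)
    count_l = Counter(found_labels)
    joint = Counter(zip(true_labels, found_labels))
    # p(t, l) * log( p(t, l) / (p(t) * p(l)) ), summed over observed pairs
    return sum(
        (c / n) * log((c * n) / (count_t[t] * count_l[l]))
        for (t, l), c in joint.items()
    )


def nmi(true_labels, found_labels):
    """NMI(T, L) = 2 I(T; L) / (H(T) + H(L))."""
    denom = entropy(true_labels) + entropy(found_labels)
    if denom == 0.0:
        # Assumed convention: two single-group labelings agree perfectly.
        return 1.0
    return 2.0 * mutual_information(true_labels, found_labels) / denom


# Hypothetical toy labels: cell types (T) and leiden communities (L).
cell_types = ["B", "B", "T", "T", "T", "NK"]
communities = [0, 0, 1, 1, 2, 2]
score = nmi(cell_types, communities)
```

Because the score depends only on how cells are grouped, relabelling the communities (for example swapping 0 and 1) leaves the value unchanged.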

4.2.3 Silhouette coefficient in benchmarking datasets
In order to assess the clustering performance of the communities obtained with benchmarking datasets, we used the Silhouette coefficient. The Silhouette coefficient compares the distances of the cells within each cluster (intra-cluster) and between clusters (inter-cluster) within a measurable space. The distance between two cells is the cosine distance between their gene expression vectors, considering only the genes selected by each FS method. The greater the distance between cells that belong to different clusters and the smaller the distance between cells from the same cluster, the greater the Silhouette score.
In order to calculate the Silhouette coefficient for a cell c within cluster C_i (out of n clusters), the mean distance between the cell and the rest of the cells within the cluster is computed using the gene expression:

$$a(c) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq c} d(c, j)$$

Then, the minimum mean distance between that cell and the cells from each of the other clusters is computed:

$$b(c) = \min_{C_k \neq C_i} \left\{ \frac{1}{|C_k|} \sum_{j \in C_k} d(c, j) \right\}, \quad k \in \{1, \dots, n\}$$

Then the Silhouette coefficient is computed as

$$s(c) = \frac{b(c) - a(c)}{\max\{a(c),\, b(c)\}}$$

Higher Silhouette scores imply a better separation between clusters and, therefore, a better performance of the FS method. We used scikit-learn implementation of Silhouette, sklearn.metrics.silhouette_score.
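The definitions of a(c), b(c) and s(c) in Section 4.2.3 can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the scikit-learn function used in the paper: the helper names, the toy expression vectors, and the choice to score singleton clusters (and the degenerate case max(a, b) = 0) as 0 are all assumptions of this example, and expression vectors are assumed to be non-zero so that the cosine distance is defined.

```python
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_of_cell(c, cells, labels):
    """s(c) = (b(c) - a(c)) / max(a(c), b(c)) for the cell at index c."""
    own = labels[c]
    same = [j for j, lab in enumerate(labels) if lab == own and j != c]
    if not same:
        return 0.0  # assumed convention for singleton clusters
    # a(c): mean distance to the other cells of the same cluster
    a = sum(cosine_distance(cells[c], cells[j]) for j in same) / len(same)
    # b(c): smallest mean distance to the cells of any other cluster
    b = min(
        sum(cosine_distance(cells[c], cells[j]) for j in members) / len(members)
        for members in (
            [j for j, lab in enumerate(labels) if lab == other]
            for other in set(labels) - {own}
        )
    )
    top = max(a, b)
    return 0.0 if top == 0.0 else (b - a) / top


def mean_silhouette(cells, labels):
    """Average s(c) over all cells; requires at least two clusters."""
    return sum(
        silhouette_of_cell(c, cells, labels) for c in range(len(cells))
    ) / len(cells)


# Hypothetical expression vectors restricted to two selected genes.
cells = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = mean_silhouette(cells, [0, 0, 1, 1])   # clusters match the structure
bad = mean_silhouette(cells, [0, 1, 0, 1])    # clusters cut across it
```

With these toy vectors, the labelling that groups similar cells scores close to 1, while the labelling that mixes them scores below 0, which is the behaviour the text relies on when comparing FS methods.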

4.2.4 Overlap between gene lists
In order to calculate the overlap between selected features for each FS method, we applied the Jaccard index [32]:

$$\mathrm{jaccard}(i, j) = \frac{|i \cap j|}{|i \cup j|}$$

where i, j are the sets of genes selected by the two FS methods.

4.2.5 Performance of gene selection and locality measures
In order to assess the performance of different FS methods at selecting genes that are relevant for the dataset, we applied two different strategies for artificial and biological datasets.
For artificial datasets, we selected 4 representative genes for each of the combinations of genes shown in Figure S2. Then we calculated the mean expression of each of the four genes in each population, and we represented this information in the barplots.
For benchmarking datasets, to produce Figure S3 we used the following procedure for each dataset and FS method: for each gene, the expression was scaled to sum 1 across all cells. Then, leiden clustering was run with resolution parameter value 1.2. For each cluster, the proportion of the expression was calculated, and the clusters were ordered so that the first cluster is the one that concentrates the majority of the expression. To create Figure S3, the average value of the proportion of expression is calculated.

4.2.6 Proportion of ribosomal and mitochondrial genes
When calculating the proportion of mitochondrial and ribosomal genes, the list of existing ribosomal and mitochondrial genes was obtained by extracting the genes starting with RPS, RPL or MT-.
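A minimal sketch of the quantity described in Section 4.2.6 is given here: the prefix-based list of ribosomal and mitochondrial genes, and the fraction of that list that a given FS method selects. The function names and gene lists are hypothetical, and matching the prefixes case-insensitively (so that mouse symbols such as Rps3 or mt-Co1 are also caught) is an assumption of this example rather than something stated in the text.

```python
PREFIXES = ("RPS", "RPL", "MT-")


def mito_ribo_genes(gene_names):
    """Genes whose symbol starts with RPS, RPL or MT-.

    Assumption: prefixes are matched case-insensitively.
    """
    return {g for g in gene_names if g.upper().startswith(PREFIXES)}


def mito_ribo_proportion(all_genes, selected_genes):
    """Fraction of the dataset's mito/ribo genes that were selected."""
    reference = mito_ribo_genes(all_genes)
    if not reference:
        return 0.0  # no mito/ribo genes in the dataset
    return len(reference & set(selected_genes)) / len(reference)


# Hypothetical dataset genes and FS output.
dataset_genes = ["RPS3", "RPL10", "MT-CO1", "RPS6", "CD3E", "MS4A1"]
selected = ["CD3E", "RPS3", "MS4A1"]
proportion = mito_ribo_proportion(dataset_genes, selected)
```

In this toy case the dataset contains four ribosomal or mitochondrial genes and one of them is selected, so the proportion is 0.25.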
The proportion of mitochondrial or ribosomal genes is the quotient between the number of genes from the previous list that are selected by a given FS method and the number of genes in the list.

4.2.7 GO enrichment analysis
In order to calculate the sets of gene ontologies enriched for the selected features of each FS method, we used the gseapy.enrichr function of the Python module gseapy (v 0.9.17) with the list of the first 1000 selected features against the GO_Biological_Process_2018 ontology. From the list of enriched ontologies, the 25 with the smallest adjusted p-value were selected.

4.2.8 Ranking and CD
During calculation of NMI and Silhouette coefficients, to evaluate the overall performance of the FS methods across different datasets, the FS methods are ranked, where 1 is the best rank. The methodology proposed by Demšar [33] is used to test for significant differences among FS methods in the datasets. The Friedman rank test is applied to test whether the mean rank values for all FS methods are similar (null hypothesis). If the Friedman rank test rejects the null hypothesis (p < 0.05), this implies a statistically significant difference between at least two FS methods. If the null hypothesis is rejected, we apply the Quade post-hoc test between all pairs of FS methods to check which pairs of FS methods are significantly different (p < 0.05). These results are then plotted in a critical difference diagram.

5 Abbreviations
Single-cell RNA sequencing: scRNA-seq; Feature Selection: FS; Feature Extraction: FE; Principal Component Analysis: PCA; Negative Binomial: NB; Normalized Mutual Information: NMI; Fluorescence Activated Cell Sorting: FACS; Gene Ontology: GO; Gene Ontology Enrichment Analysis: GOEA; Peripheral Blood Mononuclear Cells: PBMC; Uniform Manifold Approximation and Projection: UMAP; k-Nearest Neighbors: kNN.

Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and software
Ding dataset files were downloaded from Single Cell Portal (accession numbers SCP424 and SCP425), and cell type metadata is located within the downloaded files. Mereu datasets were downloaded from the GEO database (accession GSE133549), and cell type metadata was obtained upon personal request. Triku software and analysis notebooks are available at https://www.gitlab.com/alexmascension/triku.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by grants from Instituto de Salud Carlos III (AC17/00012 and PI19/01621), cofunded by the European Union (European Regional Development Fund/European Science Foundation, Investing in your future) and the 4D-HEALING project (ERA-Net program EracoSysMed, JTC-2 2017); Diputación Foral de Gipuzkoa, and the Department of Economic Development and Infrastructures of the Basque Government (KK-2019/00006, KK-2019/00093). AMA was supported by a Basque Government Postgraduate Diploma fellowship (PRE_2020_2_0081), and OIS was supported by a Postgraduate Diploma fellowship from la Caixa Foundation (identification document 100010434; code LCF/BQ/IN18/11660065).
Author’s contributions
Conceptualization: AMA; Funding Acquisition: MJA-B, AMA, OI-S; Investigation: AMA, OI-S, MJA-B, AI; Methodology: AMA, OI-S, II; Project Administration: AI, MJA-B; Resources: MJA-B; Software: AMA, OI-S; Supervision: II, AI, MJA-B; Visualization: AMA, OI-S; Writing - Original Draft Preparation: AMA, OI-S; Writing - Review and Editing: AMA, OI-S, II, MJA-B, AI.

Acknowledgements
We would like to thank Amaia Elícegui, Ainhoa Irastorza and Paula Vázquez for the assessment of the immune Gene Ontology terms.

Author details
1Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, 20014, Donostia-San Sebastian, Spain.
2Tissue Engineering Group, Biodonostia Health Research Institute, Paseo Dr. Begiristain, s/n, 20014, Donostia-San Sebastian, Spain.
3Intelligent Systems Group, Computer Science Faculty, University of the Basque Country, Donostia-San Sebastian, Spain.
4Max Planck Institute for Molecular Biomedicine, Roentgenstr. 20, 48149, Muenster, Germany.

Tables
Table 1 Percentage of ribosomal protein (RBP) and mitochondrial (MT) genes appearing within the selected genes by each FS method.

Method      Mereu % RBP   Mereu % MT   Ding % RBP   Ding % MT
triku       1.94          0.06         0.07         0.00
m3drop      2.50          0.46         0.62         0.18
nbumi       1.45          0.11         0.44         0.12
scanpy      1.46          0.11         0.36         0.09
std         8.30          0.97         3.57         0.52
scry        6.34          1.14         1.99         0.46
brennecke   0.54          0.06         0.03         0.00

Figures
[Figure 1 image; axis labels: expression, log10 mean expression, percentage of zeros]
Figure 1 Distribution of gene expression in three scenarios. There are three main patterns of expression for any particular gene in a single-cell dataset: a) The gene is expressed evenly across cells in the dataset, which probably means it does not define any particular cell type.
b) A gene shows an unexpected distribution of zeros because it is only expressed by a subset of cells. Within case b, there are two possible patterns: b1) the gene is highly expressed by a subset of transcriptionally different cells (i.e. cells that are not colocalized in the dimensionally reduced map), and b2) the gene is highly expressed by cells that share an overall transcriptomic profile. Triku preferentially selects the genes shown in the b2 pattern. When looking at the proportion of zeros, genes in cases b1 and b2 show an increased proportion of zeros with respect to a, but they are indistinguishable from each other by that metric.

[Figure 2 image]
Figure 2 Graphical abstract of triku workflow. a) DR representation of the gene expression from the count matrix of a dataset, where each dot represents a cell. b) kNN graph representation with 3 neighbors. For each cell, the k transcriptomically most similar cells are selected (3 in this example). c1) Considering the graph in b), for each cell with positive expression, the expression of its k neighbors is summed to yield the kNN distribution in blue. c2) With the distribution of reads (blue line), the null distribution is estimated by sampling k random cells. d) The null and kNN distributions of each gene are compared using the Wasserstein distance. e) For each gene, its distance is plotted against the log mean expression, and the genes are divided into w windows (4 in this example). For each window, the median of the distances is calculated and subtracted from the distances in that window.
f) All corrected distances are ranked and the cutoff point is selected.

Figure 3 Comparison of NMI for FS methods on artificial datasets. Barplots of the NMI for all FS methods with different artificial datasets, using the top 250 (top) and 500 (bottom) features of each FS method. The probability of the selected genes being differentially expressed between clusters (de.prob) is shown on the X axis. Higher NMI values mean better recovery of the cell populations. Note that in category all, all features are selected, not the top 250 or 500; therefore their NMI values are the same in both graphs.

Figure 4 NMI for annotated cell types in Mereu and Ding datasets. Barplots of NMI for Mereu (top) and Ding (bottom) datasets. Each barplot represents the mean over 5 runs, and the vertical bar is the standard deviation. The plot on the left is a critical difference diagram, where each horizontal bar represents the mean rank for all datasets. If two or more bars are linked by a vertical bar, the mean ranks for those FS methods are not significantly different (Quade test, α = 0.05).
Figure 5 Silhouette coefficients for annotated cell types in Mereu and Ding datasets. Barplots of Silhouette coefficient for Mereu (top) and Ding (bottom) datasets. Each barplot represents the mean of 5 seeds, and the vertical bar is the standard deviation. The plot on the left is a critical difference diagram, where each horizontal bar represents the mean rank for all datasets and all seeds. If two or more bars are linked by a vertical bar, the mean ranks for those FS methods are not significantly different (Quade test, α = 0.05).

Figure 6 Heatmaps of overlap of features between pairs of methods. For each pair of methods, the value represents the proportion of features that are shared between the two methods. The number of genes selected in each method is the automatic cutoff by triku.

[Figure 7 image; legend: triku, scanpy, std, scry, brennecke, m3drop, nbumi]
Figure 7 Barplot of p-values of GOEA. Each bin represents the number of features selected for each method, in the Mereu et al. mouse Dropseq dataset. The y value is the -log10 adjusted p-value for the best 25 ontologies. On the bottom, the bar plot shows the names of the ontology terms for the case with the best 1000 features. In immune datasets, a gray dot at the left of a term indicates that the term is directly related to an immune process.
Non-dotted terms refer to more general processes that may or may not be related to immune processes.

10_1101-2021_02_12_430830 ---- Simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumor sequence data
Gergely Tibély1,2, Dominik Schrempf2, Imre Derényi2,3, Gergely J. Szöllősi1,2,4
1MTA-ELTE “Lendület” Evolutionary Genomics Research Group, Pázmány P. stny. 1A, H-1117 Budapest, Hungary
2Department of Biological Physics, Eötvös University, Pázmány P. stny. 1A, H-1117 Budapest, Hungary
3MTA-ELTE Statistical and Biological Physics Research Group, Pázmány P. stny. 1A, H-1117 Budapest, Hungary
4Institute of Evolution, Centre for Ecological Research, Konkoly-Thege M. út 29-33.
H-1121 Budapest, Hungary
February 12, 2021
bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.12.430830; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract
Tumors often harbor orders of magnitude more mutations than healthy tissues. The increased number of mutations may be due to an elevated mutation rate, to frequent cell death and correspondingly rapid cell turnover leading to an increased number of cell divisions and more mutations, or to some combination of both mechanisms. It is difficult to disentangle the two based on widely available bulk sequencing data, where mutations from individual cells are intermixed and, as a result, the cell lineage tree of the tumor cannot be resolved. Here we present a method that can simultaneously estimate the cell turnover rate and the rate of mutations from bulk sequencing data by averaging over ensembles of cell lineage trees parameterized by cell turnover rate. Our method works by simulating tumor growth and matching the observed data to these simulations by choosing the best-fitting set of parameters according to an explicit likelihood-based model. Applying it to a real tumor sample, we find that both the mutation rate and the intensity of death are high.

Author Summary
Tumors frequently harbor an elevated number of mutations compared to healthy tissue. These extra mutations may be generated either by an increased mutation rate or by the presence of cell death, which results in increased cellular turnover and additional cell divisions for tumor growth. Separating the effects of these two factors is a nontrivial problem. Here we present a method which can simultaneously estimate cell turnover rate and genomic mutation rate from bulk sequencing data. Our method is based on the maximum likelihood estimation of the parameters of a generative model of tumor growth and mutations. Applying our method to a human hepatocellular carcinoma sample reveals an elevated per cell division mutation rate and high cell turnover.

1 Introduction
Cancer is an evolutionary phenomenon within a host organism that unfolds on the timescale of years or more. New mutations can appear with each cell division, while cells can also die for reasons such as lack of nutrients or immune reactions. Due to the limitations of bulk sequencing, which only assays mutation frequencies for a population of cells from each tumor sample and does not resolve the genotypes of individual cells, basic evolutionary parameters, in particular the cell turnover rate and the per cell division mutation rate, remain unknown, with estimated values spanning several orders of magnitude [1].

While tumors can contain a large number of mutations, it is not clear whether this is due to an elevated mutation rate or frequent cell death, as frequent cell death results in more cell divisions, which, in turn, give rise to more mutations. There are arguments for both cases [2, 3, 4, 5, 6], but distinguishing between these two alternatives is difficult because we cannot resolve the tumor's cell lineage tree from bulk sequencing data. In previous work [2, 1], an elevated number of mutations was observed, but only the combined effect of the mutation rate and the death rate could be estimated. Williams et al. [7] targeted the problem of separating these two quantities by separately sequencing in bulk multiple samples from the same tumor, thus resolving a coarse-grained cell lineage tree.
However, it is not clear whether this approach resolves the cell lineage tree in sufficient detail to identify the regime of frequent cell death, in which the number of mutations is orders of magnitude larger than under growth without cell death.

Here, we describe a method to simultaneously estimate the per cell division mutation rate and the turnover rate (the ratio of death and birth rates) of a tumor from bulk sequencing data. The estimation is based on a maximum likelihood fit of the parameters of a birth-death model to the measured mutant and wild-type read counts. While requiring only a single tumor-normal sample pair, the fitting procedure can differentiate between death rates that are extremely close to the critical value where the birth rate equals the death rate, resulting in accurate estimation of the mutation rate across orders of magnitude.

The rest of the paper is structured as follows. After introducing our model and the fitting procedure in Sec. Methods, we assess model accuracy on simulated data in Sec. Results on synthetic data. Results on empirical data are described in Sec. Results on empirical data, and conclusions are given in Sec. Discussion.

2 Methods

Model
We describe the evolution of tumor cells with the cell lineage tree, i.e. the bifurcating tree traced out by cell divisions. As cells that have died cannot be observed by sequencing, we consider the tree spanned by surviving cells. The leaves of this tree correspond to extant cells and internal nodes to observed cell divisions.
To model the descent of the extant cells, we employ the conditioned birth-death process with birth rate $\alpha$, death rate $\beta$, and a fixed number $n$ of sampled cells [8]. We measure branch lengths in numbers of expected birth events (the product of birth rate and time). Consequently, the birth rate $\alpha$ acts as a scaling constant that sets the unit of time, and we set it equal to 1 without loss of generality. As a result, the death rate determines the turnover rate: $t = \beta/\alpha = \beta$.

Mutations occur with a rate $\mu$ per site per cell division. Mutations are considered neutral, and we neglect the probability that a site is hit by a mutation more than once, in accordance with the infinite sites hypothesis.

The data available from bulk sequencing are the mutant and wild-type read counts of sites. Therefore, we use the site frequency spectrum, which can be estimated from read count data, to separate the effects of the mutation rate and cell death intensity. The site frequency spectrum reflects the branch length distribution of the tumor's cell lineage tree. The tree's leaves are the sequenced cells, and its root is the most recent common ancestor of these cells. Changing the turnover rate modifies the shape of the cell lineage tree by changing the relative lengths of branches closer to the root compared to terminal ones, and as a result modifies the site frequency spectrum. Changing the mutation rate, on the other hand, simply results in more mutations, thus leaving the shape of the tree, and by extension the overall shape of the site frequency spectrum, unchanged (see Fig. 1). It should be noted, however, that the information we use is more detailed than the site frequency spectrum, namely, the read count pairs of mutated sites, which contain more information than just one rational number.
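The argument that the mutation rate only rescales the spectrum while the tree shape changes it can be illustrated with a minimal sketch. The representation of a tree as a list of (branch length, descendant fraction) pairs, the function names, and the two 4-leaf toy trees are hypothetical and are not taken from the paper's software.

```python
from collections import defaultdict

def expected_spectrum(branches, mu):
    """Expected number of mutations per site at each clone fraction.

    branches: list of (branch length, fraction of sampled cells descending
    from the branch). A mutation on a branch is carried by that fraction of
    cells, and the expected count on the branch is mu * length.
    """
    spectrum = defaultdict(float)
    for length, fraction in branches:
        spectrum[fraction] += mu * length
    return dict(spectrum)

def normalized(spectrum):
    """Rescale a spectrum so that its weights sum to one (its 'shape')."""
    total = sum(spectrum.values())
    return {fraction: weight / total for fraction, weight in spectrum.items()}

# Hypothetical 4-leaf tree: two internal branches (fraction 0.5)
# and four terminal branches (fraction 0.25), all of unit length.
tree_a = [(1.0, 0.5), (1.0, 0.5),
          (1.0, 0.25), (1.0, 0.25), (1.0, 0.25), (1.0, 0.25)]
# Same topology, but the internal branches are three times longer.
tree_b = [(3.0, 0.5), (3.0, 0.5),
          (1.0, 0.25), (1.0, 0.25), (1.0, 0.25), (1.0, 0.25)]

shape_low_mu = normalized(expected_spectrum(tree_a, 1e-8))
shape_high_mu = normalized(expected_spectrum(tree_a, 1e-6))
shape_other_tree = normalized(expected_spectrum(tree_b, 1e-8))
```

In this toy case the two mutation rates give the same normalized spectrum for `tree_a`, whereas `tree_b` shifts weight toward the higher clone fraction.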
For example, 10 mutant reads out of 30 reads and 1 out of 3 both contribute one mutation to the frequency 1/3, while the uncertainty of the first case is significantly lower than that of the second case. Using read count pairs also makes it quite straightforward to include nucleotide-dependent transition probabilities or trinucleotide context-based effects.

Site frequency spectra derived from tumors also contain the effects of the ploidy of the sites and the contamination of the sample by normal cells. The corresponding spectrum is termed the Variant Allele Frequency (VAF) spectrum. VAFs are also affected by the finite sequencing depth, which gives rise to a stochastic variation in the observed allele frequencies.

Throughout the paper, the following notation is used: branches of the cell lineage tree are denoted by the index $k$, the length of branch $k$ is denoted by $l_k$, and $L = \sum_k l_k$ denotes the sum of all branch lengths in the tree. The numbers of mutations per site from cell divisions along branch $k$ are Poisson distributed, and their sum is also Poisson distributed. The number of expected cell divisions along branch $k$ is $l_k$; therefore, the number of mutations per site on branch $k$ follows a Poisson distribution with parameter $l_k \mu$. Similarly, the total number of mutations is Poisson distributed with parameter $L \mu n_{\text{sites}}$.

Inference
To compare different combinations of mutation and turnover rates describing the observed empirical data, we employ a maximum likelihood approach. First, we derive the likelihood of the observed data, $L(D \mid \mu, t)$, as a function of the mutation rate and the turnover rate. As described below, we maximize this likelihood function averaged over a random sample of cell lineage trees with fixed turnover rate $t$ in order to estimate the parameters $t$ and $\mu$ that are most likely to have generated the observed data.

Figure 1: Two possible scenarios for the generation of mutations along cell lineage trees. a) Different turnover rates lead to different lineage tree shapes. Bifurcations are cell divisions, leaves are cells comprising the bulk sequencing sample. Note that the (surviving) tree topologies are the same, only branch lengths differ. b) Mutations, symbolized by purple stars, accumulate at cell divisions. A high turnover rate with a low mutation rate can lead to the same number of observed mutations as a low turnover rate with a high mutation rate; however, the mutation spectra of the trees are different. c) For simulated trees of 10000 leaves, the differences in the branch length distribution are clearly visible. d) VAFs of the mutation spectra. Fractions of mutant cells are binned (note the logarithmic scale). Ploidy is set to two, contamination is zero. Simulated sequencing depth is 1000.

First we derive $L(D \mid \mu, T)$, the likelihood of the observed data for a fixed cell lineage tree $T$.
It is assumed that sites collect mutations independently of each other; consequently, $L(D \mid \mu, T)$ takes the form of a product over sites:

$$L(D \mid \mu, T) = \prod_{i \in \text{sites}} p(m_i \mid \mu, T, r_i) \tag{1}$$

where $m_i$ is the number of reads exhibiting a mutation at site $i$, and $r_i$ is the total number of reads covering site $i$. To calculate the probability of observing $m_i$ mutant reads out of a total of $r_i$ reads we consider the following alternatives: i) if $m_i = 0$, either a mutation occurred, with probability $P_{\text{mut}}(\mu, L) = 1 - \exp(-\mu L)$ (see also Sec. Methods), but no mutant read was observed out of $r_i$ reads, with probability $F(0, r_i, T)$, or no mutation occurred, with probability $1 - P_{\text{mut}}(\mu, L)$; or ii) a mutation occurred with probability $P_{\text{mut}}(\mu, L)$ and $m_i > 0$ mutant reads were observed out of $r_i$, with probability $F(m_i, r_i, T)$:

$$p(m_i \mid \mu, T, r_i) = \begin{cases} P_{\text{mut}}(\mu, L) \cdot F(0, r_i, T) + \big(1 - P_{\text{mut}}(\mu, L)\big), & m_i = 0 \\ P_{\text{mut}}(\mu, L) \cdot F(m_i, r_i, T), & m_i > 0 \end{cases} \tag{2}$$

To compute the probability $F(m, r, T)$ of observing $m$ mutant reads out of $r$ total reads given the cell lineage tree $T$, we assume that the mutant reads descend from a single mutation that occurred at some point along branch $k$, which has length $l_k$ and from which a fraction $f_k$ of sequenced cells descend, and take the sum over all branches:

$$F(m, r, T) = \sum_k \frac{l_k}{L} \cdot \mathrm{Binom}(m, r, f_k) = \sum_k \frac{l_k}{L} \cdot \binom{r}{m} \cdot f_k^{m} (1 - f_k)^{r - m}, \tag{3}$$

where $L = \sum_k l_k$ and $\mathrm{Binom}(m, r, f_k)$ is the probability mass function of the binomial distribution, i.e. the probability of getting exactly $m$ successes in $r$ independent Bernoulli trials with a probability of success $f_k$. We consider multiple mutations at the same site as a single mutation, and neglect all subsequent mutations after the first one. In all our applications we verified that $\mu L \ll 1$ is fulfilled.

Sequencing errors can lead to an excess of false mutant reads. To account for them, we introduce a parameter $\varepsilon$ denoting the probability of a sequencing error at each position of each read.
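Before sequencing errors are added, the error-free likelihood of Eqs. 1 to 3 can be sketched in a few lines. This is only an illustration: the tree is assumed to be given as a list of (branch length, descendant fraction) pairs, and all function names are hypothetical rather than part of the authors' implementation.

```python
import math

def binom_pmf(m, r, f):
    """Probability of m successes in r Bernoulli trials with success probability f."""
    return math.comb(r, m) * f**m * (1.0 - f)**(r - m)

def read_probability(m, r, branches):
    """Eq. 3: branch-length-weighted mixture of binomials, F(m, r, T)."""
    total_length = sum(length for length, _ in branches)
    return sum(length / total_length * binom_pmf(m, r, fraction)
               for length, fraction in branches)

def site_likelihood(m, r, mu, branches):
    """Eq. 2: probability of m mutant reads out of r at one site."""
    total_length = sum(length for length, _ in branches)
    p_mut = 1.0 - math.exp(-mu * total_length)
    p = p_mut * read_probability(m, r, branches)
    if m == 0:
        # No mutant read is also what an unmutated site produces.
        p += 1.0 - p_mut
    return p

def log_likelihood(data, mu, branches):
    """Logarithm of Eq. 1; data is a list of (m_i, r_i) pairs."""
    return sum(math.log(site_likelihood(m, r, mu, branches)) for m, r in data)
```

Working with the logarithm of Eq. 1 avoids numerical underflow when the product runs over many sites.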
For $\varepsilon > 0$ new possibilities arise: mutant reads can be real mutants or false mutants due to sequencing errors. It is also possible that all mutant reads at a site are false mutants, and there may or may not be a real mutation. We neglect the case when two or more mutations happen at the same site. Each position in each read can now be either wild type, false wild type, real mutant or false mutant:

$$p(m_i \mid \mu, T, r_i, \varepsilon) \approx P_{\text{mut}}(\mu, L) \sum_k \frac{l_k}{L} \frac{r_i!}{(r_i - m_i)!\, m_i!} \big( f_k (1 - \varepsilon) + (1 - f_k)\varepsilon \big)^{m_i} \big( (1 - f_k)(1 - \varepsilon) + f_k \varepsilon \big)^{r_i - m_i} + \big(1 - P_{\text{mut}}(\mu, L)\big) \binom{r_i}{m_i} \varepsilon^{m_i} (1 - \varepsilon)^{r_i - m_i} \tag{4}$$

Note that Eq. 4 contains the $m_i = 0$ case.

Finally, we introduce the ability to differentiate mutant types, to conform to the case of real DNA, which has 3 possible mutant types. So far, it was assumed that each site can have 2 states, wild type or mutant, corresponding to a DNA consisting of only two types of nucleotides instead of four. Therefore, instead of the mutant read count $m$, we introduce three mutant read counts, corresponding to the three possible mutant types: $m^{(1)}, m^{(2)}, m^{(3)}$. Consequently, the input data now consist of triplets of mutant read counts instead of one scalar mutant read count. This leads to the use of a multinomial distribution with four states: wild type and 3 mutant types. The possibility of more than one real mutant type at the same site is still neglected, being very rare: technically, it is a second-order process in the mutation probability of one site. We also neglect the probability of more than one error hitting the same site.
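The two-state likelihood with sequencing errors (Eq. 4) can be sketched as follows. As an assumption for illustration, the tree is passed as a list of (branch length, descendant fraction) pairs; the function names are hypothetical.

```python
import math

def binom_pmf(m, r, p):
    """Probability of m successes in r Bernoulli trials with success probability p."""
    return math.comb(r, m) * p**m * (1.0 - p)**(r - m)

def site_likelihood_with_error(m, r, mu, eps, branches):
    """Eq. 4: probability of m mutant reads out of r, with per-read error rate eps."""
    total_length = sum(length for length, _ in branches)
    p_mut = 1.0 - math.exp(-mu * total_length)
    # With a real mutation on a branch of fraction f, a read looks mutant if it
    # is a correctly read mutant or a misread wild type.
    with_mutation = sum(
        length / total_length
        * binom_pmf(m, r, fraction * (1.0 - eps) + (1.0 - fraction) * eps)
        for length, fraction in branches
    )
    # Without a real mutation, every mutant read is a sequencing error.
    without_mutation = binom_pmf(m, r, eps)
    return p_mut * with_mutation + (1.0 - p_mut) * without_mutation
```

Setting `eps = 0.0` recovers Eq. 2: the error-only term then contributes only when `m == 0`.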
The likelihood function at a single site is then

$$\begin{aligned} p\big(m_i^{(1)}, m_i^{(2)}, m_i^{(3)} \mid \mu, T, r_i, \varepsilon\big) \approx{} & P_{\text{mut}}(\mu, L) \sum_k \frac{l_k}{L} \frac{1}{3} \sum_{j=1}^{3} \mathrm{Mult}\Big( \big( m_i^{(j)}, m_i^{(j+1)}, m_i^{(j+2)}, r_i - \textstyle\sum_{j'} m_i^{(j')} \big);\; r_i;\; \big( p_m^{(j)}(f_k, \varepsilon), p_m^{(j+1)}(f_k, \varepsilon), p_m^{(j+2)}(f_k, \varepsilon), p_w(f_k, \varepsilon) \big) \Big) \\ & + \big(1 - P_{\text{mut}}(\mu, L)\big) \cdot \mathrm{Mult}\Big( \big( m_i^{(1)}, m_i^{(2)}, m_i^{(3)}, r_i - \textstyle\sum_{j} m_i^{(j)} \big);\; r_i;\; \big( p_m^{(j)}(0, \varepsilon), p_m^{(j+1)}(0, \varepsilon), p_m^{(j+2)}(0, \varepsilon), p_w(0, \varepsilon) \big) \Big) \end{aligned} \tag{5}$$

and

$$p_m^{(j)}(f_k, \varepsilon) = f_k (1 - \varepsilon) + (1 - f_k)\,\varepsilon/3 \tag{6}$$

$$p_m^{(j+1)}(f_k, \varepsilon) = f_k\,\varepsilon/3 + (1 - f_k)\,\varepsilon/3 \tag{7}$$

$$p_m^{(j+2)}(f_k, \varepsilon) = f_k\,\varepsilon/3 + (1 - f_k)\,\varepsilon/3 \tag{8}$$

$$p_w(f_k, \varepsilon) = f_k\,\varepsilon/3 + (1 - f_k)(1 - \varepsilon) \tag{9}$$

where $(j+1)$ and $(j+2)$ denote the other two possible mutant types with cyclic notation $(j) = (j+3)$, and $\mathrm{Mult}$ is a multinomial distribution whose arguments denote (random variables); number of draws; (event probabilities). In the second term of Eq. 5 all three mutant probabilities equal $\varepsilon/3$ at $f_k = 0$, so the choice of $j$ there is immaterial. The factor of 1/3 is due to the assumption that only one true mutation can be present at one site, and each of the 3 possible mutated forms has the probability 1/3. Each $\mathrm{Mult}(m^{(j)} \ldots)$ term is a conditional probability conditioned on the true mutant type being $(j)$.

The straightforward approach for treating the unknown cell lineage tree as a nuisance parameter would be to average over all trees $T$:

$$L(D \mid \mu, t) = \sum_T L(D \mid \mu, T) \cdot p_{BD}(T \mid t), \tag{10}$$

where $p_{BD}(T \mid t)$ is the probability of the cell lineage tree $T$ given the conditioned birth-death process with turnover rate $t$.
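The single-site likelihood with three mutant types (Eqs. 5 to 9), which enters $L(D \mid \mu, T)$, can be sketched as below. The tree is again assumed to be a list of (branch length, descendant fraction) pairs, and the helper names are hypothetical; the multinomial is evaluated in log space so that realistic read depths do not overflow.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given event probabilities `probs`."""
    log_p = math.lgamma(sum(counts) + 1)
    for count, prob in zip(counts, probs):
        if count == 0:
            continue
        if prob == 0.0:
            return 0.0
        log_p += count * math.log(prob) - math.lgamma(count + 1)
    return math.exp(log_p)

def event_probs(f, eps):
    """Eqs. 6-9: (true mutant type, other type, other type, wild type)."""
    p_true = f * (1.0 - eps) + (1.0 - f) * eps / 3.0
    p_other = f * eps / 3.0 + (1.0 - f) * eps / 3.0
    p_wild = f * eps / 3.0 + (1.0 - f) * (1.0 - eps)
    return (p_true, p_other, p_other, p_wild)

def site_likelihood_three_types(m, r, mu, eps, branches):
    """Eq. 5 for mutant read counts m = (m1, m2, m3) out of r reads."""
    total_length = sum(length for length, _ in branches)
    p_mut = 1.0 - math.exp(-mu * total_length)
    wild = r - sum(m)
    with_mutation = 0.0
    for length, fraction in branches:
        probs = event_probs(fraction, eps)
        for j in range(3):
            # Condition on mutant type j being the true one (cyclic ordering).
            counts = (m[j], m[(j + 1) % 3], m[(j + 2) % 3], wild)
            with_mutation += length / total_length / 3.0 * multinomial_pmf(counts, probs)
    without_mutation = multinomial_pmf((m[0], m[1], m[2], wild), event_probs(0.0, eps))
    return p_mut * with_mutation + (1.0 - p_mut) * without_mutation
```

A quick sanity check on such a sketch is that the probabilities of all count triplets with $m^{(1)} + m^{(2)} + m^{(3)} \le r$ sum to one.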
Due to the very large number of possible trees, however, the above average is intractable, and we must resort to sampling a finite number of trees drawn from the conditioned birth-death process with fixed t. Based on empirical experience, we found that using the geometric mean of L(D|µ,T) for a finite sample of trees sampled according to p_BD(T|t) results in more robust inference. The geometric mean approximates the average probability of inference [29] or equivalently the average surprisal [30] of cell lineage trees given the turnover rate t, which we denote

\[ \bar{L}(D \mid \mu, t) = \prod_T L(D \mid \mu, T)^{\,p_{BD}(T \mid t)} = \exp\Bigl( \sum_T p_{BD}(T \mid t) \ln L(D \mid \mu, T) \Bigr). \tag{11} \]

In practice, during inference of the turnover rate t and mutation rate µ, the log-average over a finite number of trees drawn from the conditioned birth-death process with turnover rate t is maximized:

\[ \ln \bar{L}(D \mid \mu, t) = \frac{1}{n_{\mathrm{trees}}} \sum_T \ln L(D \mid \mu, T) \tag{12} \]

Generating trees

To generate cell division trees from the conditioned birth-death process, we use the ELynx software suite [11], which allows freely adjustable birth rates, death rates, and tree sizes.

Generating synthetic samples

For generating synthetic samples of read counts of mutated DNA sites, trees simulated by ELynx are used as genealogical trees of hypothetical tumors. For each site, we first determine the total readcount at that site. Then, we draw random numbers to check whether any of the branches contributes a mutation, according to the Poisson process described in Sec. Inference. If there is a mutation, the true mutant readcount is drawn from a hypergeometric distribution. The number of successes of the hypergeometric distribution is the number of leaves of the selected branch. The number of failures is the total number of leaves multiplied by ploidy and divided by the hypothetical purity of the sample, minus the number of successes. The number of trials is the total readcount.
Finally, errors are introduced by drawing a quadruplet of readcounts ("wildtype errors") from a multinomial distribution with probabilities (ε/3, ε/3, ε/3, 1 − ε), where the number of trials is the wildtype readcount. "Mutant errors" are drawn from another multinomial distribution, with probabilities (1 − ε, ε/3, ε/3, ε/3), where the number of trials is the mutant readcount. The final readcounts are given by the sum of the two drawn quadruplets. The mutation rates for different turnover rates are chosen such that the total numbers of observed real mutations remain close to each other, i.e., the estimation algorithm has a similar amount of input data in each case.

Calculating the likelihood

The goal is to find the maximum of the likelihood as a function of the mutation rate, turnover rate, and error rate, to be able to use it for estimating the mutation and turnover rates by Eq. 10. The input is the read counts of the DNA sites. We use pre-generated division trees from the ELynx suite at pre-determined turnover rate values. Between these pre-determined turnover rate values, the likelihood function is interpolated using cubic splines. In the case of synthetic input datasets, the tree used to generate the test dataset is never included in the likelihood calculation. The maximum of the likelihood function for each fixed turnover rate value is obtained by optimizing the error rate using Brent's method, implemented in Julia's Optim package, and estimating the mutation rate from the number of input mutations and the branch lengths of the currently fitted tree.
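The synthetic read-count generation described under "Generating synthetic samples" (a hypergeometric draw of true mutant reads followed by multinomial error injection) can be sketched for a single mutated site as follows. This is a hedged illustration using only the Python standard library, not the authors' code: the function names, the default ploidy of 2, the rounding of the failure count to an integer, and the ordering of the "mutant error" quadruplet as (true mutant type, other mutant type, other mutant type, wild type) are assumptions.

```python
import random


def draw_hypergeometric(rng, successes, failures, trials):
    """Draw `trials` items without replacement; return the number of successes."""
    hits = 0
    for _ in range(trials):
        if rng.random() < successes / (successes + failures):
            successes -= 1
            hits += 1
        else:
            failures -= 1
    return hits


def add_errors(rng, true_mutant, wildtype, eps):
    """Return (true mutant type, other type, other type, wild type) read counts."""
    observed = [0, 0, 0, 0]   # indices 0-2: mutant types, index 3: wild type
    # "Wildtype errors": probabilities (eps/3, eps/3, eps/3, 1 - eps).
    wild_weights = [eps / 3, eps / 3, eps / 3, 1 - eps]
    for state in rng.choices(range(4), weights=wild_weights, k=wildtype):
        observed[state] += 1
    # "Mutant errors": probabilities (1 - eps, eps/3, eps/3, eps/3);
    # assumed order: true type, two other mutant types, wild type.
    mutant_weights = [1 - eps, eps / 3, eps / 3, eps / 3]
    for state in rng.choices(range(4), weights=mutant_weights, k=true_mutant):
        observed[state] += 1
    return tuple(observed)


def simulate_site(rng, branch_leaves, total_leaves, depth, eps,
                  ploidy=2, purity=1.0):
    """Simulate the observed read counts at one site carrying a real mutation."""
    successes = branch_leaves
    failures = round(total_leaves * ploidy / purity) - successes
    true_mutant = draw_hypergeometric(rng, successes, failures, depth)
    return add_errors(rng, true_mutant, depth - true_mutant, eps)
```

For example, `simulate_site(random.Random(1), 200, 1000, 500, 1e-3)` returns a quadruplet whose entries sum to the read depth of 500. The sketch requires the read depth not to exceed the number of successes plus failures.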
Only mutations having read counts high enough to exclude sequencing errors are taken into account in the mutation rate estimation. The estimated mutation rate is averaged over trees, using uniform weights. Therefore, the likelihood function is optimized for ε at different t values, which come from a pre-defined set, for which the trees can be generated in advance, avoiding the need for new trees at each step of the optimization process.

3 Results on synthetic data

No sequencing errors

Fig. 2 shows the estimated turnover rates and mutation rates as functions of the true turnover rates and mutation rates. The method can reasonably differentiate between datasets with different true turnover rate-mutation rate pairs and estimate their values. Fig. 3 shows the joint estimation of mutation rate-turnover rate parameter pairs. The data points are arranged into lines, corresponding to constant numbers of observed mutations, obeying

\[ N^{\mathrm{obs}}_{\mathrm{mut}}(\mu, t) = \mu \cdot E\Bigl( \sum_k l^{(k)} \cdot \mathrm{ccdf}\bigl(\mathrm{Binom}(n_{\mathrm{seq}}, f_k), 0\bigr) \Bigr)_{\mathrm{trees}(t)} \tag{13} \]

where ccdf(...) is the complementary cumulative distribution function of a binomial distribution, evaluated at 0. The parameters of the binomial distribution are the average sequencing depth n_seq and the fraction of leaves under branch k, f_k. The expected value is taken over the trees generated with turnover rate t. The lines defined by Eq. 13 can be numerically approximated for any (µ, t) point by averaging over a number of trees.

We checked the dependency of the results on the sizes of the trees. According to Fig. 4, estimation of large true turnover rates becomes increasingly harder as the tree sizes decrease. This can be attributed to the fact that differences
between the effects of very high turnover rates are observable on branches having a very small relative number of leaves; therefore, large trees are required to distinguish between high turnover rates.

Figure 2: Estimated turnover rates (left) and mutation rates (right) for different true values. [Plot axes, logarithmic: estimated 1−t versus dataset 1−t (left); estimated µ versus dataset µ (right).] 10 synthetic datasets for each true value. The trees used for fitting have 10000 leaves; 10000 trees were used for each t value of the loglikelihood(t) measurements (see Fig. 7). The continuous line is a guide for the eye, corresponding to y = x. Points are slightly dispersed horizontally for clarity. Horizontal ordering of the data points is the same for both subplots, e.g., the rightmost point in each group of points corresponds to dataset no. 10 in both plots.

Figure 3: Joint estimations of mutation rate-turnover rate parameter pairs. 10 synthetic datasets for each true parameter pair, each of which is denoted by one color. True parameter values are indicated by large full circles. Solid lines show the numerical approximation of µ(1 − t), for N^obs_mut = 5 · 10^5, 5 · 10^4, 5 · 10^3.

Figure 4: The effect of tree sizes on the estimations.
Estimated turnover rates for different fitted tree sizes: 100 (top left), 1000 (top right), 10000 (bottom left), 100000 (bottom right).

The accuracies of the estimates for different datasets are not equal, even apart from the effect of tree size at high turnover rates. Differences between estimates for different datasets can be due to 3 possible factors: the trees used in the fitting process, the generated input data, or the tree used for generating the data. To check these factors, we chose a dataset which resulted in a turnover rate estimate deviating from the true value (dataset no. 7 for true 1 − t = 1 in Fig. 2, estimated as 1 − t = 0.47). We calculated the estimated turnover rate values using 10 independent sets of 1000 trees. The estimates ranged from 1 − t = 0.39 to 0.54. Therefore, the deviation of the estimate from 1.0 cannot be attributed to the sample of fitting trees. Then, we generated 10 more datasets using the same tree as for the original dataset. The estimated turnover rate values were in the range 1 − t ∈ [0.44, 0.49], even more closely matching the original estimate. Consequently, the effect does not depend on the generated data but on the tree used to generate the data. It seems that the deviation of the estimates from the true turnover rates is due to the fluctuation of the shapes of the trees used for sample generation.
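As a numerical aside to Eq. 13, the expected number of observed mutations can be approximated by averaging over a sample of trees. The sketch below is only an illustration under assumptions: the function name and the representation of each tree as a list of (l_k, f_k) pairs are hypothetical. It uses the fact that the complementary cumulative distribution function of Binom(n_seq, f_k) evaluated at 0 is the probability of at least one mutant read, 1 − (1 − f_k)^n_seq.

```python
def expected_observed_mutations(mu, trees, n_seq):
    """Approximate Eq. 13 by a uniform average over a sample of trees.

    mu     -- mutation rate
    trees  -- list of trees, each a list of (l_k, f_k) pairs
              (branch length, fraction of leaves under the branch)
    n_seq  -- average sequencing depth
    """
    per_tree = []
    for branches in trees:
        # P(Binom(n_seq, f_k) > 0) = 1 - (1 - f_k) ** n_seq
        per_tree.append(sum(l_k * (1.0 - (1.0 - f_k) ** n_seq)
                            for l_k, f_k in branches))
    return mu * sum(per_tree) / len(per_tree)
```

Evaluating this function on trees generated at several turnover rates, and solving for the µ that keeps the result constant, would trace out lines of the kind shown in Fig. 3.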
Effects of sequencing errors

To estimate the effect of sequencing errors, we estimated the turnover rates while applying different amounts of errors to the same data (exactly the same mutant and wild type readcounts for each mutation). In this case, the error rate of the data was also estimated by the fitting procedure, along with the mutation and turnover rates. The influence of sequencing errors on the estimation of the turnover rates is shown in Fig. 5. For an error rate of 10^−3, which is frequently cited as the error rate of the Illumina sequencing technology [13], the estimated turnover rates can deviate significantly from the true values. For lower error rates, the estimates approach the true values; however, outliers remain even for ε = 10^−8. We note that the loglikelihoods of the false estimates were better than those of the true values. We tried estimating the parameters while leaving out the least frequent mutations, to reduce the effect of errors, but the estimated parameters then deviated significantly more from their true values.

4 Results on empirical data

To estimate the turnover and mutation rates of real tumors, a real human tumor sample is required. Due to the fitting method's sensitivity to high sequencing error rates, we need a sample sequenced using a technology with a very low error rate. Such samples are much less common than those produced by the standard technology, and are usually restricted to very short genome segments, mostly nonhuman. Nevertheless, we found a sample of a human hepatocellular carcinoma (HCC) [14], which was sequenced using the o2n sequencing technology [15], providing error rates between 10^−5 and 10^−8, significantly lower than the 10^−3 rate of the standard Illumina process.
Besides the low error rate, the amount of sequenced positions is enough to cover the targeted region 3400x [15], which is also much better than in standard quality datasets (typical sequencing depth is around 30). High sequencing depth results in more identified mutations and more precise mutation frequencies. Sequencing was targeted at a 410k basepair wide region of the genome, which is much narrower than the whole human genome. This is a typical shortcoming of low error rate sequencing methods. Still, as our fitting procedure is much more sensitive to the sequencing error rate than to the amount of input mutations (compare the deviations in Figs. 3 and 5), the dataset provides significantly better input than typical sequencing data.

The raw sequencing data was preprocessed according to [15], using the code provided by the authors. The DNA contents of 10000 cells were sequenced [14], along with a sample of neighboring normal tissue. Mutations were called using VarScan 2 [16], which is flexible and easy to adapt to the requirements of the fitting procedure. 923383 sites remained after preprocessing, with a sequencing depth of at least 8 (the default for VarScan 2) in both tumor and normal samples. The distribution of sequencing depths is wide, ranging from 8 to 10853, with a mean of 904. For mutation calling by VarScan 2, the minimum number of mutant reads was set to 1 and the strand filter was switched off. Although the number of false positives increases with these parameter choices, the resulting called mutations
correspond better to the error model of our fitting procedure than an error rate which changes sharply with threshold frequency or readcount values. The minimum variant frequency was set to 10^−6 to include even the least frequent mutation. Purity was set to 0.85, in accordance with [14]. We also checked that the default somatic p-value threshold does not exclude any candidate somatic mutations. Other parameter settings were left at their default values. Mutation frequencies were corrected for copy number variation (CNV), using VarScan 2 with default parameters, and for ploidy of the sex chromosomes. CNV detection for targeted sequencing data is a more difficult task than for whole genome data, and VarScan 2 was found to be a stable performer [17].

Figure 5: The effect of varying the error rate. Sequencing error rates are ε = 10^−3 (top left), 10^−4 (top right), 10^−5 (middle left), 10^−6 (middle right), 10^−7 (bottom left), 10^−8 (bottom right). 10000 trees, tree size = 10000. X coordinates are slightly dispersed for clarity. Open circles are results corresponding to error rates used in the fit fixed to their true values; crosses correspond to error rates estimated by the parameter fit. Vertical lines show the ranges between the first and last 10-quantiles, based on 10000 beta value estimations. Each open circle-cross pair corresponding to the same dataset is vertically aligned.
Sites having multiple variant types (i.e., number of reads of wildtype plus most frequent mutant type being lower than the sequencing depth) were checked manually. Readcounts of all 4 possible genotypes were identified for all variant sites. After all these steps, 2284 mutations were identified. The variant allele frequency spectrum is shown in Fig. 6.

Figure 6: Variant allele frequency spectrum of a human hepatocellular carcinoma sample [14], obtained by o2n sequencing [15]. [Plot axes: count versus mutation frequency, shown with a linear and a logarithmic count axis.]

To estimate the sequencing error rate, the fitting procedure was applied to the mutation data with various fixed error rates in the range 10^−8 to 10^−4. The maximum of the loglikelihood corresponded to ε = 10^−7. This is a plausible value, as [15] estimated the error rate to be between 10^−5 and 10^−8. Based on estimating 10000 best error rates, the error rate should be between 9.3 · 10^−8 and 1.2 · 10^−7.

Having determined the error rate, we estimated the mutation and turnover rates, with the error rate fixed, using 10^5 trees. Fig. 7 shows the estimated turnover rate, 1 − t = 1.09 · 10^−3, and the mutation rate, µ = 9.26 · 10^−8 per site per cell division. Corresponding to the range of the error rate, the turnover rate ranges within 1.09 · 10^−3 to 1.10 · 10^−3. The mutation rate ranges between 9.23 · 10^−8 and 9.29 · 10^−8. Neglecting mutations over frequency 0.46 does not alter the results. For illustration, Fig. 8 shows the VAF of a synthetic sample, generated using the tree that best fits the empirical data.

The estimated mutation rate is rather high compared to estimates of the order of 10^−9 to 10^−10 per site per cell division for healthy human somatic tissues [18]. Compared with mutation rates of tumors, however, it is not an exceptional value [1]. Meanwhile, the turnover rate is also high, with the death rate being very close to the birth rate.
Possible causes include the effect of the immune system, the deleterious nature of driver mutations or competition for resources among tumor cells. In conclusion, for this tumor sample, the high number of mutations is due to a combination of an elevated mutation rate and a high turnover rate.

The results allow estimating the number of cell division rounds from the founding cell to the biopsied tumor. The average height of simulated trees with the estimated parameters is 2009 cell divisions. It should be noted that a naive estimation of tree height using log2(2.7 · 10^9) successive branches of average length 1/(1.1 · 10^−3) is wrong, due to the very different shapes of surviving trees compared to all trees, most of which go extinct before reaching 1/(1.1 · 10^−3) size.

Figure 7: Loglikelihood-turnover rate and loglikelihood-mutation rate curves of the HCC data. Interpolation between data points is by cubic splines. [Plot axes: loglikelihood versus µ.]

Figure 8: VAF of a synthetic sample, generated using the tree which fits the empirical data the best. The grey outline shows the empirical VAF. [Plot axes: count versus mutation frequency.]

It is also possible to estimate the lifetime of the HCC sample and the cell division rate of the HCC tumor. The diameter of the tumor is 35 mm, while the length of an HCC cell is 25 µm [14]. This gives a total number of 2.7 · 10^9 cells in the whole tumor. The median HCC tumor volume doubling time is 86 days [19].
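The arithmetic behind these lifetime and division-rate estimates can be reproduced with a few lines of Python. The sketch rests on assumptions that are inferred rather than stated in the text: the cell count is taken as the cube of the ratio of tumor diameter to cell length (which reproduces 2.7 · 10^9), the doubling time is treated as constant over the whole lifetime, and the division rate is taken as the average tree height of 2009 divisions spread over that lifetime. The function name is hypothetical.

```python
from math import log2


def tumor_timeline(diameter_mm=35.0, cell_length_um=25.0,
                   doubling_time_days=86.0, tree_height_divisions=2009.0):
    """Back-of-the-envelope lifetime and division-rate estimate."""
    # Assumption: cell count = (diameter / cell length) cubed.
    n_cells = (diameter_mm * 1000.0 / cell_length_um) ** 3
    # Assumption: constant volume doubling time from one cell onward.
    lifetime_days = log2(n_cells) * doubling_time_days
    lifetime_years = lifetime_days / 365.25
    # Assumption: tree height in divisions is spread evenly over the lifetime.
    hours_per_division = lifetime_days * 24.0 / tree_height_divisions
    return n_cells, lifetime_years, hours_per_division


n_cells, lifetime_years, hours_per_division = tumor_timeline()
print(f"{n_cells:.2e} cells, {lifetime_years:.1f} years, "
      f"{hours_per_division:.0f} h per division")
```

With the default values, the estimate gives a little over 7 years and roughly 32 hours per cell division.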
Based on these figures, the lifetime of the analysed sample is around 7 years, and the cell division rate is estimated to be around 1/32 per hour.

5 Discussion

In summary, we described a method to simultaneously estimate the mutation rate and the turnover rate, making it possible to determine which of them is responsible for the elevated number of mutations in tumors. In particular, the mutated sites' read counts, which are closely related to the shape of the site frequency spectrum, contain useful information about the turnover rate (death rate, relative to the birth rate), even in the presence of a moderate sequencing error rate. The sequencing error rate can also be estimated. It is also quite straightforward to elaborate the model by including nucleotide-dependent transition probabilities, or trinucleotide context-based effects.

The accuracy of the estimation is influenced by 4 factors. First, the sharpness of the peak of the loglikelihood function, which tends to be narrow for the expected amount of input data. Second, the finite number of trees used in the fit, causing only a slight dispersion for 10^4 trees. Third, the shape of the true lineage tree, which, for death rates extremely close to the birth rate, can distort the estimation by one order of magnitude. Finally, the assumptions behind the model (birth-death process with constant rates, neutral mutations from a Poisson process) also contribute to the uncertainty of the estimates.
According to the results, the estimation method works sufficiently well to distinguish cases of small difference between the birth and death rates (α − β ≪ α) from cases of the death rate being much lower than the birth rate. Without such capability, the answer to the question "Is it the mutation rate or the death rate?" would always be "mutation rate".

Although the method is presented in the context of human tumors, it can handle healthy tissues and samples from other species, too. In theory, any population descending from one ancestor and possessing genetic material can be analyzed; however, lengthier genomes giving rise to more mutations are easier to analyze, due to the increase in the input data for the estimation.

On the more practical side, we also discovered that averaging the loglikelihoods over trees, instead of the likelihoods, gives a significant improvement in the robustness of the results.

Concerning sequencing errors, the noise level in the standard Illumina technology makes applying the method to typical samples impractical. One solution is to use a sequencing technology with much lower error rates, e.g., [21, 22], or even below the 10^−6 error rate of the PCR process [13, 15]. It should be noted, however, that these technologies have been applied to short DNA segments only, resulting in a reduced number of mutations as input data. Another possibility is to apply noise filtering to standard sequencing data, e.g., deepSNV [23], and modify the error model of the fitting process accordingly. Furthermore, when there is no estimate of the order of magnitude of the sequencing error rate, variance in the accuracy of the method can be quite large in individual cases, despite the much better behavior of the averaged results.
Despite the shortcomings, it is clear that the signal does exist in the site frequency spectrum: the mean of the estimated turnover rates changes monotonically with the true turnover rates of the test datasets, and is clearly not independent of them.

Besides successes on synthetic data, we were also able to analyze an empirical sample of a hepatocellular carcinoma. We simultaneously estimated the mutation rate and the turnover rate. Both quantities were estimated to be much higher than for healthy tissues, with the mutation rate being 9.3 · 10^−8 per site per cell division, and the turnover rate t = 0.9989. In other words, the high number of mutations in this tumor is caused by a combination of both high mutation and turnover rates. Using the turnover rate, we also estimated the number of cell division rounds in the tumor's lifetime, and its cell division rate. The results suggest that tumor cells are constantly dividing, but the growth of the tumor is limited by other factors changing on a much longer timescale, e.g., securing sufficient blood supply, which cause most new tumor cells to die and slow down tumor growth. With such a high turnover rate, the ability of limitless replication is essential for tumor growth.

It is interesting to note that high turnover rates are able to reproduce subclonal peaks in the VAF, using a purely neutral birth-death process. In other words, subclonal peaks are not necessarily the consequence of selection; neutral processes can also produce them, indicating strong cell death.
On this basis, it is possible to give a definition of subclones as branches in the lineage tree, close to the root and long enough to appear as peaks on the VAF spectrum. Using this definition, there is no need to explain different parts of the VAF spectrum with different models [24].

In this work, the turnover rate was held constant during the evolutionary process. There are signs that it is more realistic to assume a turnover rate which changes during tumor growth [2]. In our case, the estimated strong cell death suggests that the tumor reached a slowly growing phase, in line with a Gompertzian model of tumor growth [25, 26], which is corroborated by the large sizes of observed tumors (diameter ≥ 1 cm) used in the doubling time estimation [19]. It is possible that in the earlier stages of tumor development, cell death was less frequent and doubling time was shorter. It might be the case that the rate of cell division is constant during tumor growth, and doubling time is set by the turnover rate, which is, in turn, limited by external factors. An interesting direction for future work is to extend the model by allowing the turnover rate to be time-dependent.

The combined effect of the estimated mutation and turnover rates is a very high effective mutation rate between cell divisions where both daughter branches survive, µ_eff being in the order of 10^−4 to 10^−5 per site per surviving cell division. While this value looks surprisingly large, it is logical that the combination of a slowly growing tumor and fast dividing tumor cells leads to a very large number of mutations.

Currently, the method uses a simple birth-death model for tumor growth. In the future, a more realistic growth model, including e.g., spatial effects [27, 28], would enhance the applicability of the method.
Another possibility for improvement is to model spatial sampling of tissues, in which the measured mutation frequencies intertwine the correlated ancestry of sampled cells with the prevalence of the mutations.

Author Contributions

(According to https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions.) GJSz conceptualized the research project. GT, ID and GJSz performed the formal analysis. Funding was acquired by ID and GJSz. GT carried out the investigation. GT, ID and GJSz contributed to the methodology. GT and DS developed the necessary software. GT provided the visualization. GT wrote the draft. DS, ID and GJSz reviewed and commented on the manuscript.

Acknowledgements

GT and GJSz received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme under grant agreement no. 714774. GJSz was also supported by the grant GINOP-2.3.2.–15–2016–00057.

References

[1] Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nat Gen. 2016;48:238–244.
[2] Bozic I, Gerold JM, Nowak MA. Quantifying Clonal and Subclonal Passenger Mutations in Cancer Evolution. PLoS Comput Biol. 2016;12(2):e1004731.
[3] Tomlinson I, Sasieni P and Bodmer W. How Many Mutations in a Cancer? Am J Pathol. 2002;160:755-758.
[4] Araten DJ, Golde DW, Zhang RH, Thaler HT, Gargiulo L, Notaro R et al. A Quantitative Measurement of the Human Somatic Mutation Rate. Cancer Research 2005;65:8111.
[5] Loeb LA, Bielas JH and Beckman RA.
Cancers Exhibit a Mutator Phenotype: Clinical Implications. Cancer Research 2008;68:3551.
[6] Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat Genet. 2018;50:895-901.
[7] Werner B, Case J, Williams MJ, Chkhaidze K, Temko D, Fernández-Mateos J, et al. Measuring single cell divisions in human tissues from multi-region sequencing data. Nat Comm. 2020;11:1035.
[8] Gernhard T. The conditioned reconstructed process. J Theor Biol. 2008;253:769.
[9] Maruvka YE, Kessler DA, Shnerb NM. The Birth-Death-Mutation Process: A New Paradigm for Fat Tailed Distributions. PLoS ONE 2011;6(11):e26480.
[10] Kessler DA, Levine H. Scaling solution in the large population limit of the general asymmetric stochastic Luria-Delbrück evolution process. J Stat Phys. 2015;158:783–805.
[11] Schrempf D. The ELynx Suite; 2019 [cited 2020 Sept 01]. Repository: GitHub [Internet]. Available from: https://github.com/dschrempf/elynx
[12] Höhna S, May MR, Moore BR. TESS: an R package for efficiently simulating phylogenetic trees and performing Bayesian inference of lineage diversification rates. Bioinformatics 2016;32(5):789-791.
[13] Kennedy SR, Schmitt MW, Fox EJ, Kohrn BF, Salk JJ, Ahn EH, et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat Protoc. 2014;9:2586.
[14] Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, et al. Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution. PNAS 2015;112:E6496-E6505.
[15] Wang K, Lai S, Yang X, Zhu T, Lu X, Wu C, et al. Ultrasensitive and high-efficiency screen of de novo low-frequency mutations by o2n-seq. Nat Comm. 2017;8:15335.
[16] Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568-76.
[17] Zare F, Dow M, Monteleone N, Hosny A, Nabavi S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinformatics 2017;18:286.
[18] Lynch M. Evolution of the mutation rate. Trends Genet. 2010;28:345.
[19] An C, Chou YA, Choi D, Paik YH, Ahn SH, Kim M-J, et al. Growth rate of early-stage hepatocellular carcinoma in patients with chronic liver disease. Clin Mol Hepatol. 2015;21:279.
[20] Stadler T. On incomplete sampling under birth-death models and connections to the sampling-based coalescent. J Theor Biol. 2009;261:58-66.
[21] Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH, et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. PNAS 2013;110:19872-19877.
[22] Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. PNAS 2011;108:9530-9535.
[23] Gerstung M, Beisel C, Rechsteiner M, Wild P, Schraml P, Moch H, et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat Comm. 2012;3:811.
[24] Caravagna G, Heide T, Williams MJ, Zapata L, Nichol D, Chkhaidze K, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nat Gen. 2020;52:898–907.
[25] Laird AK. Dynamics of Tumour Growth: Comparison of Growth Rates and Extrapolation of Growth Curve to One Cell. Br J Cancer 1965;19:278–291.
[26] Lo CF. A modified stochastic Gompertz model for tumour cell growth. Comp Math Methods Medicine 2010;11:3-11.
[27] Antal T, Krapivsky PL, Nowak MA. Spatial evolution of tumors with successive driver mutations. Phys Rev E 2015;92:022705.
[28] Noble R, Burri D, Kather JN, Beerenwinkel N. Spatial structure governs the mode of tumour evolution. BioRxiv [Preprint]. 2019 bioRxiv 586735 [posted 2019 Mar 23; revised 2019 Apr 13, cited 2020 Sept 24]: [19 p.]. Available from: https://doi.org/10.1101/586735
[29] Nelson KP. Assessing Probabilistic Inference by Comparing the Generalized Mean of the Model and Source Probabilities. Entropy 2017;19:286.
[30] Nelson KP. Inference assessment on a probability scale. 51st Annual Conference on Information Sciences and Systems (CISS) 2017; 1-5. doi:10.1109/CISS.2017.7926106.
10_1101-2021_02_12_430923 ---- Kincore: a web resource for structural classification of protein kinases and their inhibitors
Vivek Modi Roland Dunbrack Jr.
Institute for Cancer Research Fox Chase Cancer Center, Philadelphia PA 19111 USA
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.12.430923; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
Abstract
Protein kinases exhibit significant structural diversity, primarily in the conformation of the activation loop and other components of the active site. We previously performed a clustering of the conformation of the activation loop of all protein kinase structures in the Protein Data Bank (Modi and Dunbrack, PNAS, 116:6818-6827, 2019) into 8 classes based on the location of the Phe side chain of the DFG motif at the N-terminus of the activation loop. This is determined with a distance metric that measures the difference in the dihedral angles that determine the placement of the Phe side chain (the φ, ψ of X, D, and F of the X-DFG motif and the χ1 of the Phe side chain). The nomenclature is based on the regions of the Ramachandran map occupied by the XDF residues and the χ1 rotamer of the Phe residue. All active structures are “BLAminus”, while common inactive DFGin conformations are “BLBplus” and “ABAminus”. Type II inhibitors bind almost exclusively to the DFGout “BBAminus” conformation. In this paper, we present Kincore (http://dunbrack.fccc.edu/kincore), a web resource providing access to the conformational assignments based on our clustering along with labels for ligand types (Type I, Type II, etc.) bound to each kinase chain in the PDB.
The data are annotated with several properties including PDBid, Uniprotid, gene, protein name, phylogenetic group, spatial and dihedral labels for orientation of DFGmotif residues, C-helix disposition, ligand name and type. The user can browse and query the database using these attributes individually, or run an advanced search that combines them, for example a phylogenetic group with a specific conformational label and ligand type. The user can also determine the spatial and dihedral labels for a structure with unknown conformation using the web server or the standalone program. The entire database can be downloaded as text files, and the structure files as PyMOL sessions and in mmCIF format. We believe that Kincore will help in understanding the conformational dynamics of these proteins and guide the development of inhibitors targeting specific states.
Introduction
Protein kinases are catalytic molecular switches that regulate signaling pathways in cells by phosphorylating protein substrates [1]. Their catalytic activity is achieved by a remarkably flexible active site, which is observed in multiple different conformations when the enzyme is in an inactive state but adopts a unique conformation in the catalytically active state. The dysregulation of this mechanism due to a mutation or upregulation of expression can lead to a variety of diseases including cancer [2, 3]. Protein kinases are widely studied as drug targets, with molecules designed to inhibit the active state or stabilize a specific inactive state [4, 5].
Thus, understanding the conformational dynamics of protein kinases is critical for developing better drugs and gaining novel biological insights. There are 484 typical protein kinase genes with 497 kinase domains in the human genome [6, 7]. This number includes several pseudokinases but excludes atypical protein kinase genes, some of which are distantly related to the typical protein kinase fold [7]. Among the 497 domains, the structures of 283 have currently been determined experimentally, either in apo form or in complex with ligands. The protein kinase fold consists of an N-terminal lobe, which is formed by a five-stranded beta sheet and one alpha helix called the C-helix, and a C-terminal lobe which consists of five or six alpha helices. The two lobes form a deep cleft in the middle region of the protein, creating the ATP-binding active site. This site is surrounded by several structural elements critical for catalysis, which occupy a unique conformation in the active state and exhibit flexibility across different inactive states of the enzyme. One of the most critical elements is the activation loop, which adopts a unique extended orientation in the active state of the kinase and multiple types of folded conformations in inactive states. It begins with a conserved motif called the DFGmotif (Asp-Phe-Gly) whose orientation is tightly coupled with the active/inactive status of the protein. In addition, the C-helix adopts an inward disposition in the active state while exhibiting a range of positions and orientations in other states. The DFGmotif conformations were previously described using the simple convention of DFGin and DFGout. The DFGin group consists of all the conformations in which DFG-Asp points into the ATP pocket and DFG-Phe is adjacent to the C-helix. The structures solved in the active state conformation of the enzyme form a subset of this category.
In DFGout conformations, the DFG-Asp and DFG-Phe residues swap their positions so that DFG-Asp is removed from the ATP binding site and replaced with DFG-Phe. All the Type II inhibitors bind to DFGout conformations [8]. The DFGin and DFGout groups, however, provide only a broad description of a more complex conformational landscape [9, 10]. In our previous work, we developed a scheme for clustering and labeling different conformations of protein kinase structures [11]. Our clustering scheme is based on the spatial location and backbone and side-chain dihedrals of the conserved DFGmotif in the activation loop. We clustered all the conformations into three spatial groups (DFGin, DFGinter, DFGout) based on the proximity of the DFG-Phe side chain to two different residues in the N-terminal domain. Within these groups, we further clustered the structures by the dihedral angles that determine the location of the DFG-Phe side chain: the backbone dihedrals of the X, D and F residues (where X is the residue before the DFGmotif) and the χ1 dihedral angle of the Phe side chain. The kinase states are therefore named after the region of the Ramachandran map occupied by the X, D, and F residues (A for alpha, B for beta, L for left-handed) and the Phe χ1 rotamer (plus, minus, or trans for the +60°, -60°, or 180° conformations). As a result, among the DFGin structures, we distinguished between the catalytically active kinase conformation (labeled BLAminus) and five inactive conformations (BLBplus, BLBminus, BLBtrans, ABAminus, BLAplus).
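The naming convention described above can be illustrated with a short sketch. It only shows how a label string is composed from a spatial group, three Ramachandran region letters and a χ1 rotamer name; it is not the clustering procedure itself, which assigns labels by distance to cluster centroids. The function names and the 120°-wide rotamer bins are assumptions made for illustration.

```python
from typing import Sequence


def chi1_rotamer(chi1_deg: float) -> str:
    """Name the Phe chi1 rotamer: plus (~+60), minus (~-60) or trans (~180).

    Assumption: each rotamer is given a 120-degree-wide bin around its
    ideal value; the actual Kincore cutoffs are not stated in this text.
    """
    angle = (chi1_deg + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    if 0.0 <= angle < 120.0:
        return "plus"
    if -120.0 <= angle < 0.0:
        return "minus"
    return "trans"


def conformation_label(spatial: str, regions: Sequence[str], chi1_deg: float) -> str:
    """Compose a label such as 'DFGin-BLAminus'.

    `regions` holds the Ramachandran region letters of the X, D and F
    residues, in that order (A = alpha, B = beta, L = left-handed).
    """
    if len(regions) != 3 or any(r not in ("A", "B", "L") for r in regions):
        raise ValueError("expected three region letters drawn from A, B, L")
    return f"{spatial}-{''.join(regions)}{chi1_rotamer(chi1_deg)}"
```

For example, a DFGin chain whose X, D and F residues fall in the B, L and A regions with a Phe χ1 near -60° is named DFGin-BLAminus under this convention.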
Among DFGout structures, we identified one dominant conformation labeled BBAminus, which is strongly correlated with Type II kinase inhibitors, such as imatinib. Finally, among the small set of DFGinter structures, where the Phe side chain is intermediate between the DFGin and DFGout positions, we distinguished one cluster based on clustering the dihedral angles (BABtrans). Our nomenclature strongly correlates with other structural features associated with active and inactive kinases, such as the positions of the C-helix and the activation loop and the presence or absence of the N-terminal domain salt bridge. Since our clustering and nomenclature are based on backbone dihedrals, they are intuitive to structural biologists and easy to apply in a wide variety of experimental and computational studies, as demonstrated recently in identifying the conformation in the crystal structure of IRAK3 [12], molecular dynamics simulations of Abl kinase [13] and structural analyses of pseudokinases [14]. Developing small molecule inhibitors is one of the most common therapeutic strategies against protein kinases. These inhibitors occupy the ATP binding pocket and allosteric sites on the surface of the protein. Two approaches have been used to classify inhibitors: a) based on the region of the protein to which the inhibitor binds; b) based on the conformation of the protein to which it binds. The first approach was used by Dar and Shokat [15], who defined three types of inhibitors: Type I – inhibitors which bind to the adenosine pocket but do not require a specific conformation of structural elements including the C-helix and DFGmotif; Type II – inhibitors that occupy the adenosine pocket and induce DFGout conformations because they extend into the pocket adjacent to the C-helix occupied by DFG-Phe in DFGin structures; Type III – inhibitors that block kinase activity but without displacing ATP.
This classification was extended by Zuccotto and coworkers, who introduced Type I½ inhibitors as molecules which bind to the ATP region like Type I compounds but extend into the back cavity, making additional contacts with the residues involved in Type II binding [16]. Rauh et al. defined Type IV as the allosteric inhibitors which bind to a site distant from the ATP binding region, inducing an inactive conformation in the active site [17, 18]. van Linden et al. defined the ligand types by identifying three regions in the active site (a front cleft, the gate area, and the back cleft), which are further divided into subpockets [19], without the use of labels like Type I, II etc. Roskoski used the second approach and redefined all the inhibitors based on the conformation of the protein [20]. According to this scheme, Type I inhibitors bind only to the active conformation, Type I½ inhibitors bind to DFGin inactive conformations, and Type II inhibitors bind to the DFGout conformation. Each of these categories was divided into two subtypes, A and B. However, this scheme is inadequate because, as we have shown, some inhibitors such as Bosutinib and Sunitinib can bind to different conformations across proteins [11]. For example, according to Roskoski’s classification Sunitinib will be labeled Type I in 6NFZ_A (DFGin-BLAminus) and Type IIB in 3G0F_A (DFGout-BBAminus), even though it binds to the kinase domain in an identical manner in both structures.
In this paper, we present the Kinase Conformation Resource, Kincore, a web resource which automatically collects and curates all protein kinase structures from the Protein Data Bank (PDB) and assigns conformational and inhibitor type labels. The website is designed so that the information for all the structures can be accessed at once using one database table and instances of it through individual pages for kinase phylogenetic groups, genes, conformational labels, PDBids, ligands and ligand types. The database can be searched using unique identifiers such as PDBid or gene, and queried using a combination of attributes such as phylogenetic group, conformational label and ligand type. We also provide several options to download data: database tables as tab-separated files, and the kinase structures as PyMOL sessions and as coordinate files in mmCIF format. The structures have been renumbered by Uniprot and by our common numbering scheme, which is derived from our structure-based alignment of all 497 human protein kinase domains [7]. We have also developed a webserver and a standalone program which can be used to determine the spatial and dihedral labels for a structure with unknown conformation. We automatically label ligand types based on the pockets to which an inhibitor binds, defined by specific residues in the kinase domain. Thus, we use five labels for different ligand types: Type I – bind to ATP binding region only (both active and inactive DFGin states); Type I½ – ATP binding region and extending into the back pocket (both active and inactive DFGin states); Type II – ATP binding region and extending to back pocket regions exposed only in DFGout structures; Type III – back pocket only without displacing ATP; and Allosteric – outside the active site cleft.
Results
Kincore provides conformational assignments and ligand type labels to protein kinase structures from PDB. The current update contains structures from 283 kinase genes from humans (7129 chains) and from 55 genes (707 chains) from seven model organisms.
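The five pocket-based ligand type labels introduced above amount to a small decision rule over the set of binding regions a ligand contacts. The sketch below is one possible way to express that rule, not the Kincore implementation: the region names ("atp", "back", "type2_only", "other") are hypothetical, and the residue definitions and contact criteria that decide whether a region counts as contacted are described in the paper's Methods and are not reproduced here.

```python
from typing import Iterable

# Hypothetical region names used only in this sketch.
KNOWN_REGIONS = frozenset({"atp", "back", "type2_only", "other"})


def ligand_type(contacted_regions: Iterable[str]) -> str:
    """Map the set of contacted binding regions to a ligand type label."""
    regions = frozenset(contacted_regions)
    if not regions or not regions <= KNOWN_REGIONS:
        raise ValueError(f"expected a non-empty subset of {sorted(KNOWN_REGIONS)}")
    if "atp" in regions:
        if {"back", "type2_only"} <= regions:
            return "Type II"    # ATP region + back pocket + Type II-only region
        if "back" in regions:
            return "Type I½"    # ATP region extending into the back pocket
        return "Type I"         # ATP binding region only
    if "back" in regions:
        return "Type III"       # back pocket only, ATP region not occupied
    return "Allosteric"         # outside the active site cleft
```

In this sketch a ligand that touches the ATP region and the Type II-only region but not the back pocket falls through to Type I; how such edge cases are handled in practice is not stated in the text, so that choice is an assumption.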
The PK structures were identified from the PDB [21] with PSI-BLAST [22], using a kinase PSSM as the query (Methods). The PDB files are split by chain, renumbered by Uniprot numbering [23, 24] and our common residue numbering scheme, and annotated with conformational and ligand type labels as described below. The conformational labels are assigned using the structural features and clusters described in our previous work [11]. The scheme assigns two types of labels to each chain:
1) A spatial label (DFGin, DFGinter, DFGout), assigned by computing the distance of the DFG-Phe-CZ atom from the Cα atoms of two conserved residues and applying distance cutoff criteria (Methods). The two residues are the β-strand 3 Lys (B3-Lys), involved in the N-terminal domain salt bridge formed in active kinase structures (and some inactive structures), and the residue four amino acids past the C-helix-Glu involved in the same salt bridge.
2) A dihedral label: the dihedral angles (φ, ψ of X-DFG, Asp, Phe and χ1 of Phe) of each chain in a spatial group are used to calculate the distance of the structure from the precomputed cluster centroids, and the chain is assigned a label if its distance satisfies defined cutoff criteria (Methods).
All the kinase conformations are represented by a set of eight labels: DFGin-BLAminus, DFGin-BLAplus, DFGin-ABAminus, DFGin-BLBminus, DFGin-BLBplus, DFGin-BLBtrans; DFGout-BBAminus; DFGinter-BABtrans. The chains that do not satisfy the dihedral distance cutoff criteria for any cluster or are missing some of the relevant coordinates are labeled as ‘Unassigned’. Additionally, we have labeled the C-helix disposition by computing the distance between the C-helix-Glu-Cβ atom and the B3-Lys-Cβ atom (as a proxy for the conserved salt bridge interaction), labeling each chain as C-helix-in or C-helix-out (Methods).
Figure 1: Representative protein kinase structure (3ETA_A) displaying the residues used to define inhibitor binding regions.
To assign labels to ligands, we have used specific residue positions to identify regions of the binding pocket: the ATP binding pocket (including the hinge residues), the back pocket and the Type II-only region (Figure 1). The structures are first renumbered by our common numbering scheme so that all the aligned residues have the same residue number across all the kinases. A ligand is then assigned a label based on its contacts with different binding regions. We have used the following five ligand type labels to annotate all the ligand-bound structures of protein kinases (Figure 1):
1. Type I – bind to ATP binding region only
2. Type I½ – bind to ATP binding region and extend into the back pocket (subdivided as Type I½-front and Type I½-back depending on contact with N-terminal or C-terminal residues of the C-helix, respectively)
3. Type II – bind to the ATP binding region and extend into the back pocket and Type II-only region
4. Type III – bind only in the back pocket without displacing ATP
5. Allosteric – any pocket outside the ATP-binding region
The distribution of different ligand types across kinase conformations is provided in Table 1. Type I and Type I½ are the most commonly observed inhibitor types. Type II inhibitors are observed only with DFGout structures, whereas all other inhibitor types are observed in complex with multiple conformational states.
Table 1: Distribution of ligand types across protein kinase conformations (number of chains).
Spatial label  Dihedral label      Type I        Type I½ (front+back)  Type II     Type III    Allosteric  Total (%)
DFGin          BLAminus (active)   2926          196                   -           12          199         3333 (55.0)
DFGin          BLBplus             443           76                    -           59          15          593 (9.8)
DFGin          ABAminus            479           36                    -           1           19          535 (8.8)
DFGin          BLBminus            162           11                    -           5           10          188 (3.1)
DFGin          BLBtrans            175           6                     -           -           5           186 (3.1)
DFGin          BLAplus             91            86                    -           -           1           178 (2.9)
DFGin          Noise               282           38                    -           1           18          339 (5.6)
DFGout         BBAminus            20            9                     288         69          24          410 (6.8)
DFGout         Noise               43            17                    79          26          12          177 (2.9)
DFGinter       BABtrans            14            1                     -           -           -           15 (0.2)
DFGinter       Noise               89            16                    -           3           3           111 (1.8)
Total (%)                          4724 (77.9)   492 (8.1)             367 (6.1)   176 (2.9)   306 (5.0)
Many inhibitors are observed in multiple crystal structures bound to one or more different kinases. We counted the number of unique inhibitors that occur bound to kinase chains in two (or more) states across entries in the PDB. Table 2 gives the number of unique inhibitors that occur in each pair of states (excluding the unclassified spatial or dihedral labels). The numbers along the diagonal are the counts of unique inhibitors observed in at least one structure of the given state. A total of 259 inhibitors occur in two or more kinase states.
Table 2: Counts of inhibitors that are bound to chains in two or more states.
                   DFGin-BLAminus  DFGin-ABAminus  DFGin-BLBplus  DFGin-BLBminus  DFGin-BLBtrans  DFGin-BLAplus  DFGout-BBAminus  DFGinter-BABtrans
DFGin-BLAminus     1686
DFGin-ABAminus     48              334
DFGin-BLBplus      39              11              344
DFGin-BLBminus     26              9               11             210
DFGin-BLBtrans     29              4               11             4               134
DFGin-BLAplus      15              6               13             8               2               107
DFGout-BBAminus    7               3               2              2               2               3              254
DFGinter-BABtrans  6               2               4              4               1               3              1                8
Numbers along the diagonal provide the number of unique inhibitors in each state. The off-diagonal values are the number of unique inhibitors bound to chains in the two states shown in the row and column headers.
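The pairwise counting behind Table 2 can be sketched as follows, assuming the input is one (ligand identifier, conformational state) pair per ligand-bound chain. The function names and the toy identifiers in the usage note are hypothetical; this is an illustration of the counting logic, not the authors' code.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from typing import Dict, Iterable, Set, Tuple


def states_per_ligand(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the set of distinct states in which each ligand is observed."""
    states = defaultdict(set)
    for ligand, state in observations:
        states[ligand].add(state)
    return dict(states)


def pair_counts(observations: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count unique ligands for every pair of states.

    Keys are (state_a, state_b) with state_a <= state_b. A key with two
    identical states is a diagonal entry: the number of unique ligands
    seen in that state at least once.
    """
    counts = defaultdict(int)
    for states in states_per_ligand(observations).values():
        for a, b in combinations_with_replacement(sorted(states), 2):
            counts[(a, b)] += 1
    return dict(counts)


def multi_state_ligands(observations: Iterable[Tuple[str, str]]) -> int:
    """Number of unique ligands observed in two or more states."""
    return sum(1 for s in states_per_ligand(observations).values() if len(s) >= 2)
```

With toy input such as [("LIG1", "DFGin-BLAminus"), ("LIG1", "DFGout-BBAminus"), ("LIG2", "DFGout-BBAminus")], the ligand LIG1 contributes to both diagonal entries and to the off-diagonal pair, while LIG2 contributes only to the DFGout-BBAminus diagonal. Repeated chains with the same ligand and state are counted once because states are stored in a set.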
Website
The web pages on Kincore share a common format so that information is organized consistently across the website. Each page retrieved from the database is organized in two parts. The top part provides a summary of the number of structures in the queried groups or conformations, with representative structures from each category listed and displayed. This is followed by a table from the database with each unique PDB chain as a row, providing different kinds of information including conformational labels, C-helix position, kinase family, gene name, Uniprot ID, ligand PDB ID, and ligand type. The kinase group, gene name, PDB code, conformational labels, ligand name and ligand type are hyperlinked to their specific pages. Each page also contains three tabs on the top to list ‘Human’, ‘Non-human’ and ‘All’ structures. Buttons on each page allow the user to download the database table as a tab-separated file, and to download all of the kinase structures on the page as PyMOL sessions and renumbered coordinate files.
Figure 2: Snapshot of database table displaying entries for PDB chains on Browse page.
The information from the database can be accessed using two main pages:
1. Browse page: This page provides statistics and labels for all the kinase structures in the database (Figure 2). The ‘Summary’ table on top of the page displays the distribution of protein kinase chains in the PDB across conformational states and phylogenetic groups.
This is followed by the ‘Database’ table, which contains annotations for all individual PDB chains retrieved from the database. The entire table, with additional information like resolution, Rfactor, activation loop residue etc., can be downloaded as a tab-separated file.
2. Search page: This page offers two options to query the database:
• Unique identifier: The database can be queried by PDB entry code (e.g., 2GS6), UniProt identifier (e.g., EGFR_HUMAN), gene name (e.g., EGFR), and ligand identifier (e.g., STI). The result will take the user to the page dedicated to the specific query item. For reference, the list of all genes in the database is provided through a ‘Help’ button above the search box.
• Advanced query: The database can be queried by selecting kinase phylogenetic group, conformational label, and ligand type using a drop-down menu. If the ‘All’ option is selected for all three categories, then the entire database table can be accessed at once. A subset of chains in the database can be retrieved by selecting a specific group name, conformational label, and ligand type; for example, selecting TYR group + DFGout-BBAminus + Type II ligand type will retrieve all the structures which have these three annotations. If all the structures in complex with a Type I½ ligand are desired, then the user can select ‘All group’ + ‘All conformations’ + ‘Type I½ ligand’.
The website contains several webpages which are dynamically generated and retrieve queried instances of the database.
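The advanced query described above behaves like a three-way filter in which ‘All’ acts as a wildcard. A minimal offline sketch of the same logic, applied to a downloaded database table, is given below. It assumes the column names Group, SpatialLabel, DihedralLabel and LigandType from the TSV header listed under Download Options; the exact strings stored in those columns (for example how ligand types are spelled) are an assumption, and the example rows are hypothetical placeholders rather than real database entries.

```python
import csv
from typing import Dict, Iterable, List, TextIO

Row = Dict[str, str]


def load_rows(handle: TextIO) -> List[Row]:
    """Read a tab-separated database table into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def advanced_query(rows: Iterable[Row], group: str = "All",
                   conformation: str = "All", ligand_type: str = "All") -> List[Row]:
    """Keep rows matching every selected attribute; 'All' matches anything."""
    def keep(row: Row) -> bool:
        label = f"{row['SpatialLabel']}-{row['DihedralLabel']}"
        return (group in ("All", row["Group"])
                and conformation in ("All", label)
                and ligand_type in ("All", row["LigandType"]))
    return [row for row in rows if keep(row)]


# Hypothetical rows for illustration only (not taken from the database).
EXAMPLE_ROWS: List[Row] = [
    {"Group": "TYR", "Gene": "GENE1", "PDB": "0AAAA", "SpatialLabel": "DFGout",
     "DihedralLabel": "BBAminus", "LigandType": "Type II"},
    {"Group": "TYR", "Gene": "GENE1", "PDB": "0BBBA", "SpatialLabel": "DFGin",
     "DihedralLabel": "BLAminus", "LigandType": "Type I"},
    {"Group": "AGC", "Gene": "GENE2", "PDB": "0CCCA", "SpatialLabel": "DFGin",
     "DihedralLabel": "BLAminus", "LigandType": "Type I½"},
]
```

Calling advanced_query(EXAMPLE_ROWS, "TYR", "DFGout-BBAminus", "Type II") mirrors the TYR + DFGout-BBAminus + Type II example, and leaving all three arguments at their defaults returns every row.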
These pages can be accessed as a result of individual queries or by clicking on the hyperlinks on the Browse page table. They are:
1. Phylogenetic group page: typical protein kinases are divided into nine phylogenetic groups – AGC, CAMK, CMGC, CK1, NEK, RGC, STE, TKL and TYR [ref]. Each group is assigned a page on Kincore displaying information about the structures in that group. On each page, the Summary table provides the number of kinase chains in the group across different conformations with their representative structures (best resolution and least missing residues). These representative structures are also displayed on the page in 3D using NGL viewer.
2. Gene page: A page for each kinase gene in the PDB can be accessed through the hyperlinks on the Browse page or by the unique identifier Search feature, and contains information for all the structures of a specific gene. The summary table on the page gives the number of structures available and their distribution across different conformations, with a representative example for each. It also provides hyperlinks to the phylogenetic group page (described above) for the gene and the corresponding protein entry on the Uniprot website. In addition to the data provided on the Browse page, the Database table on this page also contains, for each chain, information on mutations and phosphorylation, the total length of the structure, and the number of residues resolved in the activation loop.
3. PDB page: The PDB page provides information on individual PDB entries and can be accessed by the hyperlinks on the Browse page or by the unique identifier Search feature (Figure 3). Each PDB entry is annotated with information on gene, protein name, phylogenetic group, UniProt id, organism, domain boundary, resolution, conformation, and ligand type labels for every chain. Additionally, the page contains a sequence feature displaying the UniProt sequence of the protein in the structure.
The residues which are unresolved in the structure are displayed in lower case letters to distinguish them from residues with coordinates in the entry. Further, mutated and phosphorylated residues are shown in red and green color, respectively.
4. Ligand page: The ligand page provides access to all chains in complex with a specific ligand. For example, all the structures in complex with ATP can be retrieved by querying for ‘ATP’ on the Search page or clicking on the hyperlinks on the Browse page. The Summary table provides the number of chains in complex with the ligand across different conformations. Like other pages, the Database table provides the list of all the PDB chains with conformational labels and ligand annotations. This page facilitates the comparison of conformations and ligand binding mode across structures from one or multiple kinases in complex with the same ligand. For example, Bosutinib (PDB identifier DB8), which is an FDA-approved drug, is found in complex with structures from 10 kinases in 5 different conformations (Figure 4).
Figure 3: Snapshot of PDB page with the sequence feature.
Alignment Page
In our previous work, we developed a structure-based multiple sequence alignment (MSA) for 497 human protein kinase domains [7]. This alignment contains 17 blocks of aligned regions conserved across human kinases with intermittent regions of low sequence similarity in lower case letters. The alignment is annotated with gene name, UniProt id, and protein residue numbers.
On Kincore, we provide access to this MSA through the Alignment page, which contains basic information about the alignment with a table of conserved regions across human kinases. The alignment can be visualized inside the browser window through the ‘Open in browser’ button, created using Jalview’s BioJS feature. This feature provides multiple options for quick analysis including buttons to filter, color, or sort the sequences within the browser window. The alignment is also available to download as a Jalview session as well as Clustal- and FASTA-formatted files.
Phylogeny Page
Using our multiple sequence alignment, we also updated the protein kinase phylogenetic tree [7]. This tree was used to assign a set of ten kinases previously categorized as “OTHER” to the CAMK group, consisting of Aurora kinases, Polo-like kinases, and calcium/calmodulin-dependent kinase kinases. On our resource the tree can be accessed through the Phylogeny page. It provides basic information about the tree, the number of kinase genes and domains in different phylogenetic groups, and links to visualize and download the tree.
Figure 4: Snapshot of ligand page displaying Bosutinib (PDB ligand identifier DB8) in complex with structures from 10 kinase genes and in 5 different conformations.
Download Options
We provide multiple data download options on Kincore to assist the user in different kinds of analysis. These download options are created for all the pages or any instance of the database retrieved by a query, e.g. structures of a specific gene, ligand etc.
or structures from an advanced query like TYR kinases with DFGout state and Type II ligands. These options are:
1. Coordinate Files
We provide structure files in mmCIF and PDB format with three different numbering systems: the original author residue numbering; renumbered by Uniprot protein sequence; and a common residue numbering scheme derived from our multiple sequence alignment of kinases [7].
2. PyMOL Sessions
We provide PyMOL [25] sessions for the structures retrieved from any query from the database. Two PyMOL sessions are provided for each query: All chains and Representative chains (best resolution, least missing residues). Across all the PyMOL sessions, the chains are labeled in a consistent format as PhyloGroup_Gene_SpatialLabel_DihedralLabel_PDBidChainid (e.g., TYR_EGFR_DFGin_BLAminus_2GS6A). Additionally, we provide PyMOL scripts (.pml format) which the user can download and run on a local machine to create the sessions.
3. Database Files
We provide the information retrieved from the database on every page as tab-separated files which can be downloaded using the ‘Database table as tsv’ button. When clicked on the ‘Browse’ page, this button will download the information in the entire database in one file. On the other pages specific for a gene or conformation, this file will contain only the subset of the information from the database which is queried.
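Because the PyMOL chain labels follow the fixed pattern PhyloGroup_Gene_SpatialLabel_DihedralLabel_PDBidChainid, they can be composed and split programmatically. The sketch below is a hypothetical helper, not part of Kincore; it assumes four-character PDB identifiers and that only the gene field could ever contain an underscore.

```python
from typing import NamedTuple


class ChainLabel(NamedTuple):
    group: str      # phylogenetic group, e.g. TYR
    gene: str       # gene name, e.g. EGFR
    spatial: str    # spatial label, e.g. DFGin
    dihedral: str   # dihedral label, e.g. BLAminus
    pdb_id: str     # four-character PDB entry code (assumed length)
    chain_id: str   # chain identifier


def format_label(label: ChainLabel) -> str:
    """Build the label string used for a chain in the PyMOL sessions."""
    return (f"{label.group}_{label.gene}_{label.spatial}_"
            f"{label.dihedral}_{label.pdb_id}{label.chain_id}")


def parse_label(text: str) -> ChainLabel:
    """Split a label string back into its fields."""
    parts = text.split("_")
    if len(parts) < 5 or len(parts[-1]) < 5:
        raise ValueError(f"not a recognised chain label: {text!r}")
    gene = "_".join(parts[1:-3])  # tolerate underscores inside the gene field
    pdb_chain = parts[-1]
    return ChainLabel(parts[0], gene, parts[-3], parts[-2],
                      pdb_chain[:4], pdb_chain[4:])
```

Parsing the example from the text, TYR_EGFR_DFGin_BLAminus_2GS6A, yields group TYR, gene EGFR, spatial label DFGin, dihedral label BLAminus, PDB entry 2GS6 and chain A.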
The tsv file has the following header: “Organism Group Gene UniprotID PDB Method Resolution Rfac FreeRfac SpatialLabel DihedralLabel C-helix Ligand LigandType DFG_Phe Edia_X_O Edia_Asp_O Edia_Phe_O Edia_Gly_O ProteinName”

4. Bulk download
The ‘Download’ page provides different options to download structure files and PyMOL sessions in bulk. The page is divided into two sections: coordinate files and PyMOL sessions. The user can download coordinate files for all the structures in one zip folder or in subsets by phylogenetic group, gene, and conformational label. The tab at the top of the page gives the option to download files with the original author residue numbering, renumbered by UniProt protein sequence, or with the common residue numbering from our alignment. The second part of the ‘Download’ page provides PyMOL sessions for phylogenetic groups, genes, and ligands.

We have developed a webserver to which the user can upload a kinase structure file in PDB or mmCIF format to determine its conformation. The program extracts the sequence from the structure file and identifies residue positions by aligning it with precomputed HMM profiles of kinase groups. It then determines the conformation of the protein by assigning Spatial and Dihedral labels (Methods). On the output page, the server prints the kinase phylogenetic group that is the closest match to the sequence of the input structure, the dihedrals of the X-DFG, DFG-Asp, and DFG-Phe residues, the spatial group, the dihedral label, and the C-helix disposition.

We have written a standalone program in Python3 which the user can download to assign conformational labels to an unannotated structure. The program can be run in two ways: a) with flag align=True, alignment with precomputed HMM profiles is done to identify the residue numbers for B3-Lys, C-helix-Glu and DFG-Phe. The program then computes inter-residue distances and dihedral angles to label the conformation in the structure (Methods); b) with flag align=False, alignment with an HMM profile is not done, and the residue numbers are provided by the user. This option is faster and more useful for identifying conformations in a large number of structures generated from molecular dynamics simulations.

Discussion

Experimentally determined protein kinase structures in apo-form or in complex with a ligand display an extremely flexible active site. However, examining the conformational dynamics of kinases and its role in ligand binding requires combining two pieces of information: the conformational state of the protein and the type of ligand in complex.

Currently, there are two main resources, KinaMetrix and KLIFS, that address protein kinase conformations and inhibitors. However, they provide either conformational assignments or ligand type information, but not both. KinaMetrix (http://kinametrix.com/) offers a simple scheme of DFGin and DFGout coupled with C-helix conformation [26]. The resource does not provide information on ligands and lacks any download options for structures. This resource has not been updated with structures since May 2017. KLIFS (https://klifs.vu-compmedchem.nl/index.php) also offers a simple DFGin and DFGout classification [19, 27] and does not distinguish active and inactive DFGin structures. This resource is more focused on providing information about ligand binding to kinases.
It is regularly updated and allows bulk downloads for the results of each search.

Kincore fills a gap by providing a sophisticated scheme for kinase conformations, with ligand type labels. The information can be accessed through individual queries, for example getting a list of all chains in complex with a Type II ligand, or through a combination of queries such as AGC group kinases + DFGin-BLBplus conformation + Type I½ ligand. A feature that distinguishes Kincore from many structural bioinformatics resources is the ability to download coordinate files for the result of any query in one click. For example, a search for AURKA produces a list of 191 protein chains from 154 PDB entries. These can be downloaded in mmCIF format with one click, with residue numbering in the original PDB numbering, renumbered according to the UniProt sequences, or in our common residue numbering scheme from the kinase multiple sequence alignment. Each coordinate file is labeled by spatial label and dihedral angle cluster, e.g. CAMK_AURKA_DFGin_BLAminus_1OL6A.cif. A user can also download a PyMOL session file with all of the structures for a given query.

In addition, an important part of our resource is the web server and standalone program, which can label the unknown conformation of a new structure. The standalone program can run on structure files with multiple chains and models. We believe it will be extremely useful for batch processing the structures generated from a molecular modeling protocol or molecular dynamics simulation.

Several experimental and computational studies have reported applying the nomenclature from our previous work in structural analyses of kinases [11]. Lange and colleagues solved the crystal structure of the pseudokinase IRAK3 (PDBID 6RUU) and identified its conformation as BLAminus, similar to the active state of a typical protein kinase [12]. Paul et al. studied the dynamics of ABL kinase by various simulation techniques with Markov state models and analyzed the transition between different metastable states using our nomenclature [13]. Kirubakaran et al. identified the catalytically primed structures (BLAminus) from the PDB to create a comparative modeling pipeline for the ligand-bound structures of CDK kinases [28]. Paul and Srinivasan carried out structural analyses of pseudokinases in Arabidopsis thaliana and compared them with typical protein kinases by applying our conformational labels [14]. Therefore, we believe that the development of the Kincore database and webserver will greatly benefit a larger research community by making the labeled kinase structures more accessible and facilitating identification of kinase conformations in a wide range of studies.

Methods

Identifying and renumbering protein kinase structures

The database contains protein kinase domains from Homo sapiens and seven model organisms: Bos taurus, Danio rerio, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Sus scrofa and Xenopus laevis. To identify structures from these organisms, the sequence of human Aurora A kinase (residues 125-391) was used to construct a PSSM from three iterations of NCBI PSI-BLAST on the PDB with default cutoff values [22].
This PSSM was used as the query to run command line PSI-BLAST on the pdbaa file from the PISCES server (http://dunbrack.fccc.edu/pisces) [29]. pdbaa contains the sequence of every chain in every asymmetric unit of the PDB in FASTA format with resolution, R-factors, and SwissProt identifiers (e.g. AURKA_HUMAN). A total of 4908 PDB entries with 7277 kinase chains were identified. Some poorly aligned kinases and non-kinase proteins that were homologous to kinases but distantly related were removed.

The structure files were split by individual kinase chains in the asymmetric unit and renumbered by the UniProt protein numbering scheme. The mapping between PDB author numbering and UniProt was obtained from the Structure Integration with Function, Taxonomy and Sequence (SIFTS) database [24]. The SIFTS files were also used to extract mutation, phosphorylation, and missing residue annotations.

The structure files were also renumbered by a common residue numbering scheme using our protein kinase multiple sequence alignment. Each residue in a kinase domain was renumbered by its column number in the alignment. Therefore, aligned residues across different kinase sequences get the same residue number. For example, in these renumbered structure files the residue number of the DFGmotif across all kinases is 1338-1340. The conserved motifs for all the structures were identified from the same alignment.

Assigning conformational labels

Each kinase chain is assigned a spatial group and a dihedral label using our previous clustering scheme as a reference [11]. Our clustering scheme has three spatial groups: DFGin, DFGinter, and DFGout. These are sub-divided into dihedral clusters: DFGin (BLAminus, BLAplus, ABAminus, BLBminus, BLBplus, BLBtrans); DFGinter (BABtrans); and DFGout (BBAminus).

To determine the spatial group for each chain, the location of DFG-Phe in the active site was identified using the following criteria:

1. D1 ≤ 11 Å and D2 ≥ 11 Å: DFGin
2. D1 > 11 Å and D2 ≤ 14 Å: DFGout
3. D1 ≤ 11 Å and D2 ≤ 11 Å: DFGinter

where D1 is the distance from αC-Glu(+4)-Cα to DFG-Phe-Cζ and D2 is the distance from β3-Lys-Cα to DFG-Phe-Cζ. Any structure not satisfying the above criteria is considered an outlier and assigned the spatial label “None.”

To identify the dihedral label, the DFG-Phe rotamer type in each chain was first identified (minus, plus, trans). The chains for each rotamer type were then represented with a set of 6 backbone (Φ, Ψ) dihedrals from the X-DFG, DFG-Asp, and DFG-Phe residues. Using these dihedrals, the distance of each kinase chain was calculated from precomputed cluster centroid points for each cluster with the same rotamer type in the given spatial group. For example, the dihedral distance for all DFGin structures with Phe-minus was computed against BLAminus, ABAminus and BLBminus. The dihedral angle distance is computed using the following formula,

\[ D(i,j) = \frac{1}{6}\left[ D(\phi_i^{X},\phi_j^{X}) + D(\psi_i^{X},\psi_j^{X}) + D(\phi_i^{D},\phi_j^{D}) + D(\psi_i^{D},\psi_j^{D}) + D(\phi_i^{F},\phi_j^{F}) + D(\psi_i^{F},\psi_j^{F}) \right] \]

where

\[ D(\theta_1,\theta_2) = 2\left(1 - \cos(\theta_1 - \theta_2)\right) \]

A chain is assigned to a dihedral label if the distance from that cluster centroid is less than 0.45.
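The spatial criteria and the dihedral distance above can be sketched in a few lines of Python. This is a minimal illustration and not the Kincore code itself: the function names are hypothetical, angles are assumed to be in degrees, the spatial criteria are assumed to be checked in the order listed, and the centroid values in the example are placeholders rather than the published centroids from [11].

```python
import math

# Cutoff quoted in the text for accepting a dihedral cluster assignment.
DIHEDRAL_CUTOFF = 0.45


def spatial_label(d1, d2):
    """Spatial group from D1 and D2 (in angstroms), checked in the listed order."""
    if d1 <= 11.0 and d2 >= 11.0:
        return "DFGin"
    if d1 > 11.0 and d2 <= 14.0:
        return "DFGout"
    if d1 <= 11.0 and d2 <= 11.0:
        return "DFGinter"
    return "None"


def angle_distance(theta1, theta2):
    """D(theta1, theta2) = 2 * (1 - cos(theta1 - theta2)); angles in degrees (assumption)."""
    return 2.0 * (1.0 - math.cos(math.radians(theta1 - theta2)))


def dihedral_distance(chain, centroid):
    """Mean angular distance over the six (phi, psi) values of X-DFG, DFG-Asp, DFG-Phe."""
    if len(chain) != 6 or len(centroid) != 6:
        raise ValueError("expected six dihedral angles per structure")
    return sum(angle_distance(a, b) for a, b in zip(chain, centroid)) / 6.0


def dihedral_label(chain, centroids):
    """Label of the closest centroid if it lies within the cutoff, otherwise 'None'."""
    best_label, best_dist = min(
        ((label, dihedral_distance(chain, c)) for label, c in centroids.items()),
        key=lambda item: item[1],
    )
    return best_label if best_dist < DIHEDRAL_CUTOFF else "None"


# Placeholder centroids for illustration only; these are NOT the published values.
example_centroids = {
    "clusterA": (-120.0, 20.0, -120.0, 20.0, -90.0, 20.0),
    "clusterB": (60.0, 80.0, 60.0, 30.0, -100.0, 140.0),
}
```

In practice the candidate centroids passed to `dihedral_label` would be restricted to clusters with the same rotamer type within the chain's spatial group, as described in the text.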
The chains which have any motif residue missing or are distant from all the cluster centroids are assigned the dihedral label “None.”

The C-helix disposition is determined using the distance between the Cβ atoms of B3-Lys and C-helix-Glu(+4). A distance of <10 Å indicates that the salt bridge between the two residues is present, suggesting a C-helix-in conformation. A value of >10 Å suggests a C-helix-out conformation.

Ligand classification

The different regions of the ATP binding pocket are identified by specific residues using our common numbering scheme (Supplementary figure 1):

• ATP binding region: hinge residues; residues 426-428
• Back pocket: C-helix and partial regions of B4 and B5 strands, DFGmotif backbone; residues 106-147, 150-152, 184, 187-195, 420-422 and 1337-1339
• Type II-only pocket: exposed only in DFGout conformation; residues 153, 149, 959 and 1011

A contact between ligand atoms and protein residues is defined if the distance between any two atoms is ≤ 4.5 Å (hydrogens not included). Based on these contacts we have labeled the ligand types as follows:

1. Allosteric: any small molecule in the asymmetric unit whose minimum distance from the hinge region and C-helix-Glu(+4) residue is greater than 6.5 Å.
2. Type I½, subdivided as: Type I½-front, with at least three contacts in the back pocket and at least one contact with the N-terminal region of the C-helix; Type I½-back, with at least three contacts in the back pocket but no contact with the N-terminal region of the C-helix.
3. Type II: at least three contacts in the back pocket and at least one contact in the Type II-only pocket.
4. Type III: minimum distance from the hinge greater than 6 Å and at least three contacts in the back pocket.
5. Type I: all the ligands which do not satisfy the above criteria.

Identify conformation using webserver

The program uses the structure file uploaded by the user to extract the sequence of the protein. It aligns the sequence with precomputed HMM profiles of kinase phylogenetic groups (e.g. AGC.hmm, CAMK.hmm). The alignment with the best score is identified and used to determine the positions of the DFGmotif, B3-Lys, and C-helix-Glu(+4) residues. The program then computes the distance between specific atoms and dihedrals to identify spatial and dihedral labels using the assignment method described above.

Standalone program

The standalone program is written in Python 3.7. The program is available to download from https://github.com/vivekmodi/Kincore-standalone and can be run in a terminal window on macOS or Linux. The user can provide an individual .pdb or .cif file (also compressed .gz) or a list of files as input. It identifies the unknown conformation from a structure file in the same way as described for the webserver.

Software and libraries used

All the scripting and analysis is done using Python3 and depends on the Pandas (https://pandas.pydata.org) and Biopython [30] libraries.

Website and Database

Kincore is developed using the Flask web framework (https://flask.palletsprojects.com/en/1.1.x/). The webpages are written in HTML5 with style elements created using Bootstrap v4.5.0 (https://getbootstrap.com/). The 3D visualization is done using NGL Viewer (http://nglviewer.org/ngl/api/). PyMOL (v2.3) is used for creating download sessions [25]. The entire application is deployed on the internet using the Apache2 webserver.

Acknowledgements

The authors want to thank Maxim Shapovalov for his help in deploying the server. This work was funded by NIH grant R35 GM122517 to R.L.D.

References

1. Adams, J.A., Kinetic and catalytic mechanisms of protein kinases. Chem Rev, 2001. 101(8): p. 2271-90.
2. Blume-Jensen, P. and T. Hunter, Oncogenic kinase signalling. Nature, 2001. 411(6835): p. 355-365.
3. Lahiry, P., et al., Kinase mutations in human disease: interpreting genotype-phenotype relationships. Nat Rev Genet, 2010. 11(1): p. 60-74.
4. Zhang, J., P.L. Yang, and N.S. Gray, Targeting cancer with small molecule kinase inhibitors. Nat Rev Cancer, 2009. 9(1): p. 28-39.
5. Ferguson, F.M. and N.S. Gray, Kinase inhibitors: the road ahead. Nature Reviews Drug Discovery, 2018. 17(5): p. 353-377.
6. Manning, G., et al., The protein kinase complement of the human genome. Science, 2002. 298(5600): p. 1912-34.
7. Modi, V. and R.L. Dunbrack, Jr., A Structurally-Validated Multiple Sequence Alignment of 497 Human Protein Kinase Domains. Sci Rep, 2019. 9(1): p. 19790.
8. Vijayan, R., et al., Conformational analysis of the DFG-out kinase motif and biochemical profiling of structurally validated type II inhibitors. Journal of medicinal chemistry, 2015. 58(1): p. 466-479.
9. Möbitz, H., The ABC of protein kinase conformations. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, 2015. 1854(10): p. 1555-1566.
10. Ung, P.M.-U., R. Rahman, and A.
Schlessinger, Redefining the protein kinase conformational space with machine learning. Cell chemical biology, 2018. 25(7): p. 916-924.e2.
11. Modi, V. and R.L. Dunbrack, Defining a new nomenclature for the structures of active and inactive kinases. Proceedings of the National Academy of Sciences, 2019. 116(14): p. 6818-6827.
12. Lange, S.M., et al., Dimeric Structure of the Pseudokinase IRAK3 Suggests an Allosteric Mechanism for Negative Regulation. Structure, 2020.
13. Paul, F., Y. Meng, and B. Roux, Identification of Druggable Kinase Target Conformations Using Markov Model Metastable States Analysis of apo-Abl. J Chem Theory Comput, 2020. 16(3): p. 1896-1912.
14. Paul, A. and N. Srinivasan, Genome-wide and structural analyses of pseudokinases encoded in the genome of Arabidopsis thaliana provide functional insights. Proteins, 2020. 88(12): p. 1620-1638.
15. Dar, A.C. and K.M. Shokat, The evolution of protein kinase inhibitors from antagonists to agonists of cellular signaling. Annu Rev Biochem, 2011. 80: p. 769-95.
16. Zuccotto, F., et al., Through the "gatekeeper door": exploiting the active kinase conformation. J Med Chem, 2010. 53(7): p. 2681-94.
17. Gavrin, L.K. and E. Saiah, Approaches to discover non-ATP site kinase inhibitors. MedChemComm, 2013. 4(1): p. 41-51.
18. Fang, Z., C. Grutter, and D. Rauh, Strategies for the selective regulation of kinases with allosteric modulators: exploiting exclusive structural features. ACS Chem Biol, 2013. 8(1): p. 58-70.
19.
van Linden, O.P., et al., KLIFS: A knowledge-based structural database to navigate kinase-ligand interaction space. J Med Chem, 2013.
20. Roskoski, R., Jr., Classification of small molecule protein kinase inhibitors based upon the structures of their drug-enzyme complexes. Pharmacol Res, 2016. 103: p. 26-48.
21. wwPDB consortium, Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 2018. 47(D1): p. D520-D528.
22. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of database programs. Nucleic Acids Research, 1997. 25: p. 3389-3402.
23. UniProt Consortium, UniProt: a hub for protein information. Nucleic Acids Res, 2015. 43(Database issue): p. D204-12.
24. Velankar, S., et al., SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Research, 2013. 41(D1): p. D483-D489.
25. DeLano, W.L., The PyMOL molecular graphics system. 2002, Schrödinger, Inc.: San Carlos, CA.
26. Rahman, R., P.M.-U. Ung, and A. Schlessinger, KinaMetrix: a web resource to investigate kinase conformations and inhibitor space. Nucleic acids research, 2018. 47(D1): p. D361-D366.
27. Kanev, G.K., et al., KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Research, 2021. 49(D1): p. D562-D569.
28. Kirubakaran, P., et al., Comparative Modeling of CDK9 Inhibitors to Explore Selectivity and Structure-Activity Relationships. bioRxiv, 2020: p. 2020.06.08.138602.
29. Wang, G. and R.L. Dunbrack, Jr., PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W94-8.
30. Cock, P.J., et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 2009. 25(11): p. 1422-3.

10_1101-2021_02_12_430963 ---- Streamlining differential exon and 3' UTR usage with diffUTR

Stefan Gerber1,2, Gerhard Schratt2 & Pierre-Luc Germain1,3,4,*
1Group of Computational Neurogenomics, D-HEST Institute for Neurosciences, ETH Zürich
2Lab of Systems Neuroscience, D-HEST Institute for Neurosciences, ETH Zürich
3Lab of Statistical Bioinformatics, DMLS, University of Zürich
4SIB Swiss Institute of Bioinformatics
*Correspondence to Pierre-Luc Germain (pierre-luc.germain@hest.ethz.ch)

Abstract

Background: Despite the importance of alternative poly-adenylation and 3’ UTR length for a variety of biological phenomena, there are limited means of detecting UTR changes from standard transcriptomic data.

Results: We present the diffUTR Bioconductor package which streamlines and improves upon differential exon usage (DEU) analyses, and leverages existing DEU tools and alternative poly-adenylation site databases to enable differential 3’ UTR usage analysis. We demonstrate the diffUTR features and show that it is more flexible and more accurate than state-of-the-art alternatives, both in simulations and in real data.

Conclusions: diffUTR enables differential 3’ UTR analysis and more generally facilitates DEU and the exploration of their results.

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.12.430963; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).

Background

Coding sequences in eukaryotic mRNAs are generally flanked by transcribed but untranslated regions (UTRs) which can impact RNA stability, translation, and localization [1]. In particular, the length of 3’ UTRs often varies even within a given gene due to the use of different poly-adenylation (polyA) sites [2], leading especially to the inclusion or not of regulatory elements such as binding sites for microRNAs (miRNAs) or RNA-binding proteins [3]. Alternative poly-adenylation (APA) is highly prevalent in mammals [4] and has been shown to be important to a variety of biological phenomena [5,6,7,8].

A number of methods for 3’ end sequencing have been developed with the goal of mapping APA sites [9,10,11,12,13,4,14], leading to the development of atlases such as PolyASite [15] or PolyA DB [16]. As such methods are only marginally used, however, it would be beneficial to leverage the widespread availability of traditional RNA-seq for the purpose of identifying changes in 3’ UTR usage. A chief difficulty here is that most UTR variants are not catalogued in standard transcript annotations, limiting the utility of standard transcript-level quantification based on reference transcripts, such as salmon [17]. Nevertheless, a number of methods have been developed for this purpose. Methods like DaPars [18] and APAtrap [19] try to infer new polyA sites from read coverage changes in RNA-seq experiments; however, the depletion of RNAseq coverage at the 3’ end of transcripts makes the precise inference of polyA sites challenging [20].
Other tools like QAPA [8] and APAlyzer [21] use already available polyA site databases but only compare the usage of the most proximal polyA sites to distal ones in a pairwise fashion, and fail to grasp the full complexity of dynamic APA when there are three or more polyA sites, which is the case for approximately half of mammalian transcripts [4]. Furthermore, they do not make use of the already proven statistical frameworks to analyse differential exon usage (DEU) from count data [22,23,24,25]. These frameworks take into account the inherent properties of read count distributions and are arguably more appropriate to analyse differences in relative polyA site usage, which is conceptually highly similar to DEU. We therefore developed diffUTR, which streamlines and improves upon well established DEU tools, and leverages them, along with polyA site databases, to infer alternative 3’ UTR usage across conditions.

Results

Streamlining differential bin/exon usage analysis

Popular bin-based DEU methods are provided by the limma [25,24], edgeR [23] and DEXSeq [22] packages. However, their usage is not straightforward for non-experienced users, and their results are often difficult to interpret. We therefore developed a simple workflow (Figure 1A), usable with any of the three methods but standardizing inputs and outputs.
In particular, bin annotation and quantification, as well as differential usage results, are all stored in a RangedSummarizedExperiment [26], which facilitates data storage and exploration, and enables advanced plotting functions irrespective of the underlying method. diffUTR is flexible in its application, and supports the use of strand information if available.

[Figure 1 panels. A: workflow from transcript annotation (GRanges / EnsDb / .gtf) and polyA sites (GRanges / .bed) through prepareBins and countFeatures (Rsubread, bam files) to a RangedSummarizedExperiment, the DEU wrappers (DEXSeq, diffSplice2, edgeR) and the plotting functions. B: bins derived from transcript annotation (CDS, UTR) and polyA sites.]

Figure 1: Overview. A: diffUTR workflow. Bins are prepared from various types of gene annotations as well as, optionally, additional APA-driven segmentation and extension, then read counts within bins as well as bin information are stored in a standardized RangedSummarizedExperiment, which can then be used as an input for any of the three DEU methods, producing again a standardized output that can be used with the package’s plotting functions. B: Schematic of bin preparation. APA sites are used to further segment and extend disjoined gene bins.

Improvement to diffSplice

diffUTR also implements an improved version of limma’s diffSplice method which does not assume constant residual variance across bins of the same gene (see diffSplice2). To test the effect of these modifications in a standard DEU setting, we ran both versions (as well as the other two DEU methods) on simulated data from a previous DEU benchmark [27]. The precision and recall results (Figure 2A) confirmed the previously observed superiority of DEXSeq and, more generally, the imperfect false discovery rate (FDR) control. Importantly, it also confirmed that our improved diffSplice2 method outperforms the original, at no additional computing cost.

[Figure 2 panels. A: Differential exon usage. B: Differential UTR usage. Axes: FDR (x) against TPR (y). Methods shown: APAlyzer, APAlyzer2, DaPars, DEXSeq, diffSplice, diffSplice2, edgeR, QAPA.dPau, QAPA.pval.]

Figure 2: FDR and recall (TPR) on simulated data. A: In the classical DEU context. B: In the differential UTR usage context. The dashed line indicates a real False Discovery Rate (FDR) of 5%, and the dots indicate nominal FDRs of 10, 5 and 1%. diffUTR methods far outperform QAPA and DaPars. In both contexts, our modifications to diffSplice significantly improve its performance.

Application to differential UTR usage and benchmark on a simulation

We next sought to evaluate the methods when applied to differential UTR analysis. For this purpose, APA sites are used to further segment and extend UTR bins, as illustrated in Figure 1B (see methods for the details). Given the absence of RNAseq data with a differential UTR usage ground truth, we simulated reads with known UTR differences from real data (see Simulated Data). We then ran the different diffUTR methods (as well as the unmodified diffSplice variant), and compared them to alternative methods.
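The APA-driven segmentation of bins mentioned above (Figure 1B) can be illustrated with a short sketch. It is written in Python for illustration and is not the package's R implementation: the function name `segment_bins` is hypothetical, bins are assumed to be disjoint half-open (start, end) intervals on a single strand, polyA sites are assumed to be single coordinates, and the extension of UTR bins beyond the annotated gene end is not modeled.

```python
from bisect import bisect_left, bisect_right


def segment_bins(bins, polya_sites):
    """Split disjoint half-open bins (start, end) at every polyA site strictly inside them."""
    sites = sorted(set(polya_sites))
    segmented = []
    for start, end in sorted(bins):
        # Only sites with start < site < end cut the bin; sites on a boundary change nothing.
        lo = bisect_right(sites, start)
        hi = bisect_left(sites, end)
        edges = [start] + sites[lo:hi] + [end]
        segmented.extend(zip(edges[:-1], edges[1:]))
    return segmented
```

Each resulting sub-bin can then be counted and tested like an exon bin, which is what allows the DEU machinery to be reused for UTR usage.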
While DaPars and APAlyzer provide gene-level significance testing, QAPA does not, and our attempts to use its equivalence classes with standard transcript usage methods (see methods) gave very poor results. Therefore, for the purpose of comparison we tried two alternatives: simply ranking genes according to QAPA’s main output, i.e. the absolute difference in polyA site usage between conditions (|∆PAU|), labeled in Figure 2B as QAPA.dPau, or running t-tests on the log-transformed PAU values, labeled as QAPA.pval. Since APAlyzer produces different analyses for genes’ 3’ end and intronic APA usage, we used both the 3’ end results and a combination of the two (the latter shown as APAlyzer2). As Figure 2B shows, all diffUTR methods outperformed alternatives by far. On this test, our improved diffSplice2 had comparable performance to DEXSeq, at a fraction of the computing costs.

Differential UTR usage in real data

We next sought to test diffUTR in real data. First, since 3’ UTRs are known to generally lengthen during neuronal differentiation [28,8], we expected to observe a skew towards positive fold changes of 3’ UTR bins when comparing RNAseq experiments from embryonic stem cells (ESC) and ESC-derived neurons.
We therefore re-analyzed data from [29] and clearly observed the expected skew among statistically significant genes, especially for bins with a higher expression (Figure 3A).

We next found both 3’ sequencing and standard RNAseq data from samples of mouse hippocampal slices undergoing Forskolin-induced long-term potentiation [30], which enabled us to use the 3’ sequencing data as a truth for analysis performed on the standard RNAseq data (Figure 3B and Supplementary Figure 1). In this case we represent the results through receiver-operator characteristic (ROC) curves, since the precision-recall curves make the differences less visible due to the lower general power. Although power to detect UTR changes is necessarily low with respect to 3’ sequencing, we again observed that diffUTR methods clearly outperformed all alternative methods.

Exploring differential exon/UTR usage results

diffUTR provides three main plot types to explore differential bin usage analyses, each with a number of variations. Figure 4 showcases them in the context of long-term potentiation of mouse hippocampal neurons [30].

plotTopGenes (Figure 4A) provides gene-level statistic plots (similar to a ‘volcano’ plot), which come in two variations. For standard DEU analysis, absolute bin-level coefficients are weighted by significance and averaged to produce gene-level estimates of effect sizes. For differential 3’ UTR usage, where bins are expected to have consistent directions (i.e. lengthening or shortening of the UTR) and where their size is expected to have a strong impact on biological function, the signed bin-level coefficients are weighted both by size and significance to produce gene-level estimates of effect sizes.
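As a rough illustration of the two aggregation schemes just described, the sketch below computes weighted means of bin-level coefficients. The exact weights are not given in the text, so this is only one plausible reading: significance is assumed to enter as -log10(p), bin size as bin width in nucleotides, and the function names are hypothetical rather than part of the diffUTR API.

```python
import math


def _weighted_mean(values, weights):
    """Weighted mean, returning 0.0 when all weights are zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total


def gene_effect_deu(coefs, pvalues):
    """Standard DEU: absolute coefficients weighted by significance (assumed -log10 p)."""
    weights = [-math.log10(p) for p in pvalues]
    return _weighted_mean([abs(c) for c in coefs], weights)


def gene_effect_utr(coefs, pvalues, widths):
    """3' UTR usage: signed coefficients weighted by bin width and by significance."""
    weights = [w * -math.log10(p) for w, p in zip(widths, pvalues)]
    return _weighted_mean(coefs, weights)
```

Keeping the sign in the UTR variant means that bins moving in opposite directions cancel out, so a large gene-level value indicates consistent lengthening or shortening.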
By default, the size of the points reflects the relative expression of the genes, and the color the relative expression of the significant bins with respect to the gene.

Figure 3: Differential UTR analysis on real data. A: 3’ UTR lengthening during neuronal differentiation. Plotted are the UTR bins found statistically significant (bin- and gene-level FDR both < 0.1) by diffUTR (diffSplice2) when comparing in vitro differentiated neurons to mouse embryonic stem cells (x-axis: bin log2(foldchange); y-axis: bin mean log(CPM)). The color indicates the point density. Note the clear skew towards a positive bin-level foldchange (indicative, in most cases, of a UTR lengthening), especially for bins with a higher mean count (CPM = counts per million reads sequenced). B: Receiver-operator characteristic (ROC) curves of differential UTR usage analysis on the LTP dataset, using 3’ sequencing to establish the ground truth (x-axis: FPR; y-axis: TPR; methods shown: APAlyzer, DaPars, DEXSeq, diffSplice2, QAPA.pval). The axes are square-root-transformed to improve visibility, and only a subset of method variations are shown (see Supplementary Figure 1 for all variants).

deuBinPlot (Figure 4B) provides bin-level statistic plots for a given gene, similar to those produced by DEXSeq and limma, but offering more flexibility. They can be plotted as overall bin statistics, per condition, or per sample, and can display various types of values.
Importantly, since all data and annotation are contained in the object, these can easily be included in the plots. Figure 4B shows a lengthening of the Jund 3’ UTR in the LTP group.

Finally, geneBinHeatmap (Figure 4C) provides a compact, bin-per-sample heatmap representation of a gene, allowing the simultaneous visualization of various information. We found these representations particularly useful to prioritize candidates from differential bin usage analyses. For example, many genes show differential usage of bins which are generally not included in most transcripts of that gene (low count density), and are therefore less likely to be relevant.

Figure 4: Plotting functions. A: plotTopGenes provides significance and effect size statistics aggregated at the gene level (x-axis: weighted absolute coefficient; y-axis: −log10(q-value)). B: deuBinPlot provides a more flexible version of the bin-level gene plots generated by common DEU packages. Shown here is the upregulation of Jund 3’ UTR upon LTP. C: geneBinHeatmap provides a compact, bin-per-sample heatmap representation of a gene.

Further variations tested

During implementation, we tested other changes to the method which were ultimately discarded as they did not improve performance, but which we briefly report here.

First, differential UTR analysis differs from typical differential exon usage analysis in that the vast majority of UTR bins are consecutively transcribed, meaning that changes in the usage of a bin should also be visible in downstream bins. We therefore reasoned that it would be beneficial to use this property to improve statistical analysis. Specifically, connected bins with significant fold changes in the same direction could be unified and their p-values aggregated; we tested a rudimentary implementation using Fisher’s aggregation.
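To make the aggregation idea concrete, the following is a minimal Python sketch of Fisher's method applied to runs of consecutive bins that change in the same direction. It is an illustration under stated assumptions, not the implementation that was tested: the function names (fisher_combine, aggregate_runs) are hypothetical, bins are assumed to be given in transcription order, and runs are formed from the sign of the fold change only, without the significance filter described above. Fisher's method also assumes independent p-values, which consecutive bins of one UTR may not satisfy.

```python
import math
from itertools import groupby


def chi2_sf_even_df(x, df):
    """Survival function of a chi-square distribution with even df.

    For df = 2m: sf(x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    """
    m = df // 2
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, m):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2n df under H0."""
    # Clamp to avoid log(0); the floor value is an arbitrary choice.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in pvalues)
    return chi2_sf_even_df(stat, 2 * len(pvalues))


def aggregate_runs(log_fc, pvalues):
    """Combine p-values over runs of consecutive bins sharing a fold-change sign.

    Returns a list of (sign, number_of_bins, combined_p) tuples.
    """
    signs = [1 if fc >= 0 else -1 for fc in log_fc]
    out = []
    idx = 0
    for sign, grp in groupby(signs):
        n = len(list(grp))
        out.append((sign, n, fisher_combine(pvalues[idx:idx + n])))
        idx += n
    return out


if __name__ == "__main__":
    # Three consecutive bins of one hypothetical gene.
    for run in aggregate_runs([0.5, 0.8, -0.3], [0.5, 0.5, 0.2]):
        print(run)
```

A run containing a single bin returns that bin's own p-value, so the aggregation only changes results where neighbouring bins agree in direction.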
However, this aggregation decreased accuracy and led to worse FDR control (Supplementary Figure 2).

Second, most methods compare bin-level foldchanges to gene-level ones to identify bins behaving differently from the others, and we reasoned that, especially for genes with more UTR bins than CDS bins, including counts of 3’ UTR when calculating overall gene expression could underestimate the gene expression and possibly mistake the UTR foldchange for the gene foldchange. We therefore tried a modification of diffSplice to only calculate the gene foldchange from coding sequence (CDS) bins and then compare it to the individual bins. Again, this approach proved unsuccessful (Supplementary Figure 3).

Discussion

diffUTR streamlines DEU analysis and outperforms alternative methods in inferring UTR changes, which demonstrates the utility of harnessing powerful, well-established frameworks for new ends. It must be noted that the way in which the simulation was performed, i.e. elongating transcripts to the next polyA site(s), is similar to the way diffUTR disjoins the annotation into bins, which could cause a bias towards this method (as well as QAPA and APAlyzer, which also make use of alternative polyA sites). However, this is unlikely to be the reason for the observed superiority of diffUTR-based methods given the considerable extent by which they outperformed alternatives, and the observation of similar results in real data.

Similar to DEU tools [27], diffUTR fails to control the FDR correctly, and our attempts so far to improve this remained unsuccessful. We therefore recommend prudence with results close to the significance threshold. In addition, and in contrast to DEU where exons are subject to splicing in a potentially independent fashion, 3’ UTRs typically do not undergo splicing and therefore only differ in length between conditions.
This means that the behavior of a UTR bin is dependent on that of upstream bins, a property which could be exploited to improve accuracy at the gene level. However, our simple attempt to do so by combining p-values of consecutive bins did not have the desired outcome, pointing to the need for more research in this direction.

Further, the bin-based approach has the drawback of not pinpointing the exact UTR locations: it is limited to the bin resolution, and the bins themselves are limited by incomplete transcript and APA annotations. Additionally, because there is a significant drop-off in read coverage at the end of transcripts, we have observed that it is often bins upstream of the actual UTR lengthening/shortening event which give a statistically-significant signal rather than the one truly affected. This is why we have provided tools to enable the further inspection of events in a given gene.

Finally, the results of bin-based analyses are limited by the overlaps of transcripts from different genes, an issue on which differential transcript usage analysis approaches appear superior (e.g. [31]). However, transcript usage analysis tools are dependent on the completeness of the transcript annotation, while bin-based approaches are more open to the discovery of unannotated transcript variants, which is especially relevant for differential UTR usage.
Here, we made the choice of including ambiguous bins, but flagging them as such, enabling users to interpret them with caution.

While DEXSeq remains the tool of choice for relative bin usage analyses, it scales very badly to larger sample sizes, and alternatives might be needed in some contexts. Our changes to limma’s original diffSplice method consistently result in more accurate predictions, making this new method the best compromise for bin-based approaches when DEXSeq is not applicable. More generally, it also shows that even with well-established approaches, there is still room for incremental, but non-negligible improvement.

Methods

0.1 Data and code availability

The data objects and code used to produce the figures are available through the https://github.com/plger/diffUTR_paper repository. The diffUTR source code is available at https://github.com/ETHZ-INS/diffUTR.

0.2 RNAseq data processing

For the evaluation of diffSplice2 in a standard DEU case, we used bin count data obtained from the authors of the original DEU benchmark [27]. For other datasets, reads were downloaded from the SRA, aligned to the GRCm38.p6 genome using STAR 2.7.3a with default parameters and the GENCODE M25 annotation as guide. The same gene annotation was used as input for bin creation.
0.3 diffUTR

diffUTR is implemented as a Bioconductor package making use of the extensive libraries available, especially the GenomicRanges package [32] and the different DEU methods (see Differential analysis).

0.3.1 Preparing bins

Exons are extracted from the genome annotation and flattened into non-overlapping bins (Figure 1B). In other words, the exon annotation is fragmented into the widest ranges where the set of overlapping features is the same. Bins that do not overlap with coding sequences (CDS) and belong to a protein-coding transcript are labeled as UTR and the rest as CDS. When APA sites are also provided as input (for the purpose of this article, polyAsite v2.0 sites were used), bins are further segmented and/or extended. For this, the closest upstream CDS or UTR is found for every poly(A) site, and the UTR is defined from this boundary to the poly(A) site and assigned to the corresponding gene and transcript (Figure 1B). If a newly defined UTR exceeds a predefined length specified by maxUTRbinSize (default 15000 bp), it is ignored as unlikely to be a real UTR. Moreover, if the start of a gene is the closest upstream sequence before any UTR or CDS, the newly defined UTR is ignored to avoid assignment problems. In order to later differentiate between regions that are 3’ or 5’ UTRs, regions that are downstream of the last CDS of a given transcript are labeled as 3’ UTR.
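The flattening step described above can be illustrated with a short Python sketch. This is not the package's R code; it only mirrors the idea of fragmenting exons into ranges over which the set of overlapping features is constant (similar in spirit to the disjoin operation of GenomicRanges). The function names disjoin and label_bins, the half-open (start, end) coordinates, and the simplification that every transcript is protein-coding are assumptions made for the example; APA-based segmentation and 3’/5’ distinction are left out.

```python
def disjoin(exons):
    """Fragment exons into non-overlapping bins.

    exons: list of (start, end, transcript_id) with half-open coordinates.
    Returns a list of (start, end, frozenset_of_transcript_ids); within each
    bin the set of overlapping exons is constant.
    """
    # Every exon boundary is a potential bin boundary.
    points = sorted({p for s, e, _ in exons for p in (s, e)})
    bins = []
    for lo, hi in zip(points, points[1:]):
        names = frozenset(n for s, e, n in exons if s <= lo and hi <= e)
        if names:  # skip gaps not covered by any exon
            bins.append((lo, hi, names))
    return bins


def label_bins(bins, cds):
    """Label each bin 'CDS' if it overlaps any CDS interval, else 'UTR'.

    Simplification: all transcripts are assumed to be protein-coding, so the
    'non-coding' label is not handled here.
    """
    labelled = []
    for lo, hi, names in bins:
        overlaps_cds = any(lo < ce and cs < hi for cs, ce in cds)
        labelled.append((lo, hi, sorted(names), "CDS" if overlaps_cds else "UTR"))
    return labelled


if __name__ == "__main__":
    # Two overlapping exons from hypothetical transcripts tx1 and tx2.
    exons = [(100, 200, "tx1"), (150, 300, "tx2")]
    for b in label_bins(disjoin(exons), cds=[(100, 180)]):
        print(b)
```

Reads can then be counted per bin, and each bin keeps the list of transcripts it derives from, which is what allows ambiguous bins to be flagged later.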
The label ‘non-coding’ is assigned to all bins that have no protein-coding transcript overlapping them.

If a bin originates from regions belonging to different genes, the bin is duplicated and assigned once to each gene, so that each gene contains the same fragment once. Alternatively, the genewise argument can be used so that only exons belonging to the same gene are considered when flattening.

0.3.2 Quantification

For quantification, countFeatures() uses the featureCounts() function from the Rsubread package [33] to count previously mapped reads overlapping each bin. By default, every read is assigned once to every bin it overlaps with and can therefore be counted multiple times, which is needed because many bins are shorter than the read length. Alternative counting methods, such as summarizeOverlaps() from the GenomicAlignments package [32], performed considerably worse in the simulation. The function returns a RangedSummarizedExperiment object [26], containing the read counts as well as the bin annotation.

0.3.3 Differential analysis

Three wrappers implement corresponding DEU methods on the RangedSummarizedExperiment object previously generated, returning results as further standardized annotation within the object.
For differential UTR analysis, gene-level results are obtained by filtering the bin-level results for those assigned to the type UTR and/or 3’ UTR, and setting all other p-values to 1 before aggregation.

diffSpliceDGE.wrapper(). This is a wrapper around edgeR’s DEU method based on fitting a negative binomial generalized linear model [23]. In a first step, the bins are filtered to decide which have a large enough read count to be kept for the statistical analysis (filterByExpr()), the library sizes are normalized (calcNormFactors()) and the dispersion is estimated (estimateDisp()). After this the model is fitted (glmFit()). If the option QLF = TRUE (default), an extended model is fitted, using quasi-likelihood methods to account for gene-specific variability (glmQLFit()). In the last step, bin fold changes are tested to be different from overall gene fold changes, using a likelihood ratio test or a quasi-likelihood F-test depending on the QLF option chosen (diffSpliceDGE()). The gene-level p-values are obtained by Simes’ method [34].

DEXseq.wrapper(). This method implements the standard DEXSeq differential exon usage pipeline [22]. Like edgeR, it is based on fitting a negative binomial model, but instead of comparing fold change differences between bins and genes, DEXSeq compares a full model containing a term corresponding to the change in exon usage between conditions to a reduced model without this term. The two fits are compared using a χ2 likelihood-ratio test. The libraries are normalized (estimateSizeFactors()), the dispersion is estimated (estimateDispersions()) and the models are fitted (testForDEU()). In a last step, the fold changes between the bins are estimated (estimateExonFoldChanges()).
To obtain gene-level results, the function perGeneQValue() is used, which is based on the Šidák method [35].

diffSplice.wrapper() and diffSplice2. This method implements the differential exon usage pipeline of limma for RNA-seq data [25]. The pre-processing is identical to diffSpliceDGE.wrapper(), then the precision weights are estimated (limma::voom()) and the linear models are fitted (limma::lmFit()). In the last step, bin fold changes are tested to be different from overall gene fold changes, using a moderated t-test (diffSplice() or, by default, diffSplice2(); see below). The gene-level p-values are obtained by Simes’ method [34].

The diffUTR::diffSplice2 function provides an improved version of limma’s original diffSplice method. diffSplice works on the bin-wise coefficient of the linear model which corresponds to the log2 fold changes between conditions. It compares the log2(fold change) $\hat{\beta}_{k,g}$ of a bin $k$ belonging to gene $g$ to a weighted average $\hat{B}_{k,g}$ of the log2(fold change) of all the other bins of the same gene combined (the subscript $g$ will henceforth be omitted for ease of reading). The weighted average of all the other bins in the same gene is calculated by

$$\hat{B}_k = \frac{\sum_{i, i \neq k}^{N} w_i \hat{\beta}_i}{\sum_{i, i \neq k}^{N} w_i} \qquad (1)$$

where $w_i = \frac{1}{u_i^2}$ and $u_i$ refers to the diagonal elements of the unscaled covariance matrix $(X^T V X)^{-1}$. $X$ is the design matrix and $V$ corresponds to the weight matrix estimated by voom.
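As a small numerical illustration of Equation 1, the sketch below computes the leave-one-out weighted average for every bin of a single gene in Python. It is not limma's or diffUTR's code; the function name leave_one_out_average and the toy values are assumptions for the example, and the gene is assumed to have at least two bins.

```python
def leave_one_out_average(beta, u):
    """Equation 1: for each bin k, the weighted mean of the other bins.

    beta: per-bin log2 fold changes of one gene.
    u:    per-bin unscaled standard deviations; weights are w_i = 1 / u_i^2.
    Requires at least two bins, otherwise the denominator is zero.
    """
    w = [1.0 / (ui * ui) for ui in u]
    sum_w = sum(w)
    sum_wb = sum(wi * bi for wi, bi in zip(w, beta))
    # Remove bin k's own contribution from both sums instead of re-summing.
    return [(sum_wb - wk * bk) / (sum_w - wk) for wk, bk in zip(w, beta)]


if __name__ == "__main__":
    beta = [1.0, 2.0, 4.0]   # toy log2 fold changes
    u = [1.0, 1.0, 0.5]      # toy unscaled standard deviations
    print(leave_one_out_average(beta, u))
```

Subtracting each bin's own term from the gene-wide sums gives the same result as summing over all other bins, and makes explicit that a bin with a small $u_k$ (large weight) strongly influences the average seen by its neighbours.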
The difference of log2 fold changes, which is also the coefficient returned by diffSplice(), is then calculated by $\hat{C}_k = \hat{\beta}_k - \hat{B}_k$. Instead of calculating the t-statistic with $\hat{C}_k$, this value is scaled again in the original code:

$$\hat{D}_k = \hat{C}_k \sqrt{1 - \frac{w_k}{\sum_{i}^{N} w_i}} \qquad (2)$$

and the t-statistic is calculated as:

$$t_k = \frac{\hat{D}_k}{u_k s_g} \qquad (3)$$

$s_g^2$ refers to the posterior residual variance of gene $g$, which is calculated by averaging the sample values of the residual variances of all the bins in the gene, and then squeezing these residual variances of all genes using an empirical Bayes method. This assumes that the residual variance is constant across all bins of the same gene.

In diffSplice2(), we applied three changes to the above method. First, the residual variances are not assumed to be constant across all bins of the same gene. This results in the sample values of the residual variances of every bin now being squeezed using an empirical Bayes method, resulting in posterior variances $s_i^2$ for every individual bin $i$. Second, the weights $w_i$, used to calculate $\hat{B}_k$, now incorporate the individual variances by $w_i = \frac{1}{s_i^2 u_i^2}$. Third, the $\hat{C}_k$ value is directly used to calculate the t-statistic, which after all these changes now corresponds to

$$t_k = \frac{\hat{C}_k}{u_k s_i}. \qquad (4)$$

0.4 Simulated Data

The simulation was done using the Polyester R package [36] using parameters obtained from the control samples of mouse hippocampus RNAseq [30].
Using salmon [17] with a decoy-aware transcriptome index for the mm10 genome from [37], the abundances for each transcript were first estimated to learn parameters for the simulation. 1000 transcripts from different genes were randomly chosen. The last exon of all these transcripts was lengthened to the next, second next or third next downstream APA site annotated in the polyAsite database [15]. Duplicates of these transcripts were generated, which had less or no lengthening of their last exon, generating pairs of transcripts with different UTR lengths. For each transcript pair, one transcript was up-regulated and the other down-regulated by the same sampled fold change between 1.3 and 5. To make the simulation more realistic, fold changes were also assigned to 300 genes from the set with differential UTR, and 300 genes that did not have differences in UTR usage. Reads were then generated for two conditions with three replicates each using the simulate_experiment() function with the options paired = FALSE, error_model = "illumina5", bias = "cdnaf" and strand_specific = TRUE. The simulated reads are available on figshare at https://dx.doi.org/10.6084/m9.figshare.13726143.

0.5 3’-seq analysis

To establish a set of true relative differences in UTR usage from the 3’ sequencing data [30], we downloaded the authors’ counts per cluster from the Gene Expression Omnibus (file GSE84643_3READS_count_table.txt.gz). We used the 3h treatment because we observed it to have the strongest signal, and excluded one sample (A6) that appeared to be a strong outlier based on PCA and MDS plots. We kept only clusters with at least 50 reads in at least 2 samples, and used DEXSeq to fit a negative binomial model on each gene and estimate the significance of the cluster:condition term. We considered as true positives genes with a gene-level and bin-level q-value ≤ 0.1, and as true negatives genes with a gene-level q-value ≥ 0.8.
Genes for which all tested methods produced a p-value of 1 or NA (i.e. genes filtered out as too lowly expressed in the standard RNAseq) were excluded from the benchmark.

0.6 Comparisons with alternatives

For the comparison of methods, all functions were used with their default parameters and run according to their manual. As QAPA and DaPars do not provide means to aggregate the results to the gene level, this was implemented separately. For DaPars the p-values were aggregated to the gene level by using Simes’ method [34] for comparability with diffUTR. Aggregation by taking the minimum p-value of all the transcripts in a gene produced extremely similar results. For QAPA, |∆PAU| was calculated and aggregated to the gene level by taking the maximum from all transcripts of a gene, and the genes were ranked by this value. Alternatively, we also tested applying a t-test on the log-transformed PAU values (log-transforming had a negligible effect), followed by Simes’ gene-level aggregation. Attempts to complement QAPA with p-values estimated from established statistical tests working with its equivalence classes, such as BANDITS [31], did not improve the results and were therefore discarded so as not to distort the original method.
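The two gene-level aggregation rules mentioned above for DaPars, Simes’ method and the minimum p-value, can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the comparison (which is in the repository cited in this section); the function names simes and gene_level and the (gene, p-value) input layout are assumptions.

```python
from collections import defaultdict


def simes(pvalues):
    """Simes' combined p-value: min over i of n * p_(i) / i, with p sorted ascending."""
    p = sorted(pvalues)
    n = len(p)
    return min(1.0, min(n * pi / (i + 1) for i, pi in enumerate(p)))


def gene_level(records, method=simes):
    """Aggregate transcript- or bin-level p-values to one value per gene.

    records: iterable of (gene_id, p_value) pairs.
    method:  aggregation function, e.g. simes or the built-in min.
    """
    by_gene = defaultdict(list)
    for gene, p in records:
        by_gene[gene].append(p)
    return {gene: method(ps) for gene, ps in by_gene.items()}


if __name__ == "__main__":
    # Hypothetical p-values for two genes.
    records = [("geneA", 0.01), ("geneA", 0.04), ("geneA", 0.03), ("geneB", 0.5)]
    print(gene_level(records))              # Simes aggregation
    print(gene_level(records, method=min))  # minimum p-value
```

Unlike the raw minimum, the Simes value accounts for the number of p-values per gene, so genes with many transcripts or bins are not favoured merely for having more tests.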
Finally, for APAlyzer2 we combined the 3’ UTR and intronic APA analyses by using the minimum of the two p-values. See the https://github.com/plger/diffUTR_paper repository for details.

We used the following software versions for comparisons: Polyester 1.24.0, DEXSeq 1.34.0, edgeR 3.30.0, limma 3.44.0, DaPars 0.9.1, APAlyzer 1.5.5. For QAPA, we used salmon 1.3.0 with validateMappings.

Competing interests

The authors declare no competing interests besides being the developers of the described package.

Authors’ contributions

SG developed the bin preparation and the diffSplice modification, and ran most of the analyses. PLG and SG wrote the package and paper. PLG and GS supervised the project.

Acknowledgements

SG performed this research as part of his bachelor thesis in the Interdisciplinary Sciences program at ETH. PLG’s position is co-funded by Prof. Mark Robinson (Institute of Molecular Life Sciences, University of Zurich) and Professors Gerhard Schratt, Johannes Bohacek and Isabelle Mansuy (Institute of Neuroscience, ETH Zurich). GS is supported by grants from the SNF (SNF 179651, SNF 189486) and the ETH (ETH-24 18-2 (NeuroSno)). We thank the Robinson group (UZH) for feedback.

References

1. Lewis, J. D., Gunderson, S. I. & Mattaj, I. W. The influence of 5′ and 3′ end structures on pre-mRNA metabolism. Journal of Cell Science. issn: 00219533 (1995).
2. Tian, B. & Manley, J. L. Alternative polyadenylation of mRNA precursors.
Nature Reviews Molecular Cell Biology. issn: 14710080 (2016).
3. Fabian, M. R., Sonenberg, N. & Filipowicz, W. Regulation of mRNA translation and stability by microRNAs. Annual Review of Biochemistry. issn: 00664154 (2010).
4. Derti, A. et al. A quantitative atlas of polyadenylation in five mammals. Genome Research. issn: 10889051 (2012).
5. Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A. & Burge, C. B. Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science. issn: 00368075 (2008).
6. Mayr, C. & Bartel, D. P. Widespread Shortening of 3′UTRs by Alternative Cleavage and Polyadenylation Activates Oncogenes in Cancer Cells. Cell. issn: 00928674 (2009).
7. Miura, P., Shenker, S., Andreu-Agullo, C., Westholm, J. O. & Lai, E. C. Widespread and extensive lengthening of 3′ UTRs in the mammalian brain. Genome Research. issn: 10889051 (2013).
8. Ha, K. C., Blencowe, B. J. & Morris, Q. QAPA: A new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biology. issn: 1474760X (2018).
9. Fox-Walsh, K., Davis-Turak, J., Zhou, Y., Li, H. & Fu, X. D. A multiplex RNA-seq strategy to profile poly(A+) RNA: Application to analysis of transcription response and 3′ end formation. Genomics. issn: 08887543 (2011).
10. Fu, Y. et al. Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing. Genome Research. issn: 10889051 (2011).
11. Zheng, D., Liu, X. & Tian, B. 3′READS+, a sensitive and accurate method for 3′ end sequencing of polyadenylated RNA. RNA. issn: 14699001 (2016).
12. Jan, C. H., Friedman, R. C., Ruby, J. G. & Bartel, D. P. Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Nature. issn: 00280836 (2011).
13. Shepard, P. J. et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq.
RNA. issn: 13558382 (2011).
14. Hwang, H. W. et al. cTag-PAPERCLIP Reveals Alternative Polyadenylation Promotes Cell-Type Specific Protein Diversity and Shifts Araf Isoforms with Microglia Activation. Neuron. issn: 10974199 (2017).
15. Herrmann, C. J. et al. PolyASite 2.0: A consolidated atlas of polyadenylation sites from 3′ end sequencing. Nucleic Acids Research. issn: 13624962 (2020).
16. Wang, R., Nambiar, R., Zheng, D. & Tian, B. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Research 46, D315–D319. issn: 0305-1048. https://doi.org/10.1093/nar/gkx1000 (2021) (Jan. 2018).
17. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419. issn: 1548-7105. https://www.nature.com/articles/nmeth.4197 (2021) (Apr. 2017).
18. Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nature Communications. issn: 20411723 (2014).
19. Ye, C., Long, Y., Ji, G., Li, Q. Q. & Wu, X. APAtrap: Identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics. issn: 14602059 (2018).
20. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. issn: 14710056 (2009).
21. Wang, R. & Tian, B.
APAlyzer: a bioinformatics package for analysis of alternative polyadenylation isoforms. Bioinformatics (Oxford, England). issn: 13674811 (2020).
22. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Research. issn: 10889051 (2012).
23. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. issn: 14602059 (2009).
24. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology. issn: 1474760X (2014).
25. Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. issn: 13624962 (2015).
26. Morgan, M., Obenchain, V., Hester, J. & Pagès, H. SummarizedExperiment: SummarizedExperiment container. R package version 1.12.0 (2018).
27. Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W. & Robinson, M. D. Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology. issn: 1474760X (2016).
28. Blair, J. D., Hockemeyer, D., Doudna, J. A., Bateup, H. S. & Floor, S. N. Widespread Translational Remodeling during Human Neuronal Differentiation. Cell Reports. issn: 22111247 (2017).
29. Whipple, A. J. et al. Imprinted Maternally Expressed microRNAs Antagonize Paternally Driven Gene Programs in Neurons. Molecular Cell 78, 85–95.e8. issn: 1097-2765. https://www.cell.com/molecular-cell/abstract/S1097-2765(20)30041-1 (2021) (Apr. 2020).
30. Fontes, M. M. et al. Activity-Dependent Regulation of Alternative Cleavage and Polyadenylation during Hippocampal Long-Term Potentiation. Scientific Reports. issn: 20452322 (2017).
31. Tiberi, S. & Robinson, M. D.
BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. Genome Biology. issn: 1474760X (2020).
32. Lawrence, M. et al. Software for Computing and Annotating Genomic Ranges. PLoS Computational Biology. issn: 1553734X (2013).
33. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. issn: 14602059 (2014).
34. Simes, R. J. An improved Bonferroni procedure for multiple tests of significance. Biometrika. issn: 00063444 (1986).
35. Šidák, Z. Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association. issn: 1537274X (1967).
36. Frazee, A. C., Jaffe, A. E., Langmead, B. & Leek, J. T. Polyester: Simulating RNA-seq datasets with differential transcript expression. Bioinformatics. issn: 14602059 (2015).
37. Stolarczyk, M., Reuter, V. P., Smith, J. P., Magee, N. E. & Sheffield, N. C. Refgenie: a reference genome resource manager. GigaScience.
10_1101-2021_02_12_430979 ---- StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs

Kévin Da Silva1,2,*, Nicolas Pons2, Magali Berland2, Florian Plaza Oñate2, Mathieu Almeida2, and Pierre Peterlongo1

1Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F-35000 Rennes, France
2Université Paris-Saclay, INRAE, MGP, 78350 Jouy-en-Josas, France

Corresponding author: *Kévin Da Silva
Email address: kevin.da-silva@inria.fr

ABSTRACT

Current studies are shifting from the use of single linear references to representation of multiple genomes organised in pangenome graphs or variation graphs. Meanwhile, in metagenomic samples, resolving strain-level abundances is a major step in microbiome studies, as associations between strain variants and phenotype are of great interest for diagnostic and therapeutic purposes.

We developed StrainFLAIR with the aim of showing the feasibility of using variation graphs for indexing highly similar genomic sequences up to the strain level, and for characterizing a set of unknown sequenced genomes by querying this graph.
On simulated data composed of mixtures of strains from the same bacterial species Escherichia coli, results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance. On a real dataset composed of a mix of several bacterial species and several strains for the same species, results show that in a more complex configuration StrainFLAIR correctly estimates the abundance of each strain. Hence, results demonstrated how graph representation of multiple close genomes can be used as a reference to characterize a sample at the strain level.

Availability: http://github.com/kevsilva/StrainFLAIR

INTRODUCTION

The use of reference genomes has shaped the way genomics studies are currently conducted. Reference genomes are particularly useful for reference-guided genomic assembly, variant calling or mapping sequencing reads. For the latter, they provide a unique coordinate system to locate variants, allowing researchers to work on the same reference and easily share information. However, the usage of reference genomes represented as flat sequences reaches some limits (Ballouz et al., 2019).

Close reference genomes or genomes of strains from the same species show a high sequence similarity. Mapping sequencing reads on similar reference genomes results in mis-mapped reads or ambiguous alignments, generating noise in the downstream analysis which has yet to be clarified (Na et al., 2016). This has led recent methods to provide a representation of multiple genomes as genome graphs, also called variation graphs, in which each path is a different known variation.
Such graph representations are well defined, and tools to build and manipulate graphs are under active development (Garrison et al., 2017; Kim et al., 2019; Rakocevic et al., 2019; Li et al., 2020).

This graph structure provides obvious advantages such as the reduction of data redundancy, while highlighting variations (Garrison et al., 2018). However, it also introduces novel difficulties. Updating a graph with novel sequences, adapting existing efficient algorithms for read mapping, and, mainly, developing new ways to analyse sequence-to-graph mapping results for downstream analyses are among those new challenges. The work presented here primarily focuses on this last point and proposes to show the feasibility of using a variation graph for identifying and estimating abundances, at the strain level, from an unknown metagenomic read set.

bioRxiv preprint, this version posted February 13, 2021; doi: https://doi.org/10.1101/2021.02.12.430979. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

In the context of metagenomics, representing genomes in graphs is of particular interest for indexing microorganism genomes. Microorganisms are predominant in almost every ecosystem, from ocean water (Sunagawa et al., 2015) to the human body (Clemente et al., 2012), and play major functional roles in them (New and Brito, 2020). While studies in microbial ecology are facing a bottleneck due to the difficulty of isolating and cultivating most of those microbes in the laboratory, preventing the analysis of the complex structure and dynamics of the microbial communities (Stewart, 2012), high-throughput sequencing in metagenomics offers the opportunity to study a whole ecosystem.
In particular, shotgun sequencing allows a resolution up to the species level (Jovel et al., 2016), and enables sample analysis in terms of population stratification, microbial diversity or biomarker identification (Quince et al., 2017).

The structure and dynamics of microbial communities are usually revealed by resolving the species present in samples and their relative abundances, which can then be associated with phenotypes, notably in the field of human health (Ehrlich, 2011; Vieira-Silva et al., 2020; Solé et al., 2021). Characterizing samples at the strain level is now of growing interest, as it may highlight new associations with phenotypes, and a better understanding of the functional impact of strains in host-microbe interactions is crucial to new therapeutic strategies and personalized medicine. Escherichia coli, which has a highly variable genome, is a well-known example since some strains are harmless commensals in the human gut microbiota while others are harmful pathogens (Rasko et al., 2008; Loman et al., 2013). Current approaches to handle multiple similar genomes, such as strains, use gene clustering and then select the representative sequence of each cluster. This gets rid of the redundancy but also of the variations, which are crucial to distinguish the strains of a species (Qin et al., 2010). Hence, indexation of a set of known strains is a good framework for testing the ability of a variation graph to capture the diversity while offering a way to correctly assign sequenced data to the strains they belong to.

In this work, we present StrainFLAIR, a novel method and its implementation that uses variation graph representation of gene sequences for strain identification and quantification. We propose novel algorithmic and statistical solutions for managing ambiguous alignments and computing an adequate abundance metric at the graph node level.
Results have shown that we could correctly identify and quantify strains present in a sample. Notably, we could also identify close strains not present in the reference. StrainFLAIR is available at http://github.com/kevsilva/StrainFLAIR.

METHODS

We propose here a description of our tool StrainFLAIR (STRAIN-level proFiLing using vArIation gRaph). This method exploits various state-of-the-art tools and proposes novel algorithmic solutions for indexing bacterial genomes at the strain level. It also allows querying metagenomes to assess and quantify their content with regard to the indexed genomes. An overview of the index and query pipelines is presented in Fig. 1.

The rationale for the choice of third-party tools and their detailed usage are given in Supplementary Materials, Section S1.1.

Indexing strains

Gene prediction

As non-coding DNA represents 15% on average of bacterial genomes and is not well characterized in terms of structure, StrainFLAIR focuses on protein-coding genes in order to characterize strains by their gene content and the nucleotide variations of these genes. Moreover, non-coding DNA regions can be highly variable (Thorpe et al., 2017) and taking into account complete genomes would then lead to highly complex graphs, and combinatorial explosions when mapping reads. Additionally, complete genomes are not always available. Focusing on the genes also allows the use of draft genomes and metagenome-assembled genomes, or of a pre-existing set of known genes (Qin et al., 2010; Li et al., 2014). Hence, StrainFLAIR indexes genes instead of complete genomes in graphs.

Genes are predicted using Prodigal, a tool for prokaryotic protein-coding gene prediction (Hyatt et al., 2010).

Because some reads map at the junction between a gene and an intergenic region, conserving only the gene sequences biases mapping results towards deletions and drastically lowers the mapping score.
In order to alleviate this situation, we extend the predicted gene sequences at both ends. Hence, StrainFLAIR conserves predicted genes plus their surrounding sequences. By default, and if the sequence is long enough, we conserve 75 bp on the left and on the right of each gene.

Figure 1. StrainFLAIR overview. a. Indexation. Input is a set of known reference genomes of various bacterial species and strains. StrainFLAIR uses a graph for indexing genes of those reference genomes. b. Read mapping on the previously mentioned graph. c. Mapped reads analysis. StrainFLAIR assigns and estimates species and strain abundances of a bacterial metagenomic sample represented as short reads.

Gene clustering

Genes are clustered into gene families using CD-HIT (Li and Godzik, 2006). For the clustering step, the genes without extensions are used in order to cluster strictly according to the exact gene sequences, excluding any part of the intergenic regions. CD-HIT-EST is used to perform the clustering with an identity threshold of 0.95 and a coverage of 0.90 on the shorter sequence. The local sequence identity is calculated as the number of identical bases in the alignment divided by the length of the alignment. Sequences are assigned to the best fitting cluster satisfying these requirements.

Graph construction

Each gene family is represented as a variation graph (Fig. 2). Variation graphs are bidirected DNA sequence graphs that represent multiple sequences, including their genetic variation.
Each node of the graph contains sub-sequences of the input sequences, and successive nodes draw paths on the graph. Paths corresponding to reference sequences are specifically called “colored paths”. Each colored path corresponds to the original sequence of a gene in the cluster.

Figure 2. Illustration of a variation graph structure and colored paths. Each node of the graph contains a sub-sequence of the input sequences and is integer-indexed. A path corresponding to an input sequence is called a colored path, and is encoded by its succession of node ids, e.g. 1,3,5,6 for the colored path 1 in this example.

In the case of a cluster composed of only one sequence, vg toolkit (Garrison et al., 2017) is used to convert the sequence into a flat graph. Alternatively, when a cluster is composed of two sequences or more, minimap2 (Li, 2018) is used to generate a multiple sequence alignment. Then seqwish (Garrison, 2021) is used to convert this multiple sequence alignment into a variation graph. All the graphs computed in this way (one per input cluster) are then concatenated to produce a single variation graph where each cluster of genes is a connected component.

The index is created once for a set of reference genomes. Afterward, any set of sequenced reads can be profiled at the strain level based on this index.

Querying variation graphs

Mapping reads

For mapping reads on the previously described reference graph, we use the sequence-to-graph mapper vg mpmap from vg toolkit.
It produces so-called “multipath alignments”. A multipath alignment is a graph of partial alignments and can be seen as a sub-graph (a subset of edges and vertices) of the whole variation graph (see Fig. 3 for an example). The mapping result describes, for each read, the nodes of the variation graph traversed by the alignment and the potential mismatches or indels between the read and the sequence of each traversed node.

Reads attribution

When mapping a read on a graph with colored paths, two key issues arise, as illustrated in Fig. 3. As mapping generates a sub-graph per mapped read, the most probable mapped path(s) have to be defined. Meanwhile, the most probable mapped path(s) corresponding to a colored path also have to be defined.

Hence we developed an algorithm to analyse and convert, when possible, a mapping result into one or several continuous path(s) (successive nodes joined by only one edge) per mapped read. In addition, we propose an algorithm to attribute such a path to the most probable colored path(s).

Path attribution

A breadth-first search on the multipath alignment is proposed. It starts at each node of the alignment with a user-defined threshold on the mapping score. A single path alignment with a mapping score below this threshold is ignored, and the single path alignment with the best mapping score is retained. Additionally, for each alignment, nodes are associated with a so-called “horizontal coverage” value. The horizontal coverage of a node by a read corresponds to the proportion of bases of the node covered by the read. Hence, a node has a horizontal coverage of 1 if all its nucleotides are covered by the read, with or without mismatches or indels.

Because of possible ties in mapping score, the search can result in multiple single path alignments, as illustrated in Fig. 3(A).
This situation corresponds to a read whose sequence is found in several different genes or to a read mapping onto the similar region of different versions of a gene.

To take into account ambiguous mapping assignments, as shown below, the parsing of the mapping output is decomposed into two steps. The first step processes the reads that mapped to a single colored path (called “unique mapped reads” here), corresponding to a single gene. The second step processes the reads with multiple alignments (called “multiple mapped reads” here).

Colored path attribution

Once a read is assigned to one or several path alignment(s), it still has to be attributed, if possible, to a colored path. The following process attributes each mapped read to a colored path, and various metrics for downstream analyses are computed. In particular, an absolute abundance for each node of the variation graph, called the “node abundance”, is computed, first focusing on unique mapped reads (first step). For a given alignment, the successive nodes composing the path are compared to the existing colored paths of the variation graph. If the alignment matches part of a colored path, the number of mapped reads on this path is incremented by one (i.e. reads raw count). The node abundance for each node of the alignment is incremented with its horizontal node coverage defined by this alignment. Alignments with no matching colored paths are skipped.

Then, we focus on multiple mapped reads (second step), as illustrated in Fig. 3(B). During this step, the alignment matches multiple colored paths. Hence, the abundance is distributed to each matching colored path according to the ratio between them. This ratio is determined from the reads raw count of each path from the first step.
Figure 3. Illustration of the multipath alignment concept and the read attribution process. (A) Path attribution. The region of the read in blue aligns unambiguously to a node of the graph while the dark and light red parts can either align to the top or the bottom nodes of their respective mapping localization (due to mismatches that can align on both nodes for example), drawing an alignment as a sub-graph of the reference variation graph, and thus opening the possibility of four single path alignments. (B) Colored path attribution. First, from the multipath alignment (all four read sub-paths), the breadth-first search finds the possible corresponding single path alignments while respecting the mapping score threshold imposed by the user. Here, for the example, all four possible paths are considered valid. Second, each single path is compared to the colored paths from the reference variation graph. Two single path alignments matched the colored paths (4-6-8 and 5-6-7). As it mapped equally to more than one colored path, this read falls in the multiple mapped reads case and is processed during the second step of the algorithm.

For example, if 70 unique mapped reads were found for path1 and 30 for path2 during the first step, a read matching ambiguously both path1 and path2 during the second step counts as 0.7 for path1 and 0.3 for path2.
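The two-step counting described above can be sketched in a few lines of Python. This is only an illustration and not the StrainFLAIR implementation: the function name `attribute_reads`, the input layout (one list of matched colored path names per read) and the decision to skip an ambiguous read when none of its paths has unique evidence are assumptions made for the example.

```python
from collections import defaultdict


def attribute_reads(read_matches):
    """Return per-colored-path read counts after a two-step attribution.

    read_matches: one list per read, naming the colored paths it matched
    (hypothetical input layout chosen for this sketch).
    """
    counts = defaultdict(float)

    # Step 1: unique mapped reads each add a full count to their single path.
    for paths in read_matches:
        if len(paths) == 1:
            counts[paths[0]] += 1.0
    unique_counts = dict(counts)  # frozen snapshot used as weights in step 2

    # Step 2: multiple mapped reads are split according to the step-1 ratio.
    for paths in read_matches:
        if len(paths) > 1:
            total = sum(unique_counts.get(p, 0.0) for p in paths)
            if total == 0:
                continue  # assumption: no unique evidence, the read is skipped
            for p in paths:
                counts[p] += unique_counts.get(p, 0.0) / total
    return dict(counts)


# 70 unique reads on path1, 30 on path2, and one read matching both.
reads = [["path1"]] * 70 + [["path2"]] * 30 + [["path1", "path2"]]
counts = attribute_reads(reads)
print(counts)
```

With this input the ambiguous read contributes 0.7 to path1 and 0.3 to path2, as in the example given in the text. The same weights would be applied to the horizontal coverage of each node to update the node abundances.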
This ratio is applied to increment both the raw count of reads and the coverage of the nodes.

Gene-level and strain-level abundances

StrainFLAIR output is decomposed into an intermediate result describing the queried sample and gene-level abundances, and the final result describing the strain-level abundances.

Gene-level

After parsing the mapping result, the first output provides information for each colored path, i.e. each version of a gene. Thereby, this first result provides gene-level information including abundances. An exhaustive description of these intermediate results is provided in Section S1.2 in Supplementary Materials. We describe here three major metrics output by StrainFLAIR:

The mean abundance of the nodes composing the path. Instead of solely counting reads, we make full use of the graph structure and we propose an abundance computation for each node as previously explained, and as already done for haplotype resolution (Baaijens et al., 2019). Hence, for each colored path, the gene abundance is estimated by the mean of the node abundances.

In order not to underestimate the abundance in case of a lack of sequencing depth (which could result in certain nodes not being traversed by sequencing reads), the mean abundance without the nodes of the path never covered by a read is also output.

The mean abundances with and without these non-covered nodes are computed using unique mapped reads only or all mapped reads.

The ratio of covered nodes, defined as the proportion of nodes from the path whose abundance is strictly greater than zero.

Strain-level

Strain-level abundances are then obtained by exploiting the specific genes of each reference genome from these intermediate results. First, for each genome, the proportion of detected genes is computed, as the proportion of specific genes on which at least one read maps. Then, the global abundance of the genome is computed as the mean or median of all its specific gene abundances. However, if the proportion of detected genes is less than a user-defined threshold, the genome is considered absent and hence its abundance is set to zero.

StrainFLAIR final output is a table where each line corresponds to one of the reference genomes, containing in columns the proportion of detected specific genes, and our proposed metrics to estimate their abundances (using mean or median, with or without never covered nodes as described for the gene-level result).

Results presented in Section S1.3 in Supplementary Materials validate and motivate the proposed abundance metric by comparing it to the expected abundances and other estimations using linear models.

RESULTS

We validated our method on both a simulated and a real dataset. All computations were performed using StrainFLAIR, version 0.0.1, with default parameters.
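The strain-level aggregation described in Methods (proportion of detected specific genes, detection threshold, then mean or median of the specific gene abundances) can be sketched as follows. This is a minimal illustration under stated assumptions, not the StrainFLAIR code: the function names, the input layout, and the use of "abundance greater than zero" as a stand-in for "at least one read maps on the gene" are hypothetical, and the conversion to percentages by dividing by the total is likewise an assumption.

```python
from statistics import mean, median


def strain_abundance(gene_abundances, min_detected=0.5, use_median=False):
    """Return (proportion of detected specific genes, strain abundance)."""
    if not gene_abundances:
        return 0.0, 0.0
    # Assumption: a specific gene counts as detected when its abundance is > 0.
    detected = sum(1 for a in gene_abundances if a > 0)
    proportion = detected / len(gene_abundances)
    if proportion < min_detected:
        # Below the threshold the genome is considered absent.
        return proportion, 0.0
    aggregate = median if use_median else mean
    return proportion, aggregate(gene_abundances)


def relative_abundances(abundances):
    """Convert absolute strain abundances to percentages (assumed normalisation)."""
    total = sum(abundances.values())
    if total == 0:
        return {name: 0.0 for name in abundances}
    return {name: 100.0 * value / total for name, value in abundances.items()}


# Toy specific-gene abundances for three hypothetical reference strains.
specific_genes = {
    "strain_A": [4.0, 6.0, 5.0, 5.0],  # all genes detected
    "strain_B": [2.0, 2.0, 0.0, 4.0],  # 3 of 4 genes detected
    "strain_C": [0.0, 0.0, 0.0, 3.0],  # 1 of 4 genes detected, below 50%
}
absolute = {name: strain_abundance(genes)[1] for name, genes in specific_genes.items()}
relative = relative_abundances(absolute)
print(absolute)
print(relative)
```

In this toy example strain_C has reads on one specific gene but is reported as absent, which is the behaviour the threshold is designed to produce for strains that only attract a few spurious reads.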
The relative abundance estimation was based on the mean of the specific gene abundances, computed by taking into account all the nodes (including non-covered nodes), and using a threshold on the proportion of detected specific genes of 50%.

Results were compared to Kraken2 (Wood et al., 2019), considered one of the state-of-the-art tools dedicated to the characterization of read set content, and based on flat sequences as references. Read counts given by Kraken2 were normalized by the genome length and converted into relative abundances.

Computing setup and performances are indicated in Supplementary Materials, Section S1.4.

Validation on a simulated dataset

We first validated our method on simulated data, focusing on a single species with multiple strains. Our aim was to validate the ability of StrainFLAIR to identify and quantify strains given sequencing data from a mixture of several strains of uneven abundances, and with one of them absent from the index.

Reference variation graph

We selected complete genomes of Escherichia coli, a predominant aerobic bacterium in the gut microbiota (Tenaillon et al., 2010), and a species known for its phenotypic diversity (pathogenicity, antibiotic resistance) mostly resulting from its high genomic variability (Dobrindt, 2005).

Eight strains of E. coli were selected for this experiment from the NCBI (footnote 1). Seven were used to construct a variation graph (E. coli IAI39, O104:H4 str. 2011C-3493, str. K-12 substr. MG1655, SE15, O157:H16 str. Santai, O157:H7 str. Sakai, O26 str. RM8426), and one was used as an unknown strain in a strain mixture (E. coli BL21-DE3).

Mixtures and sequencing simulations

Our aim was to simulate the co-presence of several E. coli strains. Two simulations with sequencing errors were conducted in order to highlight the detection and quantification of strains in a mixture.
For each one, we tested our approach with various read coverages, as described below.

We simulated the sequencing of three strains to mimic complex single species composition in metagenomic samples. One of the strains was either in equal abundance to one of the two others, potentially making it more difficult to distinguish, or in lower abundance, potentially making it more difficult to detect at all. The first simulation was a mixture composed of three strains present in the reference graph: E. coli O104:H4 2011C-3493, IAI39, and K-12 MG1655. The second simulation was a mixture composed of three strains: E. coli O104:H4 2011C-3493, IAI39, and BL21-DE3, the latter being absent from the reference variation graph, thus simulating a new strain to be identified and quantified.

Footnote 1: https://www.ncbi.nlm.nih.gov/genome/?term=txid562[orgn]

For both simulations, short sequencing reads of 150 bp were simulated using vg sim from vg toolkit with a probability of errors set to 0.1%: 300,000 reads for E. coli O104:H4 2011C-3493 (representing ≈8.5x), 200,000 reads for E. coli IAI39 (representing ≈5.8x).
For both simulations, various quantities of reads were generated for K-12 MG1655 or BL21-DE3: 200,000, 100,000, 50,000, 25,000, 10,000, 5,000 or 1,000 reads, representing approximately 6.5x, 3x, 1.6x, 0.8x, 0.3x, 0.2x, and 0.03x respectively for these two strains.

Strain-level abundances

As explained in Methods, we computed the strain-level abundances using the specific gene-level abundance table obtained by mapping the simulated reads onto the variation graph. We compared our results to the expected simulated relative abundances.

Table 1. Reference strains relative abundances expected and computed by StrainFLAIR or Kraken2 for each simulated experiment with variable coverage of the K-12 MG1655 strain. Best results are shown in bold. Complete results are presented in Section S1.6 in Supplementary Materials.

#reads K-12  Method       O104:H4  IAI39  K-12   Sakai  SE15  Santai  RM8426
1,000        Expected     59.88    39.92  0.2    0      0     0       0
1,000        StrainFLAIR  56.45    43.55  0      0      0     0       0
1,000        Kraken2      38.91    60.72  0.22   0.04   0.07  0.03    0.02
25,000       Expected     57.14    38.1   4.76   0      0     0       0
25,000       StrainFLAIR  52.1     40.58  7.32   0      0     0       0
25,000       Kraken2      37.23    58.1   4.51   0.04   0.07  0.03    0.02
200,000      Expected     42.86    28.57  28.57  0      0     0       0
200,000      StrainFLAIR  38.12    29.83  32.05  0      0     0       0
200,000      Kraken2      28.31    44.18  27.35  0.04   0.08  0.03    0.02

Simulation 1: mixtures with K-12 MG1655, present in the reference graph

StrainFLAIR successfully estimated the relative abundances of the three strains present in the mixture (Table 1); the sum of squared errors between the estimation given by our tool and the expected relative abundance was between 25 and 45 for all the experiments. However, it did not detect the very low abundant strain in the case of the mixture with 1,000 simulated reads for K-12 MG1655 (coverage of ≈0.03x). With our methodology, the threshold on the proportion of detected genes (see Methods) leads to setting the relative abundance of likely absent strains to zero.
This reduces both the underestimation of the relative abundances of the present strains and the overestimation of the absent strains.

In comparison, Kraken2 did not provide this resolution. Applied to our simulated mixtures, while Kraken2 was slightly better for K-12 MG1655 abundance estimation, it overestimated the relative abundance of IAI39 and underestimated that of O104:H4, leading to an overall higher sum of squared errors (between 456 and 872) compared to the expected abundances. Moreover, it assigned non-zero relative abundances to all seven reference strains whereas four of them were absent from the mixture. This was expected as some reads (from intergenic regions for example) can randomly be similar to regions of genes from absent strains.

Simulation 2: mixtures with BL21-DE3, absent from the reference graph

Here, BL21-DE3 was considered an unknown strain, not contributing to the variation graph. The closest strain to BL21-DE3 in the graph, according to fastANI (Jain et al., 2018), was K-12 MG1655 (98.9% identity, see Supplementary Materials, Section S1.5). Thus we expected to find the signal of BL21-DE3 through the results on K-12 MG1655.

As with the K-12 MG1655 mixtures, StrainFLAIR successfully estimated the relative abundances of the two known strains present in the mixture (Table 2); the sum of squared errors between the estimation given by our tool and the expected relative abundance was between 22 and 180 for all the experiments. Labelled as K-12, it also gave close estimations for BL21-DE3.
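The sums of squared errors quoted for the simulated mixtures can be recomputed from the table rows. The sketch below uses the 1,000-read row of Table 1 and assumes that the expected relative abundances are simply the proportions of simulated reads per strain, which reproduces the Expected row of that table; the function names are illustrative only.

```python
def expected_relative_abundances(read_counts):
    """Percentage of simulated reads per strain (assumed definition of 'expected')."""
    total = sum(read_counts.values())
    return {strain: 100.0 * n / total for strain, n in read_counts.items()}


def sum_squared_errors(estimated, expected):
    """Sum of squared differences over all strains; missing strains count as 0."""
    strains = set(estimated) | set(expected)
    return sum((estimated.get(s, 0.0) - expected.get(s, 0.0)) ** 2 for s in strains)


# Simulation 1 with 1,000 reads for K-12 MG1655 (read counts from the text).
read_counts = {"O104:H4": 300_000, "IAI39": 200_000, "K-12": 1_000}
expected = expected_relative_abundances(read_counts)

# Estimates copied from the 1,000-read rows of Table 1.
strainflair = {"O104:H4": 56.45, "IAI39": 43.55, "K-12": 0.0}
kraken2 = {
    "O104:H4": 38.91, "IAI39": 60.72, "K-12": 0.22,
    "Sakai": 0.04, "SE15": 0.07, "Santai": 0.03, "RM8426": 0.02,
}

sse_strainflair = sum_squared_errors(strainflair, expected)
sse_kraken2 = sum_squared_errors(kraken2, expected)
print(round(sse_strainflair), round(sse_kraken2))
```

For this row the two values come out close to 25 and 872, which are the lower bound reported for StrainFLAIR and the upper bound reported for Kraken2 in Simulation 1.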
Again, it did not detect the very low abundant strain in the case of the mixtures with 1,000, 5,000, and 10,000 simulated reads for BL21-DE3.

As in the K-12 MG1655 mixture experiments, Kraken2 overestimated the relative abundance of IAI39 and underestimated that of O104:H4 (sum of squared errors between 751 and 873), even less precisely than in the previous experiment. With sufficient coverage (here from 0.8x for BL21-DE3), StrainFLAIR was closer to the expected values for all the reference strains than Kraken2.

Table 2. Reference strain relative abundances expected and computed by StrainFLAIR or Kraken2 for each simulated experiment with variable coverage of the BL21-DE3 strain, absent from the reference variation graph. BL21-DE3 strain expected abundances are given in parentheses in the K-12 column. Best results are shown in bold. Complete results are presented in Section S1.6 in Supplementary Materials.

#reads BL21-DE3  Method       O104:H4  IAI39  K-12     Sakai  SE15  Santai  RM8426
1,000            Expected     59.88    39.92  (0.2)    0      0     0       0
1,000            StrainFLAIR  56.47    43.53  0        0      0     0       0
1,000            Kraken2      38.93    60.76  0.11     0.05   0.08  0.04    0.03
25,000           Expected     57.14    38.1   (4.76)   0      0     0       0
25,000           StrainFLAIR  54.09    41.71  4.2      0      0     0       0
25,000           Kraken2      37.75    58.93  2.16     0.28   0.34  0.25    0.29
200,000          Expected     42.86    28.57  (28.57)  0      0     0       0
200,000          StrainFLAIR  46.95    35.34  17.72    0      0     0       0
200,000          Kraken2      31.14    48.83  13.53    1.57   1.67  1.58    1.68

Interestingly, the proportion of detected specific genes for each strain (Fig. 4) seems to highlight a pattern that distinguishes present strains, absent strains and likely new strains close to the reference in the graph. According to the experiments with enough coverage (from 25,000 simulated reads for BL21-DE3), three groups of proportions could be observed: a proportion of almost 100% (O104:H4 and IAI39: strains present in the mixtures and in the reference graph), a proportion under 30-35% (Sakai, SE15, Santai, and RM8426: strains absent from the mixtures), and an in-between proportion around 60-70% for K-12 MG1655 (closest strain to BL21-DE3).

It was expected that an absent strain would have specific genes detected, as StrainFLAIR considers a gene detected as soon as a single read maps on it. However, all absent strains had a proportion of around 30% except K-12 MG1655, whose proportion was twice as high. Together with the non-null abundance estimated for the reference K-12 MG1655, this suggests the presence of a new strain whose genome is highly similar to K-12 MG1655.

Validation on a real dataset

We used a mock dataset available on the EBI-ENA repository under accession number PRJEB42498, in order to validate our method on real sequencing data from samples composed of various species and strains. The mock dataset is composed of 91 strains of bacterial species for which complete genomes or sets of contigs are available, including plasmids. Among the species, two were each represented by two different strains. Three mixes had been generated from the mock, and we used “Mix1A” in the following results.

Even though 20 out of 91 strains were absent in this mix, we indexed the full set of 91 genomes. This was done in order to mimic a classical StrainFLAIR use case where the queried data is mainly unknown, and the reference graph contains species or strains not existing in these queried data.
The metagenomic sample was sequenced using Illumina HiSeq 3000 technology, resulting in 21,389,196 short paired-end reads.

We compared our results to the expected abundance of each strain in the sample, defined as the theoretical experimental DNA concentration proportion. It should therefore be noted that potential contamination and/or experimental bias could have occurred and affected the expected abundances.

Strain detection

Among the 91 strains used in the reference variation graph, StrainFLAIR detected 65 strains. All of these 65 strains were indeed sequenced in Mix1A. Hence, StrainFLAIR produced no false positives.

Of the 26 strains considered absent by StrainFLAIR, 20 were not present in the sample (true negatives) and 6 should have been detected (false negatives). However, the term false negative should be softened, as the ground truth remains uncertain. All 6 undetected strains had a theoretical abundance below 0.1%.

Figure 4. Proportion of detected specific genes for each simulated experiment with variable coverage of the BL21-DE3 strain, absent from the reference graph.

More precisely, among the 6 strains undetected by StrainFLAIR, 5 had some detected genes, but below the 50% threshold. In this case, by default, StrainFLAIR discards these strains. Finally, only one of the undetected strains (Desulfovibrio desulfuricans ND 132) should theoretically have been detected (even if its expected coverage was below 0.1%), but no specific gene was identified.
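The default filtering on the proportion of detected specific genes can be illustrated with a short sketch. This is not StrainFLAIR's actual code: the function names, the example counts, and the choice of an inclusive comparison at the 50% threshold are assumptions made for illustration only.

```python
def proportion_detected(detected_specific_genes: int, total_specific_genes: int) -> float:
    """Fraction of a strain's specific genes with at least one mapped read."""
    if total_specific_genes <= 0:
        return 0.0
    return detected_specific_genes / total_specific_genes


def is_strain_detected(detected_specific_genes: int,
                       total_specific_genes: int,
                       threshold: float = 0.5) -> bool:
    """Keep a strain only if enough of its specific genes are detected.

    Assumption: the comparison is inclusive (>=); the text only states that
    strains below the 50% threshold are discarded by default.
    """
    return proportion_detected(detected_specific_genes, total_specific_genes) >= threshold


# Hypothetical counts: a strain with 30 of 100 specific genes detected is
# discarded, while one with 65 of 100 is kept.
low_coverage_strain_kept = is_strain_detected(30, 100)
well_covered_strain_kept = is_strain_detected(65, 100)
```

A strain that is discarded this way has its abundance set to zero, which is why strains with very few mapped reads are reported as absent rather than as low-abundance.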
Considering that StrainFLAIR uses a permissive definition of a detected gene (at least one read maps to the gene), having strictly no specific gene detected for Desulfovibrio desulfuricans ND 132 suggests that this strain might in fact be absent from Mix1A. This is also supported by the result from Kraken2, which estimated a relative abundance of ≈ 9e−5, almost 500 times lower than the theoretical value.

As in the validation on the simulated dataset, Kraken2 assigned non-zero abundances to all the references and thus could not be used to conclude definitively on the presence or absence of strains in the sample.

Strain relative abundances

For the estimated relative abundances, StrainFLAIR's results were closer to those of the state-of-the-art tool Kraken2 than to the experimental values (Fig. 5). The sum of squared errors between StrainFLAIR and Kraken2 was around 11. Compared to the experimental values, StrainFLAIR and Kraken2 performed similarly, with sums of squared errors of around 209 and 211, respectively.

Interestingly, Thermotoga petrophila RKU-1 is the only case where the results from StrainFLAIR and Kraken2 differ greatly, with, in addition, the theoretical abundance lying in between. Moreover, Thermotoga sp. RQ2 is the strain expected to be absent to which Kraken2 assigns the highest relative abundance among the expected absent strains, and the only one exceeding the relative abundances of two present strains. Considering the previous results on the simulated mixtures and that Thermotoga
petrophila RKU-1 and Thermotoga sp. RQ2 are close species (fastANI around 96.6%), it could be an additional indicator of how tools like Kraken2 can be misled by species or strains that are too close.

In the sample, the species Methanococcus maripaludis was represented by two strains (S2 and C5), and likewise the species Shewanella baltica (OS223 and OS185). StrainFLAIR successfully distinguished the strains of these two species and estimated the relative abundance of each. In this particular situation, and contrary to the results on E. coli strains, Kraken2 was also able to correctly estimate the abundances.

Figure 5. Experimental relative abundance compared to relative abundance as computed by StrainFLAIR and Kraken2. A selection of relevant results is shown here; see Supplementary Materials (Section S1.7) for the complete results. (A) Represents a case where StrainFLAIR and Kraken2 give similar results to the experimental value (18 cases out of 91). (B) Represents a case where StrainFLAIR and Kraken2 give similar results, but lower than the experimental value (26 cases out of 91). (C) Represents a case where StrainFLAIR and Kraken2 give similar results, but greater than the experimental value (16 cases out of 91). (D, E, F, G) Represent the two species represented by two strains each. (H, I) Represent two atypical cases.

DISCUSSION

Recent advances in sequencing technologies have provided large reference genome resources. The representation and integration of those multiple genomes, which are often highly similar, are under active development and have led to genome-graph-based tools.
Integrating multiple genomes from the same species is particularly interesting, as it provides new opportunities to characterize strains, a key level of resolution that, for instance, opens up the field of precision medicine (Albanese and Donati, 2017; Marchesi et al., 2016).

In this context, we developed StrainFLAIR, a new computational approach for strain-level profiling of metagenomic samples, using variation graphs to represent all reference genomes. Our intention was, on the one hand, to test whether indexing highly similar genomes in a graph makes it possible to characterize queried samples at the strain level and, on the other hand, to provide an end-user tool able to index genomes and query reads, including the analysis of mapping results.

The method combines state-of-the-art tools with novel algorithmic and statistical solutions. By indexing microbial species and/or strains in a graph, it enables the identification and quantification of strains from a sequenced sample mapped onto this graph.

We have demonstrated on simulated and real datasets the ability of our method to identify microbial strains and correctly estimate their abundance in metagenomic samples.
In addition, StrainFLAIR was able to highlight the presence of, and estimate a relative abundance for, a strain similar to existing references but absent from them.

We also showed that StrainFLAIR tended to set the predicted abundance of low-abundance strains to zero, while a tool like Kraken2 was able to detect them. As a result, StrainFLAIR seems to lose the ability to detect very low-abundance strains. However, in our simulations, this situation corresponded to coverages of 0.03x or less, hence simulating a strain for which not all genomic content was present. Ultimately, it might be more relevant to define such a strain as absent. Overall, there is a need to distinguish between low-abundance strains, insufficient sequencing depth, and reads from intergenic regions or from other genes that randomly match genes. In this regard, StrainFLAIR integrates a threshold on the proportion of specific genes detected, which can be further explored to refine which strain abundances are set to zero. Importantly, results also showed that our graph-based tool made no false positive calls, contrary to the general-purpose tool Kraken2, which detected 100% of the strains that were indexed but absent from the queried reads.

From the validation on real datasets, we showed that StrainFLAIR was still able to correctly estimate the relative abundances in a more complex context mixing both different species and different strains, without being biased by references absent from the sample.

Our methodology, which takes into account all mapped reads and imposes a threshold that sets some strain abundances to zero, seems more adequate and closer to what is expected in reality. Moreover, being able to identify some queried strains as absent is particularly interesting in the metagenomics context. Unlike mock datasets, which are of controlled and known composition, no prior knowledge is available for real metagenomic samples.
They require the most exhaustive references, including unnecessary genomes, that is, strains absent from the sample. StrainFLAIR is a new step towards the objective of taking those unnecessary genomes into account without biasing the downstream analysis.

Measured computation times show that StrainFLAIR can analyse millions of reads in a few hours. Even if this opens the door to routine analyses of small read sets, new development efforts will be needed to reduce computation time in order to scale up to very large datasets.

While StrainFLAIR focuses on gene-based profiling of metagenomic samples at the strain level, it opens the way to pangenomic studies. Genome graphs are used to capture all the information on variation or similarity of sequences, which is particularly suited to representing the diversity of the gene repertoire and the set of nucleotide variations found between the different genomes of a species. This work highlights the importance of continuing to work on pangenome graph representation.

The presence of unknown queried strain(s) is revealed both by reads mapping to non-colored paths and by the amount of nucleotide variation (indels and substitutions). The natural continuation will be the dynamic update of the graph when novel strains are detected in this way. This dynamic updating will also be particularly useful considering the future flow of newly sequenced metagenomes and the development of clinical metagenomics, which will help to quickly and efficiently characterize in silico emerging strains of interest for human health.

ACKNOWLEDGMENTS

This work used the GenOuest bioinformatics core facility (https://www.genouest.org). We acknowledge Mircea Podar for providing premium access to the mock dataset.
Finally, we thank Mahendra Mariadassou, Rayan Chikhi, Olivier Jaillon and David Vallenet for all their advice throughout this work.

REFERENCES

Albanese, D. and Donati, C. (2017). Strain profiling and epidemiology of bacterial species from metagenomic sequencing. Nature Communications, 8(1):1–14.

Baaijens, J. A., der Roest, B. V., Köster, J., Stougie, L., and Schönhuth, A. (2019). Full-length de novo viral quasispecies assembly through variation graph construction. bioRxiv, page 287177.

Ballouz, S., Dobin, A., and Gillis, J. (2019). Is it time to change the reference genome? bioRxiv, page 533166.

Clemente, J. C., Ursell, L. K., Parfrey, L. W., and Knight, R. (2012). The impact of the gut microbiota on human health: An integrative view.

Dobrindt, U. (2005). (Patho-)Genomics of Escherichia coli.

Ehrlich, S. D. (2011). MetaHIT: The European Union project on metagenomics of the human intestinal tract. In Metagenomics of the Human Body, pages 307–316. Springer New York.

Garrison, E. (2021). ekg/seqwish: alignment to variation graph inducer. https://github.com/ekg/seqwish.

Garrison, E., Novak, A., Hickey, G., Eizenga, J., Dawson, E., Jones, W., Buske, O., and Lin, M. (2017). Sequence variation aware references and read mapping with vg: the variation graph toolkit. bioRxiv.

Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., Paten, B., and Durbin, R. (2018).
Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119.

Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T., and Aluru, S. (2018). High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9(1):1–8.

Jovel, J., Patterson, J., Wang, W., Hotte, N., O’Keefe, S., Mitchel, T., Perry, T., Kao, D., Mason, A. L., Madsen, K. L., and Wong, G. K. (2016). Characterization of the gut microbiome using 16S or shotgun metagenomics. Frontiers in Microbiology, 7(APR):459.

Kim, D., Paggi, J. M., Park, C., Bennett, C., and Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37(8):907–915.

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100.

Li, H., Feng, X., and Chu, C. (2020). The design and construction of reference pangenome graphs with minigraph. Genome Biology, 21(1):265.

Li, J., Wang, J., Jia, H., Cai, X., Zhong, H., Feng, Q., Sunagawa, S., Arumugam, M., Kultima, J. R., Prifti, E., Nielsen, T., Juncker, A. S., Manichanh, C., Chen, B., Zhang, W., Levenez, F., Wang, J., Xu, X., Xiao, L., Liang, S., Zhang, D., Zhang, Z., Chen, W., Zhao, H., Al-Aama, J. Y., Edris, S., Yang, H., Wang, J., Hansen, T., Nielsen, H. B., Brunak, S., Kristiansen, K., Guarner, F., Pedersen, O., Doré, J., Ehrlich, S. D., and Bork, P. (2014). An integrated catalog of reference genes in the human gut microbiome. Nature Biotechnology, 32(8):834–841.

Li, W. and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Bioinformatics, 22(13):1658–1659.

Loman, N. J., Constantinidou, C., Christner, M., Rohde, H., Chan, J. Z.-M., Quick, J., Weir, J. C., Quince, C., Smith, G. P., Betley, J. R., Aepfelbacher, M., and Pallen, M. J. (2013). A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104:H4. JAMA, 309(14):1502.

Marchesi, J. R., Adams, D. H., Fava, F., Hermes, G. D., Hirschfield, G. M., Hold, G., Quraishi, M. N., Kinross, J., Smidt, H., Tuohy, K. M., Thomas, L. V., Zoetendal, E. G., and Hart, A. (2016). The gut microbiota and host health: A new clinical frontier. Gut, 65(2):330–339.

Na, J. C., Kim, H., Park, H., Lecroq, T., Léonard, M., Mouchard, L., and Park, K. (2016). FM-index of alignment: A compressed index for similar strings. Theoretical Computer Science, 638:159–170.

New, F. N. and Brito, I. L. (2020). What Is Metagenomics Teaching Us, and What Is Missed?

Paten, B., Eizenga, J. M., Rosen, Y. M., Novak, A. M., Garrison, E., and Hickey, G. (2018). Superbubbles, Ultrabubbles, and Cacti. In Journal of Computational Biology, volume 25, pages 649–663. Mary Ann Liebert Inc.

Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.-M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H.
B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., MetaHIT Consortium, Bork, P., Ehrlich, S. D., and Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65.

Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J., and Segata, N. (2017). Shotgun metagenomics, from sampling to analysis.

Rakocevic, G., Semenyuk, V., Lee, W. P., Spencer, J., Browning, J., Johnson, I. J., Arsenijevic, V., Nadj, J., Ghose, K., Suciu, M. C., Ji, S. G., Demir, G., Li, L., Toptaş, B., Dolgoborodov, A., Pollex, B., Spulber, I., Glotova, I., Kómár, P., Stachyra, A. L., Li, Y., Popovic, M., Källberg, M., Jain, A., and Kural, D. (2019). Fast and accurate genomic analyses using genome graphs. Nature Genetics, 51(2):354–362.

Rasko, D. A., Rosovitz, M. J., Myers, G. S., Mongodin, E. F., Fricke, W. F., Gajer, P., Crabtree, J., Sebaihia, M., Thomson, N. R., Chaudhuri, R., Henderson, I. R., Sperandio, V., and Ravel, J. (2008). The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and pathogenic isolates. Journal of Bacteriology, 190(20):6881–6893.

Solé, C., Guilly, S., Da Silva, K., Llopis, M., Le-Chatelier, E., Huelin, P., Carol, M., Moreira, R., Fabrellas, N., De Prada, G., Napoleone, L., Graupera, I., Pose, E., Juanola, A., Borruel, N., Berland, M., Toapanta, D., Casellas, F., Guarner, F., Doré, J., Solà, E., Ehrlich, S. D., and Ginès, P. (2021). Alterations in Gut Microbiome in Cirrhosis as Assessed by Quantitative Metagenomics: Relationship With Acute-on-Chronic Liver Failure and Prognosis. Gastroenterology, 160(1):206–218.e13.

Stewart, E. J. (2012). Growing unculturable bacteria.

Sunagawa, S., Coelho, L.
P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., Djahanschiri, B., Zeller, G., Mende, D. R., Alberti, A., Cornejo-Castillo, F. M., Costea, P. I., Cruaud, C., D’Ovidio, F., Engelen, S., Ferrera, I., Gasol, J. M., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B. T., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Boss, E., Follows, M., Karp-Boss, L., Krzic, U., Reynaud, E. G., Sardet, C., Sieracki, M., Velayoudon, D., Bowler, C., De Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M. B., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S. G., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).

Tenaillon, O., Skurnik, D., Picard, B., and Denamur, E. (2010). The population genetics of commensal Escherichia coli.

Thorpe, H. A., Bayliss, S. C., Hurst, L. D., and Feil, E. J. (2017). Comparative analyses of selection operating on nontranslated intergenic regions of diverse bacterial species. Genetics, 206(1):363–376.

Vieira-Silva, S., Falony, G., Belda, E., Nielsen, T., Aron-Wisnewsky, J., Chakaroun, R., Forslund, S. K., Assmann, K., Valles-Colomer, M., Nguyen, T. T. D., Proost, S., Prifti, E., Tremaroli, V., Pons, N., Le Chatelier, E., Andreelli, F., Bastard, J. P., Coelho, L. P., Galleron, N., Hansen, T. H., Hulot, J. S., Lewinter, C., Pedersen, H. K., Quinquis, B., Rouault, C., Roume, H., Salem, J. E., Søndertoft, N. B., Touch, S., Alves, R., Amouyal, C., Galijatovic, E. A. A., Barthelemy, O., Batisse, J. P., Berland, M., Bittar, R., Blottière, H., Bosquet, F., Boubrit, R., Bourron, O., Camus, M., Cassuto, D., Ciangura, C., Collet, J. P., Dao, M.
C., Debedat, J., Djebbar, M., Doré, A., Engelbrechtsen, L., Fellahi, S., Fromentin, S., Giral, P., Graine, M., Hartemann, A., Hartmann, B., Helft, G., Hercberg, S., Hornbak, M., Isnard, R., Jaqueminet, S., Jørgensen, N. R., Julienne, H., Justesen, J., Kammer, J., Kerneis, M., Khemis, J., Krarup, N., Kuhn, M., Lampuré, A., Lejard, V., Levenez, F., Lucas-Martini, L., Massey, R., Maziers, N., Medina-Stamminger, J., Moitinho-Silva, L., Montalescot, G., Moutel, S., Le Pavin, L. P., Poitou-Bernert, C., Pousset, F., Pouzoulet, L., Schmidt, S., Silvain, J., Svendstrup, M., Swartz, T., Vanduyvenboden, T., Vatier, C., Verger, E., Walther, S., Dumas, M. E., Ehrlich, S. D., Galan, P., Gøtze, J. P., Hansen, T., Holst, J. J., Køber, L., Letunic, I., Nielsen, J., Oppert, J. M., Stumvoll, M., Vestergaard, H., Zucker, J. D., Bork, P., Pedersen, O., Bäckhed, F., Clément, K., and Raes, J. (2020). Statin therapy is associated with lower prevalence of gut microbiota dysbiosis. Nature, 581(7808):310–315.

Wood, D. E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257.

S1 SUPPLEMENTARY MATERIALS

S1.1 Third-party tools usage and rationale

We present here the motivations for, and the precise usage of, the third-party tools that are employed in StrainFLAIR.

S1.1.1 Graph construction

The vg toolkit makes it possible to modify the graph, including a normalization step.
Normalization consists in deleting redundant nodes (nodes containing the same sub-sequence and having the same parent and child nodes), removing edges that do not introduce new paths, and merging nodes separated by only one edge. For each cluster, the graph is normalized provided that the colored paths of the resulting graph still describe their respective input sequences.

After the concatenation of all computed graphs (one for each cluster), the final single variation graph is indexed using the vg toolkit. Indexing a graph allows fast querying of the graph when mapping reads. Indexing uses two file formats: XG, a succinct graph index that provides a static index of the nodes, edges and paths of a variation graph, and GCSA, a generalization of the FM-index to directed acyclic graphs. A SNARLS file is also generated, describing snarls (a generalization of the superbubble concept (Paten et al., 2018)) in the variation graph and similarly allowing faster querying.

S1.1.2 Mapping reads

The vg toolkit offers two sequence-to-graph mappers. The first one, vg map, outputs one or several final paths for each alignment. However, in the case of several alignments with equal mapping scores, only one is chosen, at random. In order to get more exhaustive and accurate results, StrainFLAIR uses vg mpmap to map reads onto the variation graph.

The mapping results are given in GAMP format and then converted into JSON format with the vg toolkit, describing, for each read, the nodes of the graph traversed by the alignment.

S1.2 Gene-level output by StrainFLAIR

Here we present an exhaustive description of the information provided by StrainFLAIR at the gene level (before strain-level computations). For each colored path, StrainFLAIR provides the following items:

• The corresponding gene identifier.
• For each reference genome, the number of copies of the gene.
  Since each unique version of a gene is represented once in the graph, whereas it can exist in several copies in the genome (duplicate genes), the counts and abundances computed correspond to the sum over those copies. Keeping track of the number of copies is important to normalize the counts.
• The cluster identifier to which the colored path belongs.
• For uniquely mapped reads: their raw number and their number normalized by the sequence length (see Section Querying variation graphs in Methods).
• For uniquely plus multiple mapped reads: their raw number and their number normalized by the sequence length (see Section Querying variation graphs in Methods).
• The mean abundance of the nodes composing the path, as defined in the manuscript.
• The mean abundance without the nodes of the path never covered by a read, as defined in the manuscript.
• The ratio of covered nodes, as defined in the manuscript.

S1.3 Abundance metrics validation

The output of StrainFLAIR provides several metrics to estimate the abundance of the genes detected in the sample.

For validation, we used a combination of a LASSO (least absolute shrinkage and selection operator) model and a linear model on the simulated dataset to estimate the abundances at the strain level, as the abundance of a gene is a linear combination of the abundances of the strains it belongs to. As such,
we expect no intercept value for those models and have forced the intercept to zero in the following modeling.

First, a LASSO model was used to perform strain selection. The response variable of the model was the presence or absence of the genes according to the selected metric, while the strains, described by their gene content (number of copies), were the predictors. Then, a linear model was constructed with the raw selected metric as the response variable and only the strains selected by the LASSO model as predictors. The estimates of the strains' relative abundances were thus the coefficients of the linear model associated with the strains, transformed into relative values. For each metric, the sum of squared errors between the real relative abundances and the relative abundances estimated from the linear model was computed. The best metric was then defined as the one minimizing this sum of squared errors.

For the mixtures containing E. coli K-12 MG1655, the three expected strains were selected and thus detected using LASSO, except for the mixture containing only 1,000 reads of K-12 MG1655 (representing 0.2% of the mixture, hence negligible). For all the mixtures, the best metric was the mean abundance computed from the node abundances, taking into account the multiple mapped reads.

For the mixtures containing E. coli BL21-DE3, with BL21-DE3 being absent from the reference but very close to K-12 MG1655, we expected to get some detection of K-12 in the results. The three expected strains were selected and thus detected using LASSO, except for the mixture containing only 1,000 reads of BL21-DE3 (representing 0.2% of the mixture, hence negligible).
For the mixtures with 200,000, 100,000, and 50,000 reads of BL21-DE3, the best metric was the mean abundance computed from the node abundances without the abundances at zero, taking into account the multiple mapped reads. For the others, the best metric was the mean abundance computed from the node abundances (including the abundances at zero), taking into account the multiple mapped reads.

This approach using linear models was particularly appropriate for this situation, where the reference variation graph and the sample contained a small number of strains and thus a small number of predictors for the model. However, it can hardly be transposed to a whole metagenomic sample with various species and various strains, which would lead to too many predictors and would probably confuse the heuristics behind the models. This was confirmed by applying the same methodology to the mock dataset, which led to abundance estimates hardly comparable to the expected ones. Compared to the Kraken2 results, the sum of squared errors of our methodology was approximately 6, whereas for the results with the LASSO model it was around 236. Nevertheless, those results highlighted the relevance of (i) using a metric that takes into account the multiple mapped reads and not only the uniquely mapped reads, and (ii) using our abundance metric based on the node abundances rather than raw read counts.
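The two-step validation procedure of this section (LASSO selection on gene presence/absence, then a linear model without intercept on the raw metric) can be sketched in pure Python. This is only an illustration under stated assumptions: the coordinate-descent solver, the penalty value `alpha=0.1`, the function names, and the toy gene-by-strain copy-number matrix are all hypothetical and are not taken from the StrainFLAIR implementation.

```python
def soft_threshold(value, penalty):
    """Shrink a value toward zero by `penalty` (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_no_intercept(X, y, alpha=0.0, sweeps=500):
    """Coordinate descent for (1/2n)*||y - X b||^2 + alpha*||b||_1, intercept forced to zero.

    With alpha=0 this is ordinary least squares solved iteratively.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            norm = 0.0
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm else 0.0
    return beta


def estimate_relative_abundances(strains, copies, metric, alpha=0.1):
    """Return {strain: relative abundance in %} for the strains kept by the LASSO step."""
    # Step 1: LASSO on gene presence/absence selects candidate strains.
    presence = [1.0 if value > 0 else 0.0 for value in metric]
    lasso_coefs = fit_no_intercept(copies, presence, alpha=alpha)
    selected = [j for j, coef in enumerate(lasso_coefs) if coef > 0]
    # Step 2: linear model on the raw metric, restricted to the selected strains.
    reduced = [[row[j] for j in selected] for row in copies]
    coefs = fit_no_intercept(reduced, metric, alpha=0.0)
    total = sum(coefs)
    if total <= 0:
        return {}
    return {strains[j]: 100.0 * c / total for j, c in zip(selected, coefs)}


# Toy data (hypothetical): one core gene plus two specific genes per strain.
# Rows are genes, columns are copy numbers in strains A, B, C; strain C is absent.
strains = ["A", "B", "C"]
copies = [[1, 1, 1], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
metric = [10.0, 6.0, 6.0, 4.0, 4.0, 0.0, 0.0]
result = estimate_relative_abundances(strains, copies, metric)
```

In this toy case the LASSO step drops strain C, whose specific genes are never covered, and the refit distributes the abundance between A and B. As the text notes, such a model is workable with a handful of strains but is unlikely to behave well with the many predictors of a full metagenomic reference.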
S1.4 Performances

Our benchmarks were performed on the GenOuest platform on a machine with 48 Xeon E5-2670 2.30 GHz with 500 GB of memory and 16 CPUs. Time results (Table S1) are wall-clock times. We provide rough computation times, mainly to show that StrainFLAIR can be applied to usual datasets.

Dataset    Step                 Items processed                          Time     Disk used (GB)  Max mem. (GB)
Simulated  Gene prediction      7 genomes                                0m20     0               1.2
           Gene clustering      34,011 genes                             0m22     0               0.36
           Graph construction   8,596 clusters                           2m44     0.04            1.31
           Graph concatenation  8,596 graphs                             0m51     0               0.25
           Graph indexation     1 graph                                  6m23     0.16            4.24
           Mapping reads        350,000 short reads                      15m15    0.16            0.99
           JSON conversion      1 GAMP file                              3m58     4.2             0.03
           JSON parsing         1 JSON file + 1 GFA file + 1 pickle file 12m44    0               0.55
           Abundance computing  1 Gene abundances table                  0m2      0               0.04
Mock       Gene prediction      91 genomes                               1m43     1.02            6.7
           Gene clustering      280,174 genes                            3m38     0.14            0.98
           Graph construction   270,712 clusters                         41m54    1.12            9.1
           Graph concatenation  270,712 graphs                           14m38    0               1.05
           Graph indexation     1 graph                                  75m19    1.98            30.4
           Mapping reads        21,389,196 short read pairs              147m28   7               17.5
           JSON conversion      1 GAMP file                              53m21    75              0.12
           JSON parsing         1 JSON file + 1 GFA file + 1 pickle file 110m44   0               5.7
           Abundance computing  1 Gene abundances table                  0m4      0               0.68

Table S1. StrainFLAIR performances on simulated and mock datasets.
S1.5 Distance between the selected genomes in the simulated experiment

We estimated the distance between the complete genomes of the selected strains using fastANI (Average Nucleotide Identity). FastANI uses an alignment-free algorithm to estimate the average nucleotide identity between pairs of sequences.

          K-12     IAI39    O104:H4  Sakai    SE15     Santai   BL21-DE3  RM8426
K-12      100      97.0652  98.3769  97.8703  96.8716  98.0362  98.9365   98.3657
IAI39     97.037   100      96.9742  96.7417  97.1289  96.9295  97.0197   96.8987
O104:H4   98.3059  96.9521  100      97.4788  96.8007  97.8896  98.249    98.7212
Sakai     97.7497  96.8627  97.5094  100      96.6657  98.1523  97.7455   97.6125
SE15      96.8453  97.1064  96.9211  96.7362  100      96.7575  96.8141   96.7763
Santai    98.0073  97.0372  97.9584  98.1797  96.8199  100      97.9279   97.9077
BL21-DE3  98.9983  97.1721  98.4048  97.8227  96.8448  97.9616  100       98.3204
RM8426    98.306   96.9037  98.6801  97.5815  96.6907  97.8353  98.2567   100

Table S2. Average nucleotide identity (%) between each pair of complete genome sequences from eight strains of E. coli, as computed by fastANI.

All pairs showed an identity greater than 95%, highlighting the strong similarities between the strains. We nevertheless considered 99% as an upper threshold: beyond it, sequences were deemed too similar to be distinguished, given the additional effect of sequencing errors. The fastANI results showed that none of the pairs exceeded this similarity threshold.

The strain E. coli BL21-DE3 was chosen as the unknown strain, while the seven others were used to build the reference pangenome graph. According to the fastANI results, the closest genome to BL21-DE3 among the present references is that of strain K-12, with a similarity of 98.9%.
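The choice of the closest reference can be reproduced from the values of Table S2 with a few lines of Python. This is a minimal sketch: the values are copied from the BL21-DE3 column of the table, while the variable and function names are hypothetical, and which fastANI direction (query versus reference) that column corresponds to is not stated in the text.

```python
# Upper similarity threshold used in the text: pairs above it would be
# considered too similar to be distinguished.
ANI_THRESHOLD = 99.0

# fastANI identity (%) between each reference strain and BL21-DE3,
# taken from the BL21-DE3 column of Table S2.
ani_with_bl21 = {
    "K-12": 98.9365,
    "IAI39": 97.0197,
    "O104:H4": 98.249,
    "Sakai": 97.7455,
    "SE15": 96.8141,
    "Santai": 97.9279,
    "RM8426": 98.2567,
}


def closest_reference(ani_by_strain):
    """Return the (strain, identity) pair with the highest identity."""
    name = max(ani_by_strain, key=ani_by_strain.get)
    return name, ani_by_strain[name]


closest, identity = closest_reference(ani_with_bl21)
# References that would be indistinguishable from BL21-DE3 under the threshold.
too_similar = [s for s, value in ani_with_bl21.items() if value > ANI_THRESHOLD]
```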
Hence we expected to find evidence of strain K-12 when analyzing a sample containing the unknown strain BL21-DE3.

S1.6 Detailed results from simulated datasets

#reads K-12 | Method | O104:H4 | IAI39 | K-12 | Sakai | SE15 | Santai | RM8426
1,000 | Expected | 59.88 | 39.92 | 0.2 | 0 | 0 | 0 | 0
1,000 | StrainFLAIR | 56.45 | 43.55 | 0 | 0 | 0 | 0 | 0
1,000 | Kraken2 | 38.91 | 60.72 | 0.22 | 0.04 | 0.07 | 0.03 | 0.02
5,000 | Expected | 59.41 | 39.6 | 0.99 | 0 | 0 | 0 | 0
5,000 | StrainFLAIR | 54.89 | 42.46 | 2.65 | 0 | 0 | 0 | 0
5,000 | Kraken2 | 38.61 | 60.25 | 0.99 | 0.04 | 0.07 | 0.03 | 0.02
10,000 | Expected | 58.82 | 39.22 | 1.96 | 0 | 0 | 0 | 0
10,000 | StrainFLAIR | 54.08 | 41.96 | 3.96 | 0 | 0 | 0 | 0
10,000 | Kraken2 | 38.26 | 59.69 | 1.9 | 0.04 | 0.07 | 0.03 | 0.02
25,000 | Expected | 57.14 | 38.1 | 4.76 | 0 | 0 | 0 | 0
25,000 | StrainFLAIR | 52.1 | 40.58 | 7.32 | 0 | 0 | 0 | 0
25,000 | Kraken2 | 37.23 | 58.1 | 4.51 | 0.04 | 0.07 | 0.03 | 0.02
50,000 | Expected | 54.55 | 36.36 | 9.09 | 0 | 0 | 0 | 0
50,000 | StrainFLAIR | 49.23 | 38.51 | 12.26 | 0 | 0 | 0 | 0
50,000 | Kraken2 | 35.63 | 55.6 | 8.62 | 0.04 | 0.07 | 0.03 | 0.02
100,000 | Expected | 50 | 33.33 | 16.67 | 0 | 0 | 0 | 0
100,000 | StrainFLAIR | 44.66 | 35.05 | 20.29 | 0 | 0 | 0 | 0
100,000 | Kraken2 | 32.8 | 51.19 | 15.85 | 0.04 | 0.07 | 0.03 | 0.02
200,000 | Expected | 42.86 | 28.57 | 28.57 | 0 | 0 | 0 | 0
200,000 | StrainFLAIR | 38.12 | 29.83 | 32.05 | 0 | 0 | 0 | 0
200,000 | Kraken2 | 28.31 | 44.18 | 27.35 | 0.04 | 0.08 | 0.03 | 0.02

Table S3. Reference strains relative abundances expected and computed by StrainFLAIR or Kraken2 for each simulated experiment with variable coverage of the K-12 MG1655 strain. Best results are shown in bold.

Table S3 provides exhaustive results on simulated datasets when all queried strains are indexed in the variation graph.
Table S4 provides exhaustive results on simulated datasets when one of the queried strains (BL21-DE3) is not indexed and is highly similar to strain K-12.

#reads BL21-DE3 | Method | O104:H4 | IAI39 | K-12 | Sakai | SE15 | Santai | RM8426
1,000 | Expected | 59.88 | 39.92 | (0.2) | 0 | 0 | 0 | 0
1,000 | StrainFLAIR | 56.47 | 43.53 | 0 | 0 | 0 | 0 | 0
1,000 | Kraken2 | 38.93 | 60.76 | 0.11 | 0.05 | 0.08 | 0.04 | 0.03
5,000 | Expected | 59.41 | 39.6 | (0.99) | 0 | 0 | 0 | 0
5,000 | StrainFLAIR | 56.45 | 43.55 | 0 | 0 | 0 | 0 | 0
5,000 | Kraken2 | 38.72 | 60.42 | 0.5 | 0.09 | 0.13 | 0.08 | 0.07
10,000 | Expected | 58.82 | 39.22 | (1.96) | 0 | 0 | 0 | 0
10,000 | StrainFLAIR | 56.45 | 43.55 | 0 | 0 | 0 | 0 | 0
10,000 | Kraken2 | 38.47 | 60.05 | 0.92 | 0.14 | 0.19 | 0.12 | 0.13
25,000 | Expected | 57.14 | 38.1 | (4.76) | 0 | 0 | 0 | 0
25,000 | StrainFLAIR | 54.09 | 41.71 | 4.2 | 0 | 0 | 0 | 0
25,000 | Kraken2 | 37.75 | 58.93 | 2.16 | 0.28 | 0.34 | 0.25 | 0.29
50,000 | Expected | 54.55 | 36.36 | (9.09) | 0 | 0 | 0 | 0
50,000 | StrainFLAIR | 52.74 | 40.62 | 6.65 | 0 | 0 | 0 | 0
50,000 | Kraken2 | 36.59 | 57.17 | 4.15 | 0.51 | 0.57 | 0.48 | 0.53
100,000 | Expected | 50 | 33.33 | (16.67) | 0 | 0 | 0 | 0
100,000 | StrainFLAIR | 50.47 | 38.64 | 10.89 | 0 | 0 | 0 | 0
100,000 | Kraken2 | 34.53 | 54.03 | 7.68 | 0.91 | 0.98 | 0.91 | 0.96
200,000 | Expected | 42.86 | 28.57 | (28.57) | 0 | 0 | 0 | 0
200,000 | StrainFLAIR | 46.95 | 35.34 | 17.72 | 0 | 0 | 0 | 0
200,000 | Kraken2 | 31.14 | 48.83 | 13.53 | 1.57 | 1.67 | 1.58 | 1.68

Table S4. Reference strains relative abundances expected and computed by StrainFLAIR or Kraken2 for each simulated experiment with variable coverage of the BL21-DE3 strain, absent from the reference graph. Because BL21-DE3 is 98.9% similar to strain K-12 (its highest similarity among the references), we expect reads from BL21-DE3 to map to this strain; the expected values for K-12 are therefore given in parentheses, as they correspond to BL21-DE3 abundances and not K-12.
Best results are shown in bold.

S1.7 Detailed results for validation on mock datasets

Figure S1. Experimental relative abundance compared to relative abundance computed by StrainFLAIR and Kraken2.

Figure S1 shows full results obtained on the mock dataset.

10_1101-2021_02_12_430989 ---- Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies

Tamar Sofer1,2,*, Nuzulul Kurniansyah1,*, François Aguet3, Kristin Ardlie3, Peter Durda4, Deborah A. Nickerson5, Joshua D. Smith5, Yongmei Liu6, Sina A. Gharib7, Susan Redline1, Stephen S. Rich8, Jerome I. Rotter9, Kent D.
Taylor9

1Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA
2Departments of Medicine and of Biostatistics, Harvard University, Boston, MA, USA
3The Broad Institute of MIT and Harvard, Cambridge, MA, USA
4Department of Pathology and Laboratory Medicine, Larner College of Medicine, University of Vermont, Burlington, VT, USA
5Department of Genome Sciences, University of Washington, Seattle, WA, USA
6Duke Molecular Physiology Institute, Department of Medicine, Division of Cardiology, Duke University Medical Center, Durham, NC, USA
7Computational Medicine Core, Center for Lung Biology, UW Medicine Sleep Center, Department of Medicine, University of Washington, Seattle, WA, USA
8Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
9The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
*These authors contributed equally to the work.

Correspondence: Tamar Sofer. Email: tsofer@bwh.harvard.edu. 221 Longwood Ave, Boston, MA 02115

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.12.430989; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

Large datasets of hundreds to thousands of individuals measuring RNA-seq in observational studies are becoming available. Many popular software packages for analysis of RNA-seq data were constructed to study differences in expression signatures in an experimental design with well-defined conditions (exposures).
In contrast, observational studies may have varying levels of confounding of the transcript-exposure associations; further, exposure measures may vary from discrete (exposed, yes/no) to continuous (levels of exposure), with non-normal distributions of exposure. We compare popular software for gene expression - DESeq2, edgeR, and limma - as well as linear regression-based analyses for studying the association of continuous exposures with RNA-seq. We developed a computation pipeline that includes transformation, filtering, and generation of empirical null distribution of association p-values, and we apply the pipeline to compute empirical p-values with multiple testing correction. We employ a resampling approach that allows for assessment of false positive detection across methods, power comparison, and the computation of quantile empirical p-values. The results suggest that linear regression methods are substantially faster with better control of false detections than other methods, even with the resampling method to compute empirical p-values. We provide the proposed pipeline with fast algorithms in R.

Introduction

Many studies of phenotypes associated with gene expression from RNA-seq consist of small sample sizes (tens of subjects) and are focused on comparisons of transcriptional expression patterns between well-delineated states, such as different experimental conditions, tumor versus non-tumor cells (1; 2), and disease vs non-disease groups (3).
Some studies are designed to identify differential expression across hidden, discrete conditions (4). Epidemiological cohorts have recently utilized stored samples to facilitate the use of RNA-seq data in studies of association with subclinical phenotypes such as blood biomarkers, imaging, and other physiological measures, with continuous measures often being used in statistical analyses.

High throughput RNA sequencing enables broad assaying of a sample’s transcriptome (5) and has been in increasing use for over a decade (6). A large variety of analytic and statistical approaches have been developed to address scientific questions such as alternative splicing, differential expression, and more (4; 7-11), often building on methods developed for analyses of expression microarrays (12-14); comprehensive reviews are available (15-19). In this work, we are specifically interested in differential expression analysis with continuous exposures, and we assume that count data are already prepared and available to the analyst.

Popular software packages for differential expression analysis include the DESeq2 R package (9), which models the expression counts as following a negative binomial distribution, with shrinkage imposed on both the mean and the dispersion parameters, based on estimates from the entire transcriptome, or user-supplied values. EdgeR (7) uses a negative binomial model similar to the DESeq2 model for transcript counts, in combination with overdispersion moderation. EdgeR was primarily designed for differential expression analysis between two groups when at least one of the groups has replicated measurements (20). Limma (21) uses linear models, which are very
flexible and can effectively accommodate many study designs and hypotheses. Similar to the DESeq2 and edgeR packages, Limma also uses an empirical Bayes method to borrow information across transcripts to estimate a global variance parameter that is applied for the computation of variance parameters of each single transcript. It uses log transformation and weighting, known as the “voom” transformation, in the final linear model that is used for differential expression analysis. We refer to it henceforth as limma-voom. Prior to differential expression analysis, library normalization is performed (22). Popular approaches are the TMM (trimmed-means of M-values) normalization (23), implemented in edgeR, and the size factors normalization (24), implemented in DESeq2.

Sleep disordered breathing phenotypes, such as the Apnea-Hypopnea Index (AHI), the number of apnea and hypopnea events per hour of sleep, provide a quantitative assessment of the severity of the disorder, with no clear threshold above which different biological processes occur (although thresholds are used for clinical decision making and health insurance reimbursement). Association analysis with continuous exposures presents different challenges than those traditionally encountered. The distribution of such exposures may have strong effects on the association analysis results, regardless of the underlying associations, due to the combination of skewed exposure distributions and the distribution of RNA-seq read count data, which are generally over-dispersed with occasional extreme values. As observational study data analyses may include covariates, statistical methods from experimental studies (e.g., exact tests) cannot be applied.
In this manuscript, we compare the DESeq2, edgeR, and limma-voom analysis approaches for differential expression analysis with linear regression-based approaches that do not use the empirical Bayes approach for estimating variance parameters across the transcriptome. We study the computation of p-values using resampling of phenotype residuals, while preserving the structure of the data. This addresses the limitation of permutation noted by others in the context of differential expression analysis of RNA-seq (21), where permutation may not be tuned to test a specific null hypothesis because in its standard form it “breaks” all relationships between the permuted variable and the rest of the dataset. Finally, we study the use of empirical p-values that tune the original p-values based on the residual resampling scheme. Throughout, we use a dataset with sleep disordered breathing phenotypes and RNA-seq from the Multi-Ethnic Study of Atherosclerosis as a case study. We demonstrate the statistical implications of performing association analysis of RNA-seq with continuous, non-normal exposures, compare analysis methods, and develop recommendations.
Methods

The Multi-Ethnic Study of Atherosclerosis (MESA)

MESA is a longitudinal cohort study, established in 2000, that prospectively collected risk factors for development of subclinical and clinical cardiovascular disease among participants in six field centers across the United States (Baltimore City and Baltimore County, MD; Chicago, IL; Forsyth County, NC; Los Angeles County, CA; Northern Manhattan and the Bronx, NY; and St. Paul, MN). The cohort has been studied every few years. The present analysis considers N = 462 individuals who participated in a sleep ancillary study performed shortly following the participants' Exam 5 during 2010-2013 (25; 26), with RNA-seq measured via the Trans-Omics in Precision Medicine (TOPMed) program. Here, we used RNA-seq data with RNA extracted from whole blood drawn in Exam 5 (2010-2012). Sleep data were collected using standardized full in-home level-2 polysomnography (Compumedics Somte Systems, Abbotsville, Australia, AU0), as described before (26). Of the 462 participants in the current analysis, there were 196 African-Americans (AA), 259 European-Americans (EA) and 125 Hispanic-Americans (HA). RNA sequencing in MESA is briefly described in the Supplementary Materials.
Sleep disordered breathing measures

As examples of continuous exposures from population-based studies, we took three sleep disordered breathing measures: (1) the Apnea-Hypopnea Index (AHI), defined as the number of apnea (breathing cessation) and hypopnea (at least 30% reduction of breath volume, accompanied by 3% or higher reduction of oxyhemoglobin saturation compared to the baseline saturation) events per hour of sleep; (2) minimum oxyhemoglobin saturation during sleep (MinO2); and (3) average oxyhemoglobin saturation during sleep (AvgO2). We chose these traits because they are clinically relevant, often used in sleep research studies, and represent exposures that may alter gene expression (via hypoxemia and sympathetic activation). The AHI had the least skewed distribution of the considered phenotypes, and AvgO2 had the longest “tail” of small values in the residual distribution. Residuals were obtained by regressing the sleep measures on age, sex, body mass index (BMI), study center, and self-reported race/ethnic group.

Compared tests of associations between exposure and transcripts

We compared the standard packages DESeq2, edgeR, and limma with linear regression-based approaches, in which we always applied log transformation on the transcript counts, and then applied linear regression. Because some of the observed transcript count values are zero, which cannot be log transformed, we compared a few approaches for replacing zero values.
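A minimal Python sketch of zero handling before the log transformation, covering the three variants compared in this section (SubHalfMin, AddHalfMin, AddHalf). It assumes that the half-minimum is taken per transcript over its strictly positive counts; the function names and the toy counts are illustrative only and are not taken from the authors' R code.

```python
import math


def min_positive(counts):
    """Smallest strictly positive count observed for one transcript."""
    positives = [c for c in counts if c > 0]
    if not positives:
        raise ValueError("transcript has no positive counts")
    return min(positives)


def sub_half_min(counts):
    """SubHalfMin: replace only the zeros by half the minimum positive count."""
    half = min_positive(counts) / 2
    return [half if c == 0 else c for c in counts]


def add_half_min(counts):
    """AddHalfMin: shift every value by half the minimum positive count."""
    half = min_positive(counts) / 2
    return [c + half for c in counts]


def add_half(counts):
    """AddHalf: shift every value by 1/2."""
    return [c + 0.5 for c in counts]


def log2_transform(counts, replace_zeros=sub_half_min):
    """Apply a zero-handling rule, then log2, to one transcript's counts."""
    return [math.log2(v) for v in replace_zeros(counts)]


# Toy counts for one transcript across five individuals (hypothetical values).
toy_counts = [0, 4, 0, 10, 6]
for rule in (sub_half_min, add_half_min, add_half):
    print(rule.__name__, log2_transform(toy_counts, rule))
```

SubHalfMin leaves non-zero counts untouched, whereas the two additive variants shift the whole distribution of a transcript; for highly expressed transcripts the shift is negligible on the log scale.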
For a given transcript $j$, denote the minimum observed transcript level that is higher than zero by $m_j = \min\{t_{1j}, \dots, t_{nj} : t_{ij} > 0 \text{ for } i = 1, \dots, n\}$, where $t_{ij}$ is the count of transcript $j$ in individual $i$. We compare the following approaches, applied on each transcript $j$, $j = 1, \dots, k$, separately:

A1. SubHalfMin: Replace zero values with $m_j / 2$.
A2. AddHalfMin: Replace all values $t_{ij}$ by $t_{ij} + m_j / 2$.
A3. AddHalf: Replace all values $t_{ij}$ by $t_{ij} + 1/2$.

Conceptual framework for studying analysis approaches

To study performance of various analysis approaches, we performed simulation studies. Simulation study 1 was used to assess type 1 error across methods when using output p-values, and when using “empirical p-values”, which are p-values that account for the true distribution of the p-values under the null and are described later. Simulation study 2 was used to assess power in transcriptome-wide analysis settings, when using methods that control the type 1 error according to simulation study 1. In addition, we performed a simulation study (Supplementary Materials) to assess power for testing of individual transcripts according to various distributional characteristics of transcript counts. The goal was to identify approaches for filtering transcripts for association analysis that will optimize power. All simulations used a “residual permutation” (below). The reported criteria for declaring differentially expressed transcripts were False Discovery Rate (FDR) controlling p-values <0.05 based on the Benjamini-Hochberg (BH) procedure, and based
on the local FDR procedure implemented in the qvalue R package, Family-Wise Error Rate (FWER) controlling p-values <0.05 based on the Holm procedure, and an arbitrary threshold of p-value < $10^{-5}$.

Residual permutation approach for simulations and for empirical p-value computation

To generate realistic simulation studies in which: (a) the data structure, including the exposure, covariates, and outcome distributions; and (b) their relationships, aside from the exposure-outcome association, are the same as in the real data, we used a residual permutation approach. We regressed each sleep exposure of interest $y$ on the covariates $X$ and estimated their effect $\hat{\beta}$. We then obtained residuals, defined as: $e = y - X\hat{\beta}$. To study type 1 error, we permuted these residuals at random to obtain $\mathrm{perm}(e)$, and generated a sleep exposure unassociated with any of the RNA-seq measures by: $y_{\mathrm{perm}} = X\hat{\beta} + \mathrm{perm}(e)$. We repeated this procedure 1000 times for evaluating type 1 error control.

We generated simulated data under four power simulations in a similar approach, with the difference that we forced a specific correlation value between the simulated sleep exposure and a specific transcript. To this end, for a given transcript $j$ measured on individuals $i = 1, \dots, n$, we computed the rank of each individual: $\mathrm{rank}(t_{1j}), \dots, \mathrm{rank}(t_{nj})$. To set a correlation $\rho$ between the simulated $y$ and transcript $j$, we sampled $\rho \times n$ (rounded) indices from $1, \dots, n$, corresponding to $\rho \times n$ individuals for which we forced their ranks in the permuted residual values, now denoted by $\mathrm{perm}_\rho(e)$, to be the same as their ranks in the transcript values (note that the transcript values are never changed). For the rest of the individuals, the permuted residuals are completely random.
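The residual permutation scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R implementation: the fitted values (the covariate part of the exposure) are assumed to be supplied by any regression routine, ties in transcript counts are broken at random, and all function names are hypothetical.

```python
import random


def residual_permutation(exposure, fitted, rng):
    """Null simulation: shuffle residuals and add them back to the fitted values."""
    residuals = [y - f for y, f in zip(exposure, fitted)]
    rng.shuffle(residuals)
    return [f + e for f, e in zip(fitted, residuals)]


def random_tie_ranks(values, rng):
    """0-based ranks of values, with ties broken uniformly at random."""
    order = sorted(range(len(values)), key=lambda i: (values[i], rng.random()))
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def rank_matched_permutation(exposure, fitted, transcript, rho, rng):
    """Power simulation: round(rho * n) individuals receive the residual whose
    rank equals their transcript rank; the others receive the remaining
    residuals in random order."""
    n = len(exposure)
    sorted_res = sorted(y - f for y, f in zip(exposure, fitted))
    ranks = random_tie_ranks(transcript, rng)
    forced = set(rng.sample(range(n), round(rho * n)))
    used = {ranks[i] for i in forced}
    leftover = [sorted_res[r] for r in range(n) if r not in used]
    rng.shuffle(leftover)
    rest = iter(leftover)
    permuted = [sorted_res[ranks[i]] if i in forced else next(rest) for i in range(n)]
    return [f + e for f, e in zip(fitted, permuted)]


# Toy data (hypothetical): five individuals, fitted values from some regression.
rng = random.Random(1)
toy_exposure = [1.5, 1.0, 3.25, 6.0, 4.5]
toy_fitted = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_transcript = [30, 10, 50, 20, 40]
print(residual_permutation(toy_exposure, toy_fitted, rng))
print(rank_matched_permutation(toy_exposure, toy_fitted, toy_transcript, 0.6, rng))
```

Both functions return a permutation of the original residuals added to unchanged fitted values, so the exposure-covariate relationship is preserved in every simulated exposure.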
When multiple individuals have the same transcript counts (i.e., their ranks are tied), we randomly assign their ranks. For example, if 100 people have zero counts for a given transcript, each of these individuals will be equally likely to have the rank of 1, 2, …, or 100. The code for generating this residual permutation approach is provided in the Supplementary Information and in a dedicated GitHub repository https://github.com/nkurniansyah/RNA-Seq_continuous_exposure.

Empirical p-values to account for the null distribution of p-values

We used the residual permutation approach, under the null hypothesis, to generate a null distribution of p-values and to compute empirical p-values. When the distribution of p-values under the null hypothesis is unknown, and specifically when it is not uniform, their values are not reliable for hypothesis testing. Alternative approaches compute “empirical p-values” with the goal of generating an appropriate p-value distribution, i.e., one in which an empirical p-value $p$ satisfies $\Pr(p \le 0.05 \mid H_0) = 0.05$ (Supplementary Materials). For computing empirical p-values, we use a relatively small number of residual permutations (in comparison to the number of permutations used for computing permutation p-values) followed by transcriptome-wide association studies. We use the results of these transcriptome-wide tests under permutation to compute the null distribution of p-values, which is then used to compute the empirical p-values.
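One simple way to turn a pooled null distribution of p-values into empirical p-values is to take, for each observed p-value, the fraction of null p-values at or below it. The Python sketch below illustrates that idea only; the function name is hypothetical, the pooling of all transcripts and permutations into one list is an assumption, and it does not reproduce the qvalue package.

```python
from bisect import bisect_right


def quantile_empirical_pvalues(observed, null_pvalues):
    """Empirical p-value = fraction of pooled null p-values <= the observed one."""
    null_sorted = sorted(null_pvalues)
    n_null = len(null_sorted)
    if n_null == 0:
        raise ValueError("the null distribution is empty")
    return [bisect_right(null_sorted, p) / n_null for p in observed]


# Hypothetical pooled null p-values (e.g., from all transcripts in all
# residual permutations) and three observed raw p-values.
toy_null = [0.01, 0.02, 0.05, 0.2, 0.5, 0.5, 0.7, 0.9, 0.95, 1.0]
toy_observed = [0.05, 0.001, 0.5]
print(quantile_empirical_pvalues(toy_observed, toy_null))
```

Because the null p-values are pooled, the resolution of the empirical p-values grows with the number of transcripts times the number of permutations, which is why a fairly small number of permutations can suffice. An observed p-value smaller than every null value maps to 0 here; adding one to the numerator and denominator is a common variant that avoids this.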
We compare two types of empirical p-values: quantile empirical p-values, and Storey empirical p-values implemented in the qvalue R package (27). The quantile empirical p-value approach is inspired by previously proposed procedures based on permutation (28) of phenotypes (rather than residuals). It estimates the null distribution of p-values non-parametrically, and the quantile empirical p-value is the quantile of the raw p-value in this distribution. The Storey empirical p-value uses the null distribution of the test statistics to identify whether a transcript is likely sampled from the null or a non-null distribution. Both implementations assume that the empirical null distribution is the same for all transcripts. We used 100 residual permutations to compute test statistics and p-values under the null and compared the empirical p-values to standard permutation p-values.

Resampling approach for binary exposure phenotypes

We compared the analysis of a continuous exposure to that of a dichotomized variable. Instead of a sleep measure, we used body mass index (BMI), because it is known to have a large impact on gene expression and is therefore a powerful phenotype for such a comparison. BMI was dichotomized to “obese” if BMI ≥ 30 kg/m² and non-obese otherwise. Because obesity is binary, the residual permutation approach proposed for continuous variables is not appropriate; instead, we generated a binomial obesity variable based on the probability of obesity given covariates.
Given a logistic model $\mathrm{logit}[p(\mathrm{obese}_i = 1)] = x_i^T \alpha$, we estimated the covariates’ association parameters $\hat{\alpha}$ and obtained estimated probabilities for obesity for each person $i = 1, \dots, n$ by $\hat{p}(\mathrm{obese}_i = 1) = \mathrm{expit}(x_i^T \hat{\alpha})$. Based on these estimated outcome probabilities, we sampled random obesity status as binomial variables.

Results

MESA participant characteristics are provided in the Supplementary Material, Table S1. The distribution of the raw phenotypes AHI, MinO2, and AvgO2, and their residuals after regression on covariates is provided in Figure 1, demonstrating the high non-normality. Simulations were performed after normalizing the data so that each library has the same size (prior to filtering), which we set to the median observed value (i.e., median normalization) in the raw reads, or 23,210,672. Results for some of the settings in simulation study 1 under TMM and size factor normalizations are provided in the Supplementary Materials.

Simulation study 1: type 1 error analysis

After normalization, we applied filters to remove lowly expressed transcripts. There were 58,311 transcripts. After applying filters requiring that (a) the maximum read count is >10 and that (b) the proportion of individuals with zero counts for a transcript across the sample is not higher than 0.75 (see Supplementary Materials for more information on filters), 23,004 transcripts were available for the simulation study.
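The median library-size normalization and the two expression filters described above can be sketched in Python as below. The layout (a list of rows, one per transcript, with one column per individual), the function names, and the toy matrix are assumptions made for illustration; the thresholds are the ones quoted in the text.

```python
from statistics import median


def median_normalize(counts):
    """Rescale each library (column) so that its total equals the median library size."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[s] for row in counts) for s in range(n_samples)]
    target = median(lib_sizes)
    return [[row[s] * target / lib_sizes[s] for s in range(n_samples)] for row in counts]


def keep_transcript(row, min_max_count=10, max_zero_fraction=0.75):
    """Keep a transcript if its maximum count exceeds min_max_count and the
    proportion of individuals with zero counts is not higher than max_zero_fraction."""
    zero_fraction = sum(1 for c in row if c == 0) / len(row)
    return max(row) > min_max_count and zero_fraction <= max_zero_fraction


# Toy count matrix (hypothetical): 4 transcripts x 4 individuals.
toy_counts = [
    [0, 0, 0, 40],
    [5, 10, 15, 20],
    [93, 186, 83, 136],
    [2, 4, 2, 4],
]
normalized = median_normalize(toy_counts)
kept = [row for row in normalized if keep_transcript(row)]
print(len(kept), "of", len(normalized), "transcripts kept")
```

In this sketch the filters are applied to the normalized counts, following the order given in the text (normalization first, then filtering).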
We used residual permutation to generate simulated sleep disordered breathing (SDB) phenotypes that are not associated with the transcripts, but maintain the same correlation structure with the transcripts and covariates. We generated 100 datasets with simulated SDB phenotypes, and performed analyses. Complete results showing the average number of false positive detections based on the existing packages limma, edgeR, and DESeq2, as well as the three linear regression analyses described here, are provided in Supplementary Figures S3-S5. These results include comparisons of raw p-values, the proposed quantile empirical p-values, and the empirical p-values provided in the qvalue R package (27), for the three SDB phenotypes. We found that the number of false positives varies with the exposure phenotype, with analyses of MinO2 (Figure 2) generally resulting in more false positive detections than analyses of the AHI, with intermediate numbers for AvgO2 (Figures S3-S5 in the Supplementary Materials). Figure 2 compares the average number of falsely discovered transcript associations when using simulated sleep phenotypes mimicking MinO2 using the residual permutation approach by focusing on limma, edgeR, DESeq2, and linear regression applied on log2 of expression counts with SubHalfMin. For each method, type I error was determined using raw p-values and Storey empirical p-values, with significance thresholds based on Benjamini-Hochberg (BH) FDR, local FDR (qvalue package), and Holm Family-Wise Error Rate (FWER).
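For reference, the BH and Holm adjustments named above can be written in a few lines. The Python sketch below is a generic textbook implementation, not the code used to produce the figures, and the local FDR procedure of the qvalue package is not reproduced here; the function names and example p-values are illustrative.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up, controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm_adjust(pvalues):
    """Holm adjusted p-values (step-down, controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p-value up, enforcing monotonicity.
    for rank, i in enumerate(order, start=1):
        running_max = max(running_max, min(1.0, (m - rank + 1) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values for four transcripts.
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
print("BH:  ", bh_adjust(toy_pvalues))
print("Holm:", holm_adjust(toy_pvalues))
```

A transcript is declared significant when its adjusted p-value falls below the chosen level (0.05 in the text); Holm is never less conservative than BH on the same input.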
Empirical p-values usually reduced the number of false detections, with the method in the qvalue package being usually more conservative than the quantile-based empirical p-values method. Compared to linear regression-based approaches, DESeq2, edgeR and limma-voom had many false detections when using the raw p-values, even after applying multiple testing corrections. The three linear regression-based methods described here were quite similar, with the AddHalf approach often resulting in slightly more false detections. Based on these results, we chose to move forward for the next set of simulations with linear regression with SubHalfMin for handling of zero counts.

Simulation study 2: power analysis

We performed simulations that mimic transcriptome-wide analysis to assess power. Based on simulations comparing power by transcript distributional characteristics (see Supplementary Materials), we only considered 19,742 transcripts for which no more than 50% of the sample had zero counts. We chose two transcripts, and for each of these and each of the sleep phenotypes, we performed 100 simulations in which we used the residual permutation approach to generate association between the sleep phenotype and the transcript with correlation $\rho = 0.3$. We performed transcriptome-wide association analysis using DESeq2, edgeR, and linear regression with SubHalfMin transformation (limma-voom was not used, given its high rate of false
positive detections in some of the settings in simulation study 1). For power, we always used empirical p-values (both types) and determined whether the specific transcript of interest passed the significance threshold based on FDR-adjusted (29) empirical p-value < 0.05. Power was defined as the proportion of the simulations in which the association was significant, and was consistently higher for the linear regression-based approach compared to DESeq2 or edgeR. For linear regression, the quantile empirical p-values performed essentially the same as Storey’s empirical p-values, while Storey’s empirical p-values resulted in substantially higher statistical power when using DESeq2 and edgeR. We illustrate power comparisons in Figure 3 using Storey’s empirical p-values. Power comparisons using quantile empirical p-values are provided in the Supplementary Materials Figure S8.

Proposed analysis approach

Based on the above simulation studies, we developed an analytic pipeline as depicted in Figure 4: (a) the raw read counts are normalized; (b) filters are applied to remove lowly expressed transcripts and those for which the statistical power is low, as determined by simulations; (c) AddHalfMin transformation is applied for each individual separately, then log transformation is applied on all transcripts; (d) association analysis is performed using linear regression to compute effect sizes and p-values; (e) permutations are computed 100 times on exposure residuals after regressing on covariates, to generate simulated traits that maintain the data structure; (f) each of the 100 vectors of simulated traits is analyzed using the same approach as the raw trait, generating p-values; (g) p-values from the analysis of the 100 simulated traits are combined to generate an empirical null distribution of p-values, that are used to generate
empirical p-values for the raw trait using the qvalue package, and (h) multiple testing correction is applied on the empirical p-values.

Comparison of analysis of continuous BMI with analysis of dichotomous obesity status

We compared the differential expression of transcripts in analysis of BMI and obesity. The residual permutation procedure was used and quantile empirical p-values were generated for both analyses. A total of 925 MESA individuals had a BMI measure available and, for analysis, at least 50% non-zero transcripts were required. For obesity, several non-zero transcript thresholds were examined: 50%, 40%, and 30%. The results were similar for all thresholds, resulting in many more identified transcript associations (446 vs. 251) with continuous BMI compared to using a dichotomous trait (Supplementary Information Figure S9).

Computing time comparison

The compute time for a transcriptome-wide association study was obtained for analyses using DESeq2, edgeR, and our linear regression implementation. Using our linear regression implementation on a single core, a single transcriptome-wide association study applied on ~19K transcripts and N=462 individuals took less than a minute; when 100 transcriptome-wide association studies applied to residual permutations were included to compute empirical p-values, the time reached 7 minutes, and the maximum memory used was 1.3GB. In comparison, DESeq2 took 53.5 minutes and edgeR took 18.8 minutes for a single transcriptome-wide association study. The maximum memory used for DESeq2 and edgeR was similar at 3.1GB.
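Returning to the proposed analysis approach, two of its steps, the zero-handling and log transformation in (c) and the multiple testing correction in (h), can be illustrated with a minimal Python sketch. This is not the Olivia R package code. The function names are hypothetical, and the zero-replacement rule (half of the smallest non-zero value in the supplied vector) is an assumed reading of the transformation name; whether the vector runs over a transcript or over an individual is left to the caller. The Benjamini-Hochberg step follows the standard procedure (29).

```python
import math


def replace_zeros_half_min(counts):
    """Replace zeros by half of the smallest non-zero value.

    Assumption: this is one plausible reading of the zero-handling
    transformation named in the text, not a confirmed definition.
    """
    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        raise ValueError("all counts are zero; filter such vectors first")
    fill = min(nonzero) / 2.0
    return [c if c > 0 else fill for c in counts]


def log2_transform(counts):
    """Zero replacement followed by a log2 transformation."""
    return [math.log2(c) for c in replace_zeros_half_min(counts)]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(log2_transform([0, 4, 8]))
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

In the pipeline as described, the adjustment would be applied to the empirical p-values rather than to the raw ones.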
R package

Code for implementing the proposed procedure and for a shiny app is provided in the GitHub repository https://github.com/nkurniansyah/Olivia. The code also provides a test of multiple exposure variables at the same time, which applies the multivariate Wald test, and an efficient implementation of a permutation test when considering a single transcript, rather than a transcriptome-wide analysis. The repository also includes code used for simulations.

Data availability

MESA data are available through application to dbGaP. Phenotypes are available in MESA study accession phs000209.v13.p3, and RNA-seq data have been deposited and will become available through the TOPMed-MESA study accession phs001416.v2.p1.

Discussion

We systematically assessed the approaches for studying the association of gene expression, estimated using RNA sequencing, with continuous and non-normally distributed exposure phenotypes. We found that linear regression-based analysis performs well for continuous phenotype associations, and is computationally highly efficient. We used a residual permutation approach to study the distribution of p-values under the null of no association between the phenotypes and RNA-seq, and used this approach to further study power, and to compute empirical p-values. Notably, the residual permutation approach allows for the dataset to have the same correlation structures and associations between the phenotypes and the transcripts and covariates, while eliminating the transcript-phenotype associations.
We implemented this approach in an R package and developed an R shiny app, to make our pipeline easily accessible to the research community.

Recently, van Rooij et al. (30) also performed a benchmarking study comparing analysis approaches for transcriptome-wide analysis of RNA-seq in population-based studies, including when using continuous phenotypes in association testing. While we used similar statistical methods to theirs, we took a different analytical approach. Van Rooij et al. used multiple datasets to apply association analysis between a phenotype and transcripts, and assessed replication between analyses. We, on the other hand, leveraged simulations to generate data under a known association structure. In addition, we were motivated by a specific problem: highly non-normal sleep exposure measures, often leading to suboptimal control of Type 1 error. Thus, it was critical to assess control of false discovery under the null hypothesis. Notably, sleep phenotypes are less often available and, to our knowledge, there are no other large observational study data sets with both RNA-seq measures and similar SDB phenotypes. Some of our findings are similar to those of van Rooij et al.: they also recommend using linear regression analysis, and they also found that using a continuous phenotype is generally more powerful than dichotomizing it (in agreement with what is known from the statistical literature). Similarly, they found that the normalization method had very little effect on the results.
However, they recommend testing all genes, while we recommend filtering out transcripts with at least 50% zero counts, based on our power simulations. Additional future work is needed to evaluate various filtering criteria, and to develop methods that allow for flexible, non-linear modeling of the association between phenotype and gene expression while remaining computationally efficient to allow for permutation analysis.

We propose to compute p-values under the null hypothesis of no association between the transcript and the exposure phenotype by permuting residuals of the exposure phenotype after regressing on covariates, and reconstructing the exposure by summing the permuted residuals with the estimated mean, thus maintaining the overall data structure except for the exposure-outcome association of interest. Outside the gene expression literature, others have proposed to permute residuals rather than the outcome. For example, previous permutation methods proposed to permute residuals of the outcome after regressing on covariates (31), or to permute the residuals of the exposure phenotypes without constructing a new exposure phenotype by summing the permuted residuals with the estimated mean (32). It will be interesting to perform a more comprehensive study of statistical permutation approaches for RNA-seq association analyses, as well as studying them in the context of mixed models.
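The residual permutation idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes a single numeric covariate and an ordinary least squares fit, whereas the analysis in the text adjusts for several covariates. The function names are hypothetical.

```python
import random


def fitted_values(covariate, exposure):
    """Fitted values from a simple least squares fit of exposure on one covariate."""
    n = len(covariate)
    mean_z = sum(covariate) / n
    mean_x = sum(exposure) / n
    szz = sum((z - mean_z) ** 2 for z in covariate)
    szx = sum((z - mean_z) * (x - mean_x) for z, x in zip(covariate, exposure))
    slope = szx / szz
    intercept = mean_x - slope * mean_z
    return [intercept + slope * z for z in covariate]


def residual_permutation(exposure, covariate, rng):
    """Simulate an exposure under the null by permuting regression residuals.

    The estimated mean (fitted values) is kept in place for every
    individual; only the residuals are shuffled across individuals and
    then added back to the fitted values.
    """
    fitted = fitted_values(covariate, exposure)
    residuals = [x - f for x, f in zip(exposure, fitted)]
    permuted = residuals[:]
    rng.shuffle(permuted)
    return [f + r for f, r in zip(fitted, permuted)]


if __name__ == "__main__":
    rng = random.Random(0)
    exposure = [1.0, 2.5, 2.0, 4.5, 5.0]
    covariate = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(residual_permutation(exposure, covariate, rng))
```

Because the fitted part stays attached to each individual, the simulated exposure keeps its relationship with the covariate, while any link between the residual part and a transcript is broken by the shuffle.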
We recommend using empirical p-values, which require 100 residual permutations, and therefore performing 101 transcriptome-wide association analyses instead of one. Considering Figures S3-S5 in the Supplementary Information, one can see that in most settings, linear regression methods do not have many false positive detections even when raw p-values are used. However, we chose to be more conservative by strongly protecting the analysis from false positive detections. Importantly, the linear regression analysis with empirical p-values had higher power than the other common approaches (DESeq2, edgeR), indicating simultaneous improvement in controlling false positives and increasing power.

Unfortunately, we cannot effectively estimate the FDR in these simulations. FDR is defined as the proportion of false discoveries out of all discovered (significant) associations. In simulation study 1, none of the transcripts were associated with the outcomes, so that any estimated FDR would be 100%. Under the alternative, one could suggest using the number of wrongly discovered associations to estimate the FDR. However, many transcripts are highly correlated with the one simulated to be associated with the exposures, and are therefore associated with the exposure by design, and thus the number of transcripts falsely detected as associated with the exposure cannot be easily determined.

The empirical p-values procedure uses p-values from the entire tested transcriptome to compute the empirical null distribution.
This encapsulates the assumption that the null distribution of p-values is the same for all transcripts, which is generally a limitation, but has been shown to be often acceptable since it will lead to less power, rather than increasing the number of false detections (33; 34). An approach that does not require this assumption estimates the null distribution of the p-value for each transcript separately, which is a standard permutation approach. We investigated this issue by comparing the quantile empirical p-values with the permutation p-values that use 100,000 residual permutations to estimate the null distribution of the p-value of each transcript separately (Figure S2 in the Supplementary Materials). The two p-value distributions are very similar. Therefore, a computationally expensive permutation approach, as well as other approaches proposed by investigators, such as estimating null distributions across sets of transcripts with similar properties (33; 35), are likely unnecessary and not superior to the computationally efficient empirical p-values method.

Another approach for estimating the null distribution of p-values uses the primary results, without any permutation (36; 37). These approaches also use the assumption that the null p-value distribution is the same across transcripts (i.e. a shared null distribution exists). Given the computationally fast implementation of the transcriptome-wide association study, we believe that using residual permutation is beneficial because it allows for a more precise quantification of the null p-value distribution.
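To make the shared-null idea concrete, the following Python sketch computes an empirical p-value for each observed p-value as its quantile within the pooled null p-values from all permutation analyses. The function name is hypothetical, and the add-one correction in numerator and denominator is an assumption chosen here to avoid empirical p-values of exactly zero; the text does not state the exact formula used by the authors.

```python
from bisect import bisect_right


def quantile_empirical_pvalues(observed, null_pvalues):
    """Empirical p-values from a pooled (shared) null distribution.

    For each observed p-value, count the pooled null p-values that are
    less than or equal to it, then divide by the pool size. The +1 terms
    are an assumed small-sample correction, not taken from the source.
    """
    null_sorted = sorted(null_pvalues)
    n = len(null_sorted)
    return [(bisect_right(null_sorted, p) + 1) / (n + 1) for p in observed]


if __name__ == "__main__":
    # Pooled null p-values would come from all transcripts in all
    # permutation analyses; a tiny toy pool is used here.
    null_pool = [0.1, 0.2, 0.3, 0.4]
    print(quantile_empirical_pvalues([0.05, 0.25, 0.4], null_pool))
```

With roughly 19K transcripts and 100 permutations the pool holds about two million null p-values, so sorting once and using binary search keeps the lookup cheap.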
Batch effects are important to account for in studies of RNA-seq. Here, we did not study their effect because it was beyond the scope of our investigation. Van Rooij et al. (30), in their benchmarking study focusing on replication across cohorts, compared a few approaches for adjusting for technical covariates, including estimating and adjusting for latent confounders (38). They concluded that inclusion of more technical adjusting covariates, including hidden confounders, increases the rate of replication between studies.

To summarize, we highlighted the problem of high false positive findings in RNA-seq data when studying the association of continuous exposure phenotypes that are highly non-normal. We developed a computationally efficient pipeline to address the false positive detection problem, and studied strategies to optimize statistical power. Our approach will be particularly useful for epidemiological studies with RNA-seq data that were not designed as disease-focused case-control studies.

Acknowledgements

This work was supported by the National Heart Lung and Blood Institute grant R35HL135818. MESA and the MESA SHARe projects are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420.
Also supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR001881, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). RNA-Seq for "NHLBI TOPMed: Multi-Ethnic Study of Atherosclerosis (MESA)" (phs001416.v1.p1) was performed at the Northwest Genomics Center (HHSN268201600032I). Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering, was provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination was provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed.

Author contributions

TS conceptualized and drafted the manuscript and supervised the analysis. NZ performed all statistical analysis and data visualization and developed the R package and R shiny app. DN and JS performed RNA sequencing. FA and KA processed the sequenced RNA to generate the MESA RNA-seq dataset.
PD, YL, SSR, JIR, and KDT designed the RNA-seq study in MESA. SR designed and supervised the MESA sleep ancillary study. All authors critically reviewed and approved the manuscript.

References

1. Zhai W, Yao XD, Xu YF, Peng B, Zhang HM, et al. 2014. Transcriptome profiling of prostate tumor and matched normal samples by RNA-Seq. Eur Rev Med Pharmacol Sci 18:1354-60
2. Peng L, Bian XW, Li DK, Xu C, Wang GM, et al. 2015. Large-scale RNA-Seq Transcriptome Analysis of 4043 Cancers and 548 Normal Tissue Controls across 12 TCGA Cancer Types. Sci Rep 5:13413
3. Kim WJ, Lim JH, Lee JS, Lee SD, Kim JH, Oh YM. 2015. Comprehensive Analysis of Transcriptome Sequencing Data in the Lung Tissues of COPD Subjects. Int J Genomics 2015:206937
4. Klambauer G, Unterthiner T, Hochreiter S. 2013. DEXUS: identifying differential expression in RNA-Seq studies with unknown conditions. Nucleic Acids Res 41:e198
5. Auer PL, Doerge RW. 2010. Statistical design and analysis of RNA sequencing data. Genetics 185:405-16
6. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621-8
7. Law CW, Alhamdoosh M, Su S, Dong X, Tian L, et al. 2016. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research 5
8. Liu R, Holik AZ, Su S, Jansz N, Chen K, et al. 2015. Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res 43:e97
9. Love MI, Huber W, Anders S. 2014.
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:550
10. Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. 2017. Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods 14:687-90
11. Wolf JBW. 2013. Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. Molecular Ecology Resources 13:559-72
12. Kerr MK, Churchill GA. 2001. Statistical design and the analysis of gene expression microarray data. Genetical Research 77:123-8
13. Durbin BP, Hardin JS, Hawkins DM, Rocke DM. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18:S105-S10
14. Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, et al. 2013. Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge. PLoS ONE 8:e68141
15. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, et al. 2016. A survey of best practices for RNA-seq data analysis. Genome Biol 17:13
16. Costa-Silva J, Domingues D, Lopes FM. 2017. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS ONE 12
17. Ge SX, Son EW, Yao R. 2018. iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinformatics 19:534
18. Hrdlickova R, Toloue M, Tian B. 2017. RNA-Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews: RNA 8:e1364
19. Li WV, Li JJ. 2018.
Modeling and analysis of RNA-seq data: a review from a statistical perspective. Quantitative Biology 6:195-209
20. Robinson MD, McCarthy DJ, Smyth GK. 2009. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139-40
21. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, et al. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47
22. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, et al. 2012. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics 14:671-83
23. Robinson MD, Oshlack A. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:R25
24. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Nature Precedings
25. Bild DE, Bluemke DA, Burke GL, Detrano R, Diez Roux AV, et al. 2002. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am J Epidemiol 156:871-81
26. Chen X, Wang R, Zee P, Lutsey PL, Javaheri S, et al. 2015. Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep 38:877-88
27. Storey J, Bass A, Dabney A, Robinson D. 2019. qvalue: Q-value estimation for false discovery rate control. R package version 2.18.0
28. van der Laan MJ, Hubbard AE. 2006. Quantile-function based null distribution in resampling based multiple testing. Stat Appl Genet Mol Biol 5:Article14
29. Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B 57:289-300
30. van Rooij J, Mandaviya PR, Claringbould A, Felix JF, van Dongen J, et al. 2019. Evaluation of commonly used analysis strategies for epigenome- and transcriptome-wide association studies through replication of large-scale population studies.
Genome Biology 20:235
31. Anderson MJ, Legendre P. 1999. An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. Journal of Statistical Computation and Simulation 62:271-303
32. Werft W, Benner A. 2010. glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models. The R Journal 2:39
33. Yang H, Churchill G. 2006. Estimating p-values in small microarray experiments. Bioinformatics 23:38-43
34. Storey JD, Tibshirani R. 2003. SAM Thresholding and False Discovery Rates for Detecting Differential Gene Expression in DNA Microarrays. In The Analysis of Gene Expression Data: Methods and Software, ed. G Parmigiani, ES Garrett, RA Irizarry, SL Zeger:272-90. New York, NY: Springer New York
35. Fan J, Chen Y, Chan HM, Tam PKH, Ren Y. 2005. Removing intensity effects and identifying significant genes for Affymetrix arrays in macrophage migration inhibitory factor-suppressed neuroblastoma cells. Proceedings of the National Academy of Sciences of the United States of America 102:17751
36. van Iterson M, van Zwet EW, Heijmans BT, the BIOS Consortium. 2017. Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biology 18:19
37. Efron B. 2004. Large-Scale Simultaneous Hypothesis Testing. Journal of the American Statistical Association 99:96-104
38. Wang J, Zhao Q, Hastie T, Owen AB. 2017. Confounder adjustment in multiple hypothesis testing.
Annals of Statistics 45:1863-94

Figure legends

Figure 1: Distributions of the three sleep-disordered breathing exposure phenotypes used as case studies in this manuscript. The left column provides the empirical density functions of the raw phenotypes, the right column provides the empirical density function of their residuals after regressing on age, sex, BMI, self-reported race/ethnic group, and study center. AvgO2: average oxyhemoglobin saturation during sleep. MinO2: minimum oxyhemoglobin saturation during sleep. AHI: Apnea Hypopnea Index.

Figure 2: Average number of false positive transcript associations detected by various methods used in simulation study 1 and computed over 100 repetitions. We used the residual permutation approach to mimic the MESA data set with the sleep phenotype MinO2. The methods reported here are linear regression (applied on log2-transformed transcript counts, with zero values replaced with SubHalfMin); DESeq2, edgeR, and limma-voom.
The left column provides results when using raw p-values, the middle corresponds to use of quantile empirical p-values, and the right corresponds to Storey empirical p-values. We report false positive detections as those with Benjamini-Hochberg (BH) False Discovery Rate (FDR) adjusted p-value < 0.05, Local FDR < 0.05 (qvalue package) and with Holm Family-Wise Error Rate (FWER) adjusted p-values < 0.05. Error bars reflect the mean ± standard error. In Supplementary Figures S3-S5, we provide complete results, including for additional sleep phenotypes: AHI and AvgO2.

Figure 3: Estimated power for detecting a transcript simulated as associated with the three sleep traits when using Storey empirical p-values, where an association is determined significant if its BH FDR-adjusted p-value is < 0.05. The transcripts were randomly selected out of available transcripts (after filtering of transcripts with 50% or higher zero counts across the sample). We compared linear regression, DESeq2, and edgeR in transcriptome-wide association analysis for each of the sleep phenotypes. For each transcript used in simulations, we show both power and the box plot of its distribution in the sample after Median normalization.
Figure 4: Analysis pipeline for transcriptome-wide association analysis of continuous exposure phenotypes. The raw data is normalized using library-size normalization, followed by filtering of transcripts, transformation of transcript expression values, then single-transcript testing to obtain raw p-values. In parallel, residual permutation is applied under the null 100 times, and p-values are used to construct an empirical p-value distribution under the null, and to compute empirical p-values. Finally, the quantile empirical p-values are corrected for multiple testing.

10_1101-2021_02_12_431018 ---- HaVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences.

Authors and institutional addresses

Phuoc Truong Nguyen 1, Ilya Plyusnin 2,3, Tarja Sironen 1,3, Olli Vapalahti 1,3,4, Ravi Kant †1,3, Teemu Smura †1,4

1. Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
2. Institute of Biotechnology, University of Helsinki, Helsinki, Finland
3. Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
4.
Department of Virology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
†Correspondence to: Ravi.Kant@helsinki.fi or Teemu.Smura@helsinki.fi

bioRxiv preprint, this version posted February 13, 2021. doi: https://doi.org/10.1101/2021.02.12.431018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

Background: SARS-CoV-2 related research has increased in importance worldwide since December 2019. Several new variants of SARS-CoV-2 have emerged globally, of which the most notable and concerning currently are the UK variant B.1.1.7, the South African variant B.1.351 and the Brazilian variant P.1. Detecting and monitoring novel variants is essential in SARS-CoV-2 surveillance. While there are several tools for assembling virus genomes and performing lineage analyses to investigate SARS-CoV-2, each is limited to performing a single function or a few functions separately.

Results: Due to the lack of publicly available pipelines that can perform fast reference-based assemblies on raw SARS-CoV-2 sequences in addition to identifying lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called HaVoC (Helsinki university Analyzer for Variants Of Concern). HaVoC can reference assemble raw sequence reads and assign the corresponding lineages to SARS-CoV-2 sequences.

Conclusions: HaVoC is a pipeline utilizing several bioinformatic tools to perform multiple necessary analyses for investigating genetic variance among SARS-CoV-2 samples.
The pipeline is particularly useful for those who need a more accessible and fast tool to detect and monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. HaVoC is currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. HaVoC user manual and source code are available at https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc, respectively.

Keywords

SARS-CoV-2, variant detection, reference assembly, lineage identification, coronavirus, sequence analysis.

Background

Emerging pathogens pose a continuous threat to mankind, as exemplified by the Ebola virus epidemic in West Africa in 2014 [1], the Zika virus pandemic in 2015 [2], and the ongoing Coronavirus disease 2019 (COVID-19) pandemic. These viruses are zoonotic, i.e. have crossed species barriers from animals to humans, like the majority of emerging human pathogens [3, 4]. The likelihood of this host switching is enhanced by several factors, e.g. global movement of people and animals, environmental changes, increased proximity of humans, wildlife and livestock, and population expansion into new environments [5].

The mutation and evolution rate of RNA viruses is considerably higher than that of their hosts, which is advantageous for viral adaptation. Mutations in the viral genome are most of the time silent or, if affecting phenotype, related to attenuation, although mutations can also lead to more pathogenic strains.
A new virus variant may have one or more mutations that separate it from the wild-type virus already circulating among the general population.

Coronaviruses (family Coronaviridae) are enveloped single-stranded RNA viruses, which cause respiratory, enteric, hepatic, and neurological diseases of a broad spectrum of severity among different animals and humans. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a novel evolutionary divergent virus responsible for the present pandemic, has devastated societies and economies globally. SARS-CoV-2 has already infected more than 100 million people in 221 countries, causing over 2.2 million global deaths as of 3rd February 2021 [6]. In autumn 2020, a new variant of SARS-CoV-2 known as 20B/501Y.V1 (B.1.1.7) was detected in south-eastern England, Wales, and Scotland [7]. This variant has since spread globally to more than 80 countries. The variant has acquired 23 mutations: 13 nonsynonymous mutations, four amino acid deletions, and six synonymous mutations, making the virus more transmissible [8]. Another variant, 20C/501Y.V2 (B.1.351), was detected in South Africa and is genetically distant from the UK 20B/501Y.V1 variant [9]. This South African variant, with its two mutations in the receptor-binding motif that mainly forms the interface with the human ACE2 receptor, has also spread widely and circulates globally. It has been noticed that some existing vaccines against SARS-CoV-2 are less effective against the 20C/501Y.V2 variant [10–12]. A third variant being closely monitored is P.1, detected first in Brazil [13]. Interestingly, all three of these variants have a mutation in the receptor binding domain (RBD) of the spike protein at position 501, where the amino acid asparagine (N) has been replaced with tyrosine (Y), enabling specific PCR to detect the N501Y mutation [14].
As more transmissible coronavirus variants are circulating worldwide, the role of researchers and technology specialists in controlling the pandemic has received more emphasis. The surveillance of virus variants by sequencing the SARS-CoV-2 genomes would provide a fast way to monitor variants and their spread; however, there are only a few publicly available methods for quick reference-based consensus assembly and lineage assignment for SARS-CoV-2 samples. For this purpose, we have developed a simple pipeline, called HaVoC (Helsinki university Analyzer for Variants Of Concern). This will provide the end user a quick and accessible method of variant identification and monitoring. The pipeline was developed to be run on Unix/Linux operating systems, and thus can also be used on remote servers, e.g. CSC – IT Center for Science, Finland.

Implementation

HaVoC consists of a single shell script, which performs reference-based consensus assemblies on query SARS-CoV-2 FASTQ sequence libraries and assigns lineages to them individually in succession. The script can be started by typing the following line into your command line terminal:

    sh HaVoC.sh [FASTQ directory]

The computing of consensus sequences starts with the tool detecting FASTQ files generated via paired end sequencing in a given input directory and checking that each query FASTQ file has its corresponding counterpart, i.e.
mate file. The names of the files are modified to be more concise, e.g. Query-Seq:1_X123_Y000_R1_000.fastq.gz to Query-Seq:1_R1.fastq.gz. The pipeline accepts FASTQ files both in gzipped and uncompressed format.

For the analyses, the user can choose which bioinformatic tools to utilize. This can be done by setting the corresponding option (tools_prepro, tools_aligner and tools_sam) within the options section at the beginning of the script file. For example, if the user wants to deploy Trimmomatic to pre-process FASTQ files, the following line can be changed as follows:

From
    tools_prepro="fastp"
To
    tools_prepro="trimmomatic"

Other options include the number of threads, the minimum coverage below which a region is masked (min_coverage), and whether to run pangolin to assign lineages to the consensus genome (run_pangolin). An additional option allows HaVoC to be run on the CSC servers (run_in_csc).

The pre-alignment quality control, e.g. removing and trimming low quality reads and bases and removing adapter sequences, can be done with either fastp [15] or Trimmomatic [16]. The reads are then aligned to a reference genome of SARS-CoV-2 isolate Wuhan-Hu-1 (GenBank accession code: NC_045512.2) with BWA-MEM [17] or Bowtie 2 [18]. The resulting SAM and BAM files are processed (including sorting, filling in mate coordinates, marking duplicate alignments, and indexing reads) with Sambamba [19] or Samtools [20] and the low coverage regions are masked with BEDtools [21].
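The low-coverage step above can be illustrated with a small sketch. It is not taken from the HaVoC script: the function names are hypothetical, and the per-base depths are assumed to be already available as a list. It only shows how positions with depth below min_coverage collapse into BED-style intervals (0-based, half-open), which is the form a masking tool expects.

```python
def low_coverage_intervals(depths, min_coverage=30):
    """Return 0-based, half-open (start, end) runs where depth < min_coverage.

    `depths` is an assumed input: one read depth per reference position.
    """
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_coverage:
            if start is None:
                start = pos  # open a new low-coverage run
        elif start is not None:
            intervals.append((start, pos))  # close the current run
            start = None
    if start is not None:
        intervals.append((start, len(depths)))  # run reaches the end
    return intervals


def to_bed_lines(chrom, intervals):
    """Format intervals as tab-separated BED lines (hypothetical helper)."""
    return [f"{chrom}\t{start}\t{end}" for start, end in intervals]


if __name__ == "__main__":
    example_depths = [50, 10, 5, 40, 40, 0]
    runs = low_coverage_intervals(example_depths)
    print("\n".join(to_bed_lines("NC_045512.2", runs)))
```

The threshold of 30 mirrors the default mentioned in the text; any other value can be passed through min_coverage.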
After masking, variant calling is done with LoFreq [22] before computing the consensus sequence with BCFtools, part of the Samtools project [20]. Finally, the consensus sequence is analyzed with pangolin [23] to assign a lineage. The whole process is depicted in Fig. 1.

Fig. 1 Flowchart describing processes and steps performed by the HaVoC pipeline. The pipeline constructs consensus sequences from all FASTQ files in an input directory and then compares the resulting sequences to other established SARS-CoV-2 genomes to assign them the most likely lineages. The pipeline requires a FASTA file of adapter sequences for FASTQ pre-processing and a reference genome of SARS-CoV-2 in a separate FASTA file. The adapter file is not required when running the pipeline with the fastp option. Input files are highlighted in green and the outputs in red.

Usage example

We are going to demonstrate a common use case for HaVoC with FASTQ files containing reads for SARS-CoV-2 sequences, provided by the Viral zoonoses research unit at University of Helsinki, Finland.
The test files within the Example_FASTQs folder contain paired-end FASTQ files for the UK variant (UK-variant-1) and the South African variant (S-Africa-variant-1). To analyse these example files, the aforementioned command needs to be run as follows:

    sh HaVoC.sh Example_FASTQs

Results

The FASTQ files are processed and analyzed with the default options utilizing faster bioinformatic tools (fastp, BWA-MEM and Sambamba) in ca. 2–4 minutes, depending on the performance of the platform (local or server). After HaVoC has finished the analyses, the FASTQ files are moved to their respective result folders within the FASTQ directory. Each result folder contains a FASTA file for the consensus sequence (e.g. UK-variant-1_consensus.fa) and a CSV file with the lineage information produced by pangolin (e.g. UK-variant-1_pangolin_lineage.csv). In addition to these main result files, each directory contains the original FASTQ files, BAM files (original, indexed and sorted), variant call files (VCF) with mutation data, the BED file used for masking regions, and fastp report files with the results of FASTQ processing. The resulting directory and file structure with the example files will look as follows:

Example_FASTQs/
    UK-variant-1/
        UK-variant-1.bam
        UK-variant-1_R1.fastq.gz
        UK-variant-1_R2.fastq.gz
        UK-variant-1_consensus.fa
        UK-variant-1_fixmate.bam
        UK-variant-1_indel.bam
        UK-variant-1_indel.vcf
        UK-variant-1_indel_flt.vcf
        UK-variant-1_lowcovmask.bed
        UK-variant-1_markdup.bam
        UK-variant-1_namesort.bam
        UK-variant-1_pangolin_lineage.csv
        UK-variant-1_sorted.bam
        fastp.html
        fastp.json
    S-Africa-variant-1/
        S-Africa-variant-1.bam
        S-Africa-variant-1_R1.fastq.gz
        S-Africa-variant-1_R2.fastq.gz
        S-Africa-variant-1_consensus.fa
        S-Africa-variant-1_fixmate.bam
        S-Africa-variant-1_indel.bam
        S-Africa-variant-1_indel.vcf
        S-Africa-variant-1_indel_flt.vcf
        S-Africa-variant-1_lowcovmask.bed
        S-Africa-variant-1_markdup.bam
        S-Africa-variant-1_namesort.bam
        S-Africa-variant-1_pangolin_lineage.csv
        S-Africa-variant-1_sorted.bam
        fastp.html
        fastp.json

Each of the example UK variants should have been categorized as B.1.1.7 and the South African variants as B.1.351 (with pangoLEARN release 2021-02-06). It is important to note, however, that as more sequences are uploaded and the pangolin lineage nomenclature is updated, the assigned lineages may differ from the expected ones described in this paper. Regions with low coverage (under 30 with the default setting) are marked with the letter N during masking and represent gaps in the final consensus sequences.
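To make the effect of N-masking concrete, the following is a minimal sketch under stated assumptions. It is not the HaVoC implementation: the helper names are hypothetical, and the intervals are assumed to be 0-based, half-open, as in a BED file. It shows how masked intervals turn into N characters in a consensus string and how the masked fraction could be summarised.

```python
def mask_sequence(sequence, intervals, mask_char="N"):
    """Replace bases inside each (start, end) interval with mask_char.

    Intervals are assumed to be 0-based and half-open (BED convention);
    parts of an interval outside the sequence are ignored.
    """
    bases = list(sequence)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


def masked_fraction(sequence, mask_char="N"):
    """Fraction of positions that are masked (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sequence.upper().count(mask_char) / len(sequence)


if __name__ == "__main__":
    consensus = mask_sequence("ACGTACGT", [(1, 3), (6, 8)])
    print(consensus, masked_fraction(consensus))
```

A high masked fraction in a consensus file is a hint that the lineage assignment for that sample rests on limited sequence and may deserve a closer look.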
HaVoC is comparable to alternative combinations of tools, e.g. Jovian and pangolin, in both speed and accuracy. These tools, however, operate separately, and as of publishing, there is no single public tool that can perform both a reference-based consensus assembly and a lineage identification in an easily accessible manner.

Conclusions

Early detection and understanding of the potential impact of emerging variants of SARS-CoV-2 is of primary importance and can assist in more efficient surveillance and control of the disease. The likelihood of emergence of novel SARS-CoV-2 variants of concern is increased and accelerated by the high mutation rates typical in RNA viruses and the growing number of transmissions and infections both locally and globally.

With the rising number of variants detected worldwide and with many of them associated with increased transmissibility and lower vaccine efficacy, there is an emerging need for fast, efficient and reliable pipelines to help detect, identify and trace SARS-CoV-2 lineages. These pipelines should in addition be accessible to researchers who may not be familiar with utilizing complex bioinformatic tools or scripting pipelines.

Due to these challenges, we have developed HaVoC, a simple, reliable and user-friendly pipeline, which can be simply downloaded from our repository and run without being installed. All its dependencies can be installed via existing package managers, of which we recommend Bioconda.
HaVoC could help in the current pandemic situation by detecting variants of concern in the sequencing centers and public health or other organisations currently tracing variants of concern worldwide. HaVoC is currently utilized for detecting and tracing SARS-CoV-2 variants of concern, mainly B.1.1.7, B.1.351 and P.1, in Finland.

Availability and requirements

Project name: HaVoC (Helsinki university Analyzer for Variants Of Concern)
Project home page: https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc
Operating system(s): Linux, Mac
Programming language: Shell script
Other requirements: Trimmomatic or fastp, BWA-MEM or Bowtie 2, Samtools, BEDtools, BCFtools, LoFreq and pangolin.
License: GNU GPL
Any restrictions to use by non-academics: license needed

List of abbreviations

SARS-CoV-2 - Severe acute respiratory syndrome coronavirus 2
COVID-19 - Coronavirus disease 2019
HaVoC - Helsinki university Analyzer for Variants Of Concern

References

1. Dixon MG, Schafer IJ, Centers for Disease Control and Prevention (CDC). Ebola viral disease outbreak--West Africa, 2014. MMWR Morb Mortal Wkly Rep. 2014;63:548–51.
2. Kindhauser MK, Allen T, Frank V, Santhana RS, Dye C. Zika: the origin and spread of a mosquito-borne virus. Bull World Health Organ. 2016;94:675-686C. doi:10.2471/BLT.16.171082.
3. Taylor LH, Latham SM, Woolhouse ME. Risk factors for human disease emergence.
Philos Trans R Soc Lond B Biol Sci. 2001;356:983–9. doi:10.1098/rstb.2001.0888.
4. Woolhouse MEJ, Gowtage-Sequeria S. Host range and emerging and reemerging pathogens. Emerging Infect Dis. 2005;11:1842–7. doi:10.3201/eid1112.050997.
5. Morens DM, Fauci AS. Emerging Pandemic Diseases: How We Got to COVID-19. Cell. 2020;182:1077–92. doi:10.1016/j.cell.2020.08.021.
6. Worldometer - COVID-19 Virus Pandemic. https://www.worldometers.info/coronavirus/. Accessed 3 Feb 2021.
7. Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, et al. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological. 2020. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563. Accessed 2 Feb 2021.
8. Leung K, Shum MH, Leung GM, Lam TT, Wu JT. Early transmissibility assessment of the N501Y mutant strains of SARS-CoV-2 in the United Kingdom, October to November 2020. Euro Surveill. 2021;26. doi:10.2807/1560-7917.ES.2020.26.1.2002106.
9. Tegally H, Wilkinson E, Giovanetti M, Iranzadeh A, Fonseca V, Giandhari J, et al. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv. 2020. doi:10.1101/2020.12.21.20248640.
10. Mahase E.
Covid-19: Novavax vaccine efficacy is 86% against UK variant and 60% against South African variant. BMJ. 2021;:n296. doi:10.1136/bmj.n296.
11. Kupferschmidt K. Vaccine 2.0: Moderna and other companies plan tweaks that would protect against new coronavirus mutations. Science. 2021. doi:10.1126/science.abg7691.
12. Edwards E. J&J says vaccine effective against Covid, though weaker against South Africa variant. NBC News. 2021. https://www.nbcnews.com/health/health-news/j-j-vaccine-effective-against-covid-though-weaker-against-south-n1255400. Accessed 10 Feb 2021.
13. Faria NR, Claro IM, Candido D, Franco LAM, Andrade PS, Coletti TM, et al. Genomic characterisation of an emergent SARS-CoV-2 lineage in Manaus: preliminary findings. Virological. 2021. https://virological.org/t/genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-manaus-preliminary-findings/586. Accessed 3 Feb 2021.
14. Centers for Disease Control and Prevention (CDC). Emerging SARS-CoV-2 Variants. https://www.cdc.gov/coronavirus/2019-ncov/more/science-and-research/scientific-brief-emerging-variants.html. Accessed 12 Feb 2021.
15. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.
16. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20. doi:10.1093/bioinformatics/btu170.
17. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013.
18. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9. doi:10.1038/nmeth.1923.
19. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–4. doi:10.1093/bioinformatics/btv098.
20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.
The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9. doi:10.1093/bioinformatics/btp352.
21. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. doi:10.1093/bioinformatics/btq033.
22. Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40:11189–201. doi:10.1093/nar/gks918.
23. pangolin. https://github.com/cov-lineages/pangolin. Accessed 12 Feb 2021.

Declarations

Ethics approval and consent to participate
Not Applicable.

Consent for publication
Not Applicable.

Availability of data and materials
Publicly available at https://bitbucket.org/auto_cov_pipeline/havoc.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was supported by the Academy of Finland (grant number 336490), VEO - European Union’s Horizon 2020 (grant number 874735) and the Jane and Aatos Erkko Foundation.
Authors' contributions
Conceptualization: PTN IP RK TS TSi OV. Development: PTN IP RK TS. Testing/Formal analysis: PTN IP RK TS. Funding acquisition: TSi OV. Investigation: PTN IP RK TS. Methodology: PTN IP RK TS. Project administration: RK TS OV. Resources: PTN RK IP TS TSi OV. Validation: PTN IP RK TS. Writing – original draft: PTN RK. Writing – review & editing: IP TS TSi OV.

Acknowledgements
None.

Authors' information
None.

10_1101-2021_02_13_429885 ---- A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing

Jacob Househam, Barts Cancer Institute, Queen Mary University of London, UK
William CH Cross, UCL Cancer Institute, University College London, UK (★)
Giulio Caravagna, Department of Mathematics and Geosciences, University of Trieste, Italy (★)
(★) Joint last authors. (★) Corresponding: (GC) gcaravagna@units.it.

Abstract. Cancer is a global health issue that places enormous demands on healthcare systems. Basic research, the development of targeted treatments, and the utility of DNA sequencing in clinical settings have been significantly improved with the introduction of whole genome sequencing.
However, the broad applications of this technology come with complications. To date there has been very little standardisation in how data quality is assessed, leading to inconsistencies in analyses and disparate conclusions. Manual checking and complex consensus calling strategies often do not scale to large sample numbers, which leads to procedural bottlenecks. To address this issue, we present a quality control method that integrates point mutations, copy numbers, and other metrics into a single quantitative score. We demonstrate its power on 1,065 whole-genomes from a large-scale pan-cancer cohort, and on multi-region data of two colorectal cancer patients. We highlight how our approach significantly improves the generation of cancer mutation data, providing visualisations for cross-referencing with other analyses. Our approach is fully automated, designed to work downstream of any bioinformatic pipeline, and can automate tool parameterization, paving the way for fast computational assessment of data quality in the era of whole genome sequencing.

bioRxiv preprint, this version posted February 13, 2021; doi: https://doi.org/10.1101/2021.02.13.429885. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Cancer remains an unsolved problem, and a key factor is that tumours develop as heterogeneous cellular populations (Greaves and Maley 2012; McGranahan and Swanton 2017, 2015). Cancer genomes can harbour multiple types of mutations compared to healthy cells (Macintyre et al. 2018; Martincorena et al. 2018, 2015; Nik-Zainal et al. 2012), and many of these events contribute to the pathogenesis of the disease and to therapeutic resistance. A popular design of studies intending to
understand tumour development involves collecting tumour and matched-normal biopsies, and generating so-called “bulk” DNA sequencing data for both (Barnell et al. 2019). Using bioinformatic tools to cross reference the normal genome against the aberrant one, the mutations found in the tumour sample, and their heterogeneity, can be derived and used in other analyses. These analyses include, but are not limited to, driver mutation identification (Bailey et al. 2018; Gonzalez-Perez et al. 2013), which aims to discern the key aberrations that cause a tumour to grow; patient clustering, which aims to identify treatment groups with similar biological characteristics; and evolutionary inference (Gerstung et al. 2020; Nik-Zainal et al. 2012; Caravagna et al. 2020), which informs us how a particular tumour developed from normal cells.

There are several types of mutations that we can retrieve from DNA sequencing data (Campbell et al. 2020). Broadly these can be categorized as single nucleotide variants (SNVs), copy number alterations (CNAs) and other more complex changes such as structural variants (Li et al. 2020). All types of mutations can drive tumour progression, and are therefore important entities to study (Kent and Green 2017; Levine, Jenkins, and Copeland 2019). Luckily, the steady drop in sequencing costs is fueling the creation of large amounts of data, which are becoming increasingly available for researchers to access through public databases.
Notably, we are entering the era of high-resolution whole-genome sequencing (WGS), a technology that can read out the majority of a tumour genome, providing major improvements over whole-exome counterparts.

Generating some of these data, however, poses challenges. While SNVs are the simplest type of mutations to detect using bioinformatic analysis and perhaps have the most well established supporting tools (Li et al. 2020), CNAs are particularly difficult to call since the baseline ploidy of the tumour (i.e., the number of chromosome copies) is usually unknown and has to be inferred from the data. CNAs are important types of cancer mutations; large-scale gain and loss of chromosome arms or sections of arms can confer tumour cells with large-scale phenotypic changes, and are often important clinical targets (Gerstung et al. 2020; Watkins et al. 2020).

SNVs and CNAs are intertwined mutation groups. They can overlap within a tumour cell’s genome, meaning the number of copies of an SNV can be amplified or indeed reduced by CNAs. This depends on the ploidy of the genome regions overlapping with the variants. For instance, for a clonal (meaning present in every cell of the tumour sample) heterozygous SNV in a diploid tumour genome the expected variant allele frequency (VAF) is 50% (i.e., half of the reads from tumour cells will harbour the SNV). Alternatively, if each chromosome is present in three copies (triploid), the expected VAF is 33% if the SNV occurred after the amplification, or 66% if the SNV is on the amplified chromosome and occurred before the amplification. The theoretical
frequencies are observed with a Binomial noise model that depends on the depth of sequencing and the actual VAF (Nik-Zainal et al. 2012; Caravagna et al. 2020). We note that these VAFs hold for pure bulk tumour samples (100% tumour cells). Realistically, most bulk samples contain normal cells, the percentage of which shifts these theoretical frequencies towards lower values. These ideas are leveraged by methods that seek to compute the Cancer Cell Fractions (CCFs) of the tumour, i.e., a normalisation of the observed tumour VAF for the CNA, the number of copies of a mutation (mutation multiplicity) and tumour purity (Nik-Zainal et al. 2012).

Many bioinformatics pipelines are designed to start from a BAM formatted input file and, following variant calling, extract the VAF of mutations while calling CNAs in parallel (Boeva et al. 2011; Cmero et al. 2020; Zaccaria and Raphael 2020; Van Loo et al. 2010). These analyses are nearly always decoupled, and can return inconsistent variant calls; i.e., CNAs and purity that mismatch the empirical VAF from the BAMs.
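A mismatch between CNA calls, purity and observed VAF can be illustrated with a short numerical sketch. It is not CNAqc code: the function names, the two-standard-deviation tolerance and the assumption of a diploid normal contaminant are illustrative choices. It uses the standard mixture expression for the expected VAF of a mutation present in a given number of copies, together with the Binomial standard deviation at a given sequencing depth.

```python
import math


def expected_vaf(multiplicity, ploidy, purity, ccf=1.0):
    """Expected VAF of a mutation in a tumour/normal mixture.

    Assumes normal cells are diploid and do not carry the mutation.
    multiplicity: copies of the mutation per tumour cell
    ploidy: total copies of the segment in tumour cells
    purity: fraction of tumour cells in the sample (0 < purity <= 1)
    ccf: fraction of tumour cells carrying the mutation (1.0 = clonal)
    """
    mutated = multiplicity * purity * ccf
    total = purity * ploidy + 2.0 * (1.0 - purity)
    return mutated / total


def vaf_tolerance(vaf, depth, n_sd=2.0):
    """n_sd Binomial standard deviations of a VAF estimated at `depth` reads."""
    return n_sd * math.sqrt(vaf * (1.0 - vaf) / depth)


def is_consistent(observed_vaf, multiplicity, ploidy, purity, depth):
    """True if the observed VAF lies within tolerance of the expected one."""
    expected = expected_vaf(multiplicity, ploidy, purity)
    return abs(observed_vaf - expected) <= vaf_tolerance(expected, depth)


if __name__ == "__main__":
    # Pure diploid sample, heterozygous clonal SNV: expected VAF is 0.5.
    print(expected_vaf(1, 2, 1.0))
    # Same mutation at 50% purity: expected VAF drops to 0.25.
    print(expected_vaf(1, 2, 0.5))
    print(is_consistent(0.48, 1, 2, 1.0, depth=100))
```

With purity set to 1 the function reproduces the 50%, 33% and 66% values quoted above; lowering purity shifts every expected frequency downwards, which is why an incorrect purity estimate shows up as a systematic offset between expected and observed VAF.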
Since CNAs and purity are inferred through various measurements that are subject to noise (mutation allele ratios, tumour-normal depth ratios and B-allele frequencies are prime examples), they are the most likely cause of error. While in some cases these errors can be spotted and fixed by manual intervention, this process is also subject to inconsistencies in the absence of a proper statistical framework, and does not scale in studies seeking to generate datasets with millions of data points (Campbell et al. 2020; Priestley et al. 2019; Turnbull et al. 2018). The intrinsic performance of a variant caller and sequencing noise therefore massively impact CNA calling and purity inferences, propagating errors in downstream analysis that eventually lead to incorrect biological conclusions, and becoming a crucial computational bottleneck in the era of high-resolution whole-genome sequencing.

To solve these problems we developed CNAqc (Data Availability), a computational framework with a de novo statistical model to assess the conformance of expected SNVs, CNAs, and purity estimates. We strived to make the tool as simple to implement as possible, maximising compatibility across differing pipelines. CNAqc computes a quantitative quality check (QC) score for the overall agreement of the calls, which can be used to tune the parameters of callers (e.g., decrease purity or increase ploidy), or select among multiple CNA profiles (e.g., tetraploid versus diploid tumours) until a fit is achieved. In CNAqc we also integrate these measures to determine CCF values (Dentro, Wedge, and Van Loo 2017). CNAqc is implemented as a highly optimised R package that can be used downstream of any cancer mutation calling pipeline.
It can be run on WGS data, and can automatically compute a QC score in a matter of seconds, which is an extremely useful feature for large-scale genomics consortia that analyse many samples per day. To demonstrate the tool we analysed 11 bulk WGS datasets from two multi-region colorectal cancers, and 1065 high-quality whole genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort (Campbell et al. 2020).

Results

The CNAqc framework

CNAqc can perform different types of operations on CNAs and somatic mutation calls obtained from bulk WGS. In what follows, we will refer explicitly to SNVs as the main type of mutation used, but in principle other types of mutations, such as insertions or deletions, also apply. The package supports the most common CNA copy states found in cancers: heterozygous normal states (1:1 chromosome complement), loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) form, trisomy (2:1) or tetrasomy (2:2) gains.
The tool also works with exome data, but the reduced mutational burden can, in general, lower the reliability of the QC score (Supplementary Figure S1). Many metrics output by CNAqc are derived from the link between copy-state profiles (i.e., the copies of the major and minor alleles, which sum up to the ploidy of a segment) and allele frequencies that are explicit from read counts. Combinatorial equations and frequency spectrum analysis can quantitatively determine if CNAs and purity are consistent with the VAF distribution (Online methods). The resulting QC score also suggests “corrections” to automatically fine-tune and repeat CNA calling runs. This works for tools that use either Bayesian priors or point estimates of the parameters.

The key equations for a somatic mutation link its VAF $v$ and CCF $c$ to sample purity $\pi$, tumour ploidy $p$, and $m \in \{1, 2\}$, the number of copies of a mutation (Figure 1a). Effectively, for complex 2:0, 2:1 and 2:2 copy states, $m$ phases mutations that were acquired before or after the copy number event (Figure 1b). We remark that we observe $v$, and infer $\pi$, $p$ and $m$, finally deriving $c$, which is difficult to estimate (Figure 1c). In CNAqc we use the following formula for VAF (Figure 1d)

$$ v = \dfrac{m \pi c}{2(1-\pi) + \pi p} $$

and CCF

$$ c = \dfrac{v[(p-2)\pi + 2]}{m\pi} \, . $$
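The relation between VAF $v$, CCF $c$, purity $\pi$, ploidy $p$ and multiplicity $m$ can be checked numerically. The following is a minimal Python sketch, not the CNAqc R implementation: the function names `expected_vaf` and `ccf_from_vaf` are hypothetical, and the VAF expression is assumed to be the algebraic inverse of the CCF relation $c = v[(p-2)\pi + 2]/(m\pi)$.

```python
def expected_vaf(m: int, purity: float, ploidy: int, ccf: float = 1.0) -> float:
    """Expected VAF of a mutation present in m copies at cancer cell fraction ccf.

    Assumed form: v = m * pi * c / (2 * (1 - pi) + pi * p).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return m * purity * ccf / (2.0 * (1.0 - purity) + purity * ploidy)


def ccf_from_vaf(vaf: float, m: int, purity: float, ploidy: int) -> float:
    """CCF implied by an observed VAF: c = v * ((p - 2) * pi + 2) / (m * pi)."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return vaf * ((ploidy - 2.0) * purity + 2.0) / (m * purity)


if __name__ == "__main__":
    # Clonal heterozygous mutation in a diploid (1:1) segment of a pure tumour.
    print(expected_vaf(m=1, purity=1.0, ploidy=2))
    # Same mutation at 80% purity: normal cells pull the VAF below 0.5,
    # and converting back recovers a CCF of about 1.
    v = expected_vaf(m=1, purity=0.8, ploidy=2)
    print(v, ccf_from_vaf(v, m=1, purity=0.8, ploidy=2))
```

The two functions are inverses of each other for fixed $m$, $\pi$ and $p$, which is why an error in any of the inferred quantities propagates directly into the CCF.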
These formulas lead to other interesting quantities (Online methods). For instance, if we know tumour purity and the ploidy of a CNA segment, then the VAF of mutations mapped to the segment must peak at a known location $x$. The value for $x$ follows from combinatorial arguments relating all other variables (Nik-Zainal et al. 2012). From a QC perspective, the Euclidean distance between the theoretical expectation $x$ and the peaks observed from data is an error score that approaches 0 for perfect calls, and grows otherwise.

CNAqc can visualise the input segments (Figure 2a) and read counts (Figure 2b-d). Other analyses, such as CCF computation and genome fragmentation analysis, are also available, and have other visualisations (Figure 2e).

The scores of CNAqc can be used to determine a QC PASS or FAIL status for every copy state within a tumour genome, weighting different evidence from the data. One score is for the quality of CNA segmentation and tumour purity, and one for CCF values.
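The peak-based error score introduced earlier in this section can be illustrated with a small Python sketch. It is not the CNAqc implementation: the helper names `expected_peaks` and `peak_error` are hypothetical, observed peaks are passed in directly rather than estimated from the VAF density, only clonal mutations (CCF equal to 1) are considered, and the way per-peak distances are combined into a single number is an assumption made for illustration.

```python
import math
from typing import List


def expected_peaks(major: int, minor: int, purity: float) -> List[float]:
    """Expected clonal VAF peaks for a copy state major:minor at a given purity.

    Assumes multiplicities m = 1..major (m in {1, 2} for 2:0, 2:1 and 2:2)
    and v = m * purity / (2 * (1 - purity) + purity * ploidy).
    """
    ploidy = major + minor
    denom = 2.0 * (1.0 - purity) + purity * ploidy
    return [m * purity / denom for m in range(1, major + 1)]


def peak_error(expected: List[float], observed: List[float]) -> float:
    """Euclidean norm of the distances from each expected peak to its
    closest observed peak; 0 means every expected peak is matched exactly."""
    if not observed:
        raise ValueError("at least one observed peak is required")
    gaps = [min(abs(x - o) for o in observed) for x in expected]
    return math.sqrt(sum(g * g for g in gaps))


if __name__ == "__main__":
    # Hypothetical peaks read off the VAF density of a 2:1 (trisomy) segment.
    observed = [0.30, 0.60]
    for purity in (0.85, 0.50):
        score = peak_error(expected_peaks(2, 1, purity), observed)
        print(f"purity={purity:.2f} error={score:.3f}")
```

With these made-up observed peaks, the error is much smaller for a purity of 0.85 than for 0.50, which is the kind of signal that can be used to prefer one candidate purity over another.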
The first score, for CNA segmentation and purity, is based on a density-based analysis of the VAF distribution, and uses both a non-parametric kernel density and a univariate Binomial mixture to match peaks in the VAF data (Figure 3a-d). The second, for CCF values, is based on the entropy of the latent variables in a Binomial mixture model, whose components are peaked at the expected VAF. From this density we identify VAF ranges for which it is hard to estimate the mutation multiplicity, and therefore the CCF of the mutation (Figure 3e-h). To the best of our understanding, this is the only framework providing quantitative metrics for all the most widespread types of tumour mutations.

Multi-region colorectal cancer data

We have run CNAqc on previously published WGS multi-region data (Cross et al. 2018; Caravagna et al. 2020), which was collected from multiple regions of primary colorectal adenocarcinomas across two distinct patients. For all these samples we have high quality somatic mutation calls (Cross et al. 2018) that were obtained using CloneHD (Fischer et al. 2014). We have re-called CNAs with the Sequenza CNA caller (Favero et al. 2015), and sought to check the inferred copy states and tumour purity with CNAqc, along with SNVs generated by Mutect2 (Benjamin et al. 2019).
Sequenza was run using distinct parameterizations. We began with the default range proposals for purity and ploidy (see footnote 1), which we then improved in a final run following CNAqc analysis. We also forced a Sequenza fit with a constrained tetraploid genome (ploidy equal to 4), and one with low purity. All these steps could easily have been automated in a procedure that runs the caller, obtains score metrics for the solution from CNAqc, and re-runs the fits with adjusted parameters if required.

The results for one sample of patient Set7 (Cancer 7 in the original manuscript; Cross et al. 2018) are in Figure 4; the other samples for patient Set7 are in Supplementary Figures S2-S4. All samples for patient Set6 are in Supplementary Figures S5-S10. The peak detection scores produced by CNAqc invariably fail both the tetraploid and low-purity solutions, passing the others; the small adjustment suggested to the default parameters slightly improves the purity estimate, but the overall quality is high even with just default parameters (Figure 4b). The whole-genome CNA profile for this sample shows some degree of aneuploidy (Figure 4c), and it is easy with CNAqc to assess miscalled CNA segments ahead of the VAF data (Figure 4d).
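The automated procedure mentioned above (run the caller, score the solution, re-run the fit with adjusted parameters) could be organised as a simple loop. The Python sketch below is purely illustrative: `run_caller`, `qc_score` and `adjust` are hypothetical callables supplied by the user, standing in for a CNA caller and a QC step; none of them is part of the Sequenza or CNAqc interface, and the tolerance and the cap on the number of rounds are arbitrary choices.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def tune_calls(
    run_caller: Callable[[Params], dict],
    qc_score: Callable[[dict], float],
    adjust: Callable[[Params, float], Params],
    params: Params,
    tolerance: float = 0.05,
    max_rounds: int = 5,
) -> Tuple[Params, float, bool]:
    """Repeat calling and QC until |score| <= tolerance or rounds run out.

    Returns the last parameters actually tried, their score and a PASS flag.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    score = float("inf")
    for round_idx in range(max_rounds):
        calls = run_caller(params)        # hypothetical: fit CNAs and purity
        score = qc_score(calls)           # hypothetical: signed QC error
        if abs(score) <= tolerance:
            return params, score, True    # QC PASS
        if round_idx < max_rounds - 1:
            params = adjust(params, score)  # propose corrected parameters
    return params, score, False           # QC FAIL after max_rounds


if __name__ == "__main__":
    # Toy stand-ins: the "caller" echoes the purity it was given and the
    # "QC" reports the signed distance from a made-up true purity of 0.6.
    result = tune_calls(
        run_caller=lambda p: {"purity": p["purity"]},
        qc_score=lambda calls: calls["purity"] - 0.6,
        adjust=lambda p, s: {"purity": p["purity"] - s},
        params={"purity": 0.9},
    )
    print(result)
```

Keeping the caller, the scoring step and the adjustment rule as separate callables means the same loop can wrap different callers, which matches the aim of staying compatible with differing pipelines.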
The analysis of all the samples available for Set7 shows an overall CNA profile with many diploid regions and mild aneuploidy (Figure 4e), consistent with a microsatellite stable colorectal cancer (Cross et al. 2018).

Footnote 1: Technically, the default Sequenza values for ploidy reach a maximum of 7; as this is unrealistic for our cases, we limited the maximum ploidy to 5.

Large-scale pan cancer PCAWG calls

We have run CNAqc on a subset of the full PCAWG cohort, which contains thousands of samples from multiple tumour types (Campbell et al. 2020). The median coverage of this cohort is 45x, with purity ~65% (Caravagna et al. 2020); a much lower resolution than the data available for the multi-region samples discussed in the previous section. Because of this, peak detection from the VAF distribution across some of the samples would be challenged by signal quality; in practice, for genomes with complex aneuploidy and massive drops in purity and coverage the VAF distribution is unsuitable for peak detection, leading to false positives in the QC process. To avoid this and work with suitable samples, we identified n = 1065 cases adopting the following conditions: (i) the tumour type contains >20 samples, (ii) the tumour genome used for QC contains >30% of the overall SNVs in the tumour (so a substantial part of the overall mutational burden), and (iii) the purity of the sample is >60% (so the signal is suitable for peak detection). On a standard cluster CNAqc ran in less than 1 hour for these samples; notably the
completion time (per sample) on a laptop is less than 1 minute, meaning that preliminary analysis can be carried out very quickly and without large computing infrastructures. The calls in PCAWG were obtained by consensus with multiple bioinformatics tools, and for this reason we expected them to be reliable. Manual inspection of some patient data indeed showed many high-quality calls, but also highlighted a variety of interesting cases. For instance, tumours with extremely low mutational burden but high quality calls still yielded a useful report, suggesting that CNAqc can also work with the mutational burden from whole-exome sequencing (Supplementary Figure S1). For other tumours, we found high purity levels >90%, which are probably overestimated (Supplementary Figure S11), compared to others where purity is genuinely very high (Supplementary Figure S12). Overall, the scores from peak detection are reliable for the majority of the analysed samples (Figure 5a), with only a few cases requiring further checks (Figure 5b); the diploid 85% purity tumour in Figures 2 and 3 is taken from this list. The peak detection by CNAqc therefore confirms the reliability of the calls in terms of breakpoints, segment ploidy and tumour purity.

CCF computations showed a higher rate of failures with CNAqc analysis (Figure 5a). This is inevitably due to the lack of signal separability stemming from the low coverage of these samples, even for high-quality genomes.
Therefore, while peaks could be determined for these data, mutation multiplicity assessment would have required higher coverage than was available. In summary, these analyses revealed that the problem of validating CNA calls, compared to determining CCF estimates, can be approached with lower coverage and purity values using CNAqc.

Discussion

WGS is a powerful approach to detect extensive mutations that drive human cancers. Many large-scale initiatives such as PCAWG (Campbell et al. 2020), the Hartwig Medical Foundation (Priestley et al. 2019) and Genomics England (Turnbull et al. 2018) have already generated WGS data for thousands of cancer patients, with many cancer institutes converging towards these efforts. Calling mutations from WGS data requires complex bioinformatics pipelines (Barnell et al. 2019; Cmero et al. 2020; Li et al. 2020) and any downstream analysis relies upon these calls, putting the quality of the generated data under the spotlight. CNAqc offers the first principled framework to control the quality of tumour mutation calls.
The tool can analyse SNVs and more general types of nucleotide substitutions; SNVs are more reliable and depend less on alignment quality than other mutations, and therefore should be checked first. CNAqc uses a peak-detection analysis to validate CNA segments and purity, exploiting a combinatorial model for cancer alleles. Within the same framework, CNAqc also computes CCF values, highlighting mutations for which such values are uncertain. CNAqc features can be used to clean up data, automating parameter choice for virtually any caller, prioritizing good calls and selecting information for downstream analyses.

The CNAqc framework leverages the relationship between tumour VAF and ploidy. The quality of the control process itself depends on the ability to process the VAF spectrum and detect peaks. Therefore, if the VAF quality is very low because, e.g., the sample has low purity or coverage, the overall quality of the check decreases, making it more difficult to completely automate quality checking. However, for the large majority of samples, CNAqc provides a very effective and fast way to integrate quality metrics in standard pipelines.

Generating high quality calls is just a prelude to more complex analyses that interpret cancer genotypes and their history, with and without therapy (Ding et al. 2012; Landau et al. 2013; Caravagna et al. 2016; Jamal-Hanjani et al. 2017; Turajlic et al. 2018; Caravagna et al. 2018). CNAqc can pass a sample at an early stage, leaving the possibility of assessing, at a later stage, whether the quality of the data is high enough to approach specific research questions. With the ongoing implementation of large-scale sequencing efforts, CNAqc provides a good solution for modular pipelines that self-tune parameters based on quality scores.
To our knowledge, this is the first stand-alone tool which leverages the power of combining the most common types of cancer mutations (SNVs and CNAs) to automatically control the quality of WGS assays. We believe CNAqc can help reduce the burden of manual quality checking and parameter tuning.

References

Bailey, Matthew H., Collin Tokheim, Eduard Porta-Pardo, Sohini Sengupta, Denis Bertrand, Amila Weerasinghe, Antonio Colaprico, et al. 2018. “Comprehensive Characterization of Cancer Driver Genes and Mutations.” Cell 173 (2): 371–85.e18. https://doi.org/10.1016/j.cell.2018.02.060.
Barnell, Erica K., Peter Ronning, Katie M. Campbell, Kilannin Krysiak, Benjamin J. Ainscough, Lana M. Sheta, Shahil P. Pema, et al. 2019. “Standard Operating Procedure for Somatic Variant Refinement of Sequencing Data with Paired Tumor and Normal Samples.” Genetics
in Medicine: Official Journal of the American College of Medical Genetics 21 (4): 972–81. https://doi.org/10.1038/s41436-018-0278-z.
Benjamin, David, Takuto Sato, Kristian Cibulskis, Gad Getz, Chip Stewart, and Lee Lichtenstein. 2019. “Calling Somatic SNVs and Indels with Mutect2.” bioRxiv, December, 861054. https://doi.org/10.1101/861054.
Boeva, Valentina, Andrei Zinovyev, Kevin Bleakley, Jean-Philippe Vert, Isabelle Janoueix-Lerosey, Olivier Delattre, and Emmanuel Barillot. 2011. “Control-Free Calling of Copy Number Alterations in Deep-Sequencing Data Using GC-Content Normalization.” Bioinformatics 27 (2): 268–69. https://doi.org/10.1093/bioinformatics/btq635.
Campbell, Peter J., Gad Getz, Jan O. Korbel, Joshua M. Stuart, Jennifer L. Jennings, Lincoln D. Stein, Marc D. Perry, et al. 2020. “Pan-Cancer Analysis of Whole Genomes.” Nature 578 (7793): 82–93. https://doi.org/10.1038/s41586-020-1969-6.
Caravagna, Giulio, Ylenia Giarratano, Daniele Ramazzotti, Ian Tomlinson, Trevor A. Graham, Guido Sanguinetti, and Andrea Sottoriva. 2018. “Detecting Repeated Cancer Evolution from Multi-Region Tumor Sequencing Data.” Nature Methods 15 (9): 707–14. https://doi.org/10.1038/s41592-018-0108-x.
Caravagna, Giulio, Alex Graudenzi, Daniele Ramazzotti, Rebeca Sanz-Pamplona, Luca De Sano, Giancarlo Mauri, Victor Moreno, Marco Antoniotti, and Bud Mishra. 2016. “Algorithmic Methods to Infer the Evolutionary Trajectories in Cancer Progression.” Proceedings of the National Academy of Sciences of the United States of America 113 (28): E4025–34. https://doi.org/10.1073/pnas.1520213113.
Caravagna, Giulio, Timon Heide, Marc J. Williams, Luis Zapata, Daniel Nichol, Ketevan Chkhaidze, William Cross, et al. 2020. “Subclonal Reconstruction of Tumors by Using Machine Learning and Population Genetics.” Nature Genetics 52 (9): 898–907.
https://doi.org/10.1038/s41588-020-0675-5.
Cmero, Marek, Ke Yuan, Cheng Soon Ong, Jan Schröder, Niall M. Corcoran, Tony Papenfuss, Christopher M. Hovens, Florian Markowetz, and Geoff Macintyre. 2020. “Inferring Structural Variant Cancer Cell Fraction.” Nature Communications 11 (1): 730. https://doi.org/10.1038/s41467-020-14351-8.
Cortés-Ciriano, Isidro, Jake June-Koo Lee, Ruibin Xi, Dhawal Jain, Youngsook L. Jung, Lixing Yang, Dmitry Gordenin, et al. 2020. “Comprehensive Analysis of Chromothripsis in 2,658 Human Cancers Using Whole-Genome Sequencing.” Nature Genetics 52 (3): 331–41. https://doi.org/10.1038/s41588-019-0576-7.
Cross, William, Michal Kovac, Ville Mustonen, Daniel Temko, Hayley Davis, Ann-Marie Baker, Sujata Biswas, et al. 2018. “The Evolutionary Landscape of Colorectal Tumorigenesis.” Nature Ecology & Evolution 2 (10): 1661–72. https://doi.org/10.1038/s41559-018-0642-z.
Dentro, Stefan C., David C. Wedge, and Peter Van Loo. 2017. “Principles of Reconstructing the Subclonal Architecture of Cancers.” Cold Spring Harbor Perspectives in Medicine 7 (8). https://doi.org/10.1101/cshperspect.a026625.
Ding, Li, Timothy J. Ley, David E. Larson, Christopher A. Miller, Daniel C. Koboldt, John S. Welch, Julie K. Ritchey, et al. 2012. “Clonal Evolution in Relapsed Acute Myeloid Leukaemia Revealed by Whole-Genome Sequencing.” Nature 481 (7382): 506–10. https://doi.org/10.1038/nature10738.
Favero, F., T. Joshi, A. M. Marquard, N. J. Birkbak, M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. 2015. “Sequenza: Allele-Specific Copy Number and Mutation Profiles from Tumor Sequencing Data.” Annals of Oncology: Official Journal of the European Society for Medical Oncology / ESMO 26 (1): 64–70. https://doi.org/10.1093/annonc/mdu479.
Fischer, Andrej, Ignacio Vázquez-García, Christopher J. R. Illingworth, and Ville Mustonen.
2014. “High-Definition Reconstruction of Clonal Composition in Cancer.” Cell Reports 7 (5): 1740–52. https://doi.org/10.1016/j.celrep.2014.04.055.
Gerstung, Moritz, Clemency Jolly, Ignaty Leshchiner, Stefan C. Dentro, Santiago Gonzalez, Daniel Rosebrock, Thomas J. Mitchell, et al. 2020. “The Evolutionary History of 2,658 Cancers.” Nature 578 (7793): 122–28. https://doi.org/10.1038/s41586-019-1907-7.
Gonzalez-Perez, Abel, Christian Perez-Llamas, Jordi Deu-Pons, David Tamborero, Michael P. Schroeder, Alba Jene-Sanz, Alberto Santos, and Nuria Lopez-Bigas. 2013. “IntOGen-Mutations Identifies Cancer Drivers across Tumor Types.” Nature Methods 10 (11): 1081–82. https://doi.org/10.1038/nmeth.2642.
Greaves, Mel, and Carlo C. Maley. 2012. “Clonal Evolution in Cancer.” Nature 481 (7381): 306–13. https://doi.org/10.1038/nature10762.
Jamal-Hanjani, Mariam, Gareth A. Wilson, Nicholas McGranahan, Nicolai J. Birkbak, Thomas B. K. Watkins, Selvaraju Veeriah, Seema Shafi, et al. 2017. “Tracking the Evolution of Non-Small-Cell Lung Cancer.” The New England Journal of Medicine 376 (22): 2109–21. https://doi.org/10.1056/NEJMoa1616288.
Kent, David G., and Anthony R. Green. 2017. “Order Matters: The Order of Somatic Mutations Influences Cancer Evolution.” Cold Spring Harbor Perspectives in Medicine 7 (4). https://doi.org/10.1101/cshperspect.a027060.
Landau, Dan A., Scott L. Carter, Petar Stojanov, Aaron McKenna, Kristen Stevenson, Michael S. Lawrence, Carrie Sougnez, et al. 2013. “Evolution and Impact of Subclonal Mutations in Chronic Lymphocytic Leukemia.” Cell 152 (4): 714–26. https://doi.org/10.1016/j.cell.2013.01.019.
Levine, Arnold J., Nancy A. Jenkins, and Neal G. Copeland. 2019. “The Roles of Initiating Truncal Mutations in Human Cancers: The Order of Mutations and Tumor Cell Type Matters.” Cancer Cell 35 (1): 10–15.
https://doi.org/10.1016/j.ccell.2018.11.009.
Li, Yilong, Nicola D. Roberts, Jeremiah A. Wala, Ofer Shapira, Steven E. Schumacher, Kiran Kumar, Ekta Khurana, et al. 2020. “Patterns of Somatic Structural Variation in Human Cancer Genomes.” Nature 578 (7793): 112–21. https://doi.org/10.1038/s41586-019-1913-9.
Macintyre, Geoff, Teodora E. Goranova, Dilrini De Silva, Darren Ennis, Anna M. Piskorz, Matthew Eldridge, Daoud Sie, et al. 2018. “Copy Number Signatures and Mutational Processes in Ovarian Carcinoma.” Nature Genetics 50 (9): 1262–70. https://doi.org/10.1038/s41588-018-0179-8.
Martincorena, Iñigo, Joanna C. Fowler, Agnieszka Wabik, Andrew R. J. Lawson, Federico Abascal, Michael W. J. Hall, Alex Cagan, et al. 2018. “Somatic Mutant Clones Colonize the Human Esophagus with Age.” Science 362 (6417): 911–17. https://doi.org/10.1126/science.aau3879.
Martincorena, Iñigo, Amit Roshan, Moritz Gerstung, Peter Ellis, Peter Van Loo, Stuart McLaren, David C. Wedge, et al. 2015. “High Burden and Pervasive Positive Selection of Somatic Mutations in Normal Human Skin.” Science 348 (6237): 880–86. https://doi.org/10.1126/science.aaa6806.
McGranahan, Nicholas, and Charles Swanton. 2015. “Biological and Therapeutic Impact of Intratumor Heterogeneity in Cancer Evolution.” Cancer Cell 27 (1): 15–26. https://doi.org/10.1016/j.ccell.2014.12.001.
McGranahan, Nicholas, and Charles Swanton. 2017. “Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future.” Cell 168 (4): 613–28. https://doi.org/10.1016/j.cell.2017.01.018.
Nik-Zainal, Serena, Peter Van Loo, David C. Wedge, Ludmil B. Alexandrov, Christopher D. Greenman, King Wai Lau, Keiran Raine, et al. 2012. “The Life History of 21 Breast Cancers.” Cell 149 (5): 994–1007. https://doi.org/10.1016/j.cell.2012.04.023.
Priestley, Peter, Jonathan Baber, Martijn P. Lolkema, Neeltje Steeghs, Ewart de Bruijn, Charles Shale, Korneel Duyvesteyn, et al. 2019. “Pan-Cancer Whole-Genome Analyses of Metastatic Solid Tumours.” Nature 575 (7781): 210–16. https://doi.org/10.1038/s41586-019-1689-y.
Turajlic, Samra, Hang Xu, Kevin Litchfield, Andrew Rowan, Stuart Horswell, Tim Chambers, Tim O’Brien, et al. 2018. “Deterministic Evolutionary Trajectories Influence Primary Tumor Growth: TRACERx Renal.” Cell 173 (3): 595–610.e11. https://doi.org/10.1016/j.cell.2018.03.043.
Turnbull, Clare, Richard H. Scott, Ellen Thomas, Louise Jones, Nirupa Murugaesu, Freya Boardman Pretty, Dina Halai, et al. 2018. “The 100 000 Genomes Project: Bringing Whole Genome Sequencing to the NHS.” BMJ 361 (April): k1687. https://doi.org/10.1136/bmj.k1687.
Van Loo, Peter, Silje H. Nordgard, Ole Christian Lingjærde, Hege G. Russnes, Inga H. Rye, Wei Sun, Victor J. Weigman, et al. 2010. “Allele-Specific Copy Number Analysis of Tumors.” Proceedings of the National Academy of Sciences of the United States of America 107 (39): 16910–15. https://doi.org/10.1073/pnas.1009843107.
Watkins, Thomas B. K., Emilia L. Lim, Marina Petkovic, Sergi Elizalde, Nicolai J. Birkbak, Gareth A. Wilson, David A. Moore, et al. 2020. “Pervasive Chromosomal Instability and Karyotype Order in Tumour Evolution.” Nature 587 (7832): 126–32. https://doi.org/10.1038/s41586-020-2698-6.
Zaccaria, Simone, and Benjamin J. Raphael. 2020. “Accurate Quantification of Copy-Number Aberrations and Whole-Genome Duplications in Multi-Sample Tumor Sequencing Data.” Nature Communications 11 (1): 4301.
https://doi.org/10.1038/s41467-020-17967-y.

Data Availability

Multiregion colorectal cancer data are deposited in EGA under accession number EGAS00001003066. PCAWG calls are publicly available at the ICGC Data Portal (https://dcc.icgc.org/). CNAqc is implemented as an open source R package that is hosted at the GitHub space of the Caravagna Lab (https://caravagnalab.github.io/CNAqc/). The tool webpage contains RMarkdown tutorial vignettes to run a CNAqc analysis of a generic dataset, as well as documents that explain visualisation and parameterisation of the execution. All analyses in this paper can be replicated following the vignettes.

Authors contribution
All authors conceived the method, which GC formalised and implemented. All authors analysed the data and wrote the manuscript.

Competing interests. The authors declare no competing interests.

Online methods

CNAqc supports two human genome references (GRCh38 and hg19), and the most common CNA profiles found in cancers:
● heterozygous diploid states (1:1) (see footnote 2);
● loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) states;
● triploid (AAB or 2:1) or tetraploid (AABB or 2:2) states.

We make a simplifying assumption, whereby CNAs have been acquired in one step, starting from a simple heterozygous diploid state (the germline). For this reason, for tetraploid segments we only consider copy state 2:2, instead of 3:1 or 4:0. This simplifies the computations: in practice, we avoid working with copy states for which the computation of CCFs is very difficult, and that are quite unlikely to be observed in real data. Also, we consider only clonal CNA segments.
While subclonal CNA segments are certainly important for cancer genomics, the calls that we seek to quality check are just the clonal CNA events: being the ones present in the majority of cancer cells, they have to be prioritised, and subclonal CNAs are only reliable for tumours with good clonal CNA calls.

CNAqc works primarily with Whole-Genome Sequencing (WGS) data. For exome data, the reduced exonic mutation burden can make it more difficult to work with the spectrum of the VAF distribution. In general, the key determinant for detecting peaks in the VAF is the number of mutations per copy state. For tumours exposed to strong mutagenic factors (e.g., smoking) or with a very high mutation rate (e.g., microsatellite unstable tumours), the number of exonic mutations could be high enough to use CNAqc.

Footnote 2: The notation 1:1 is sometimes analogously expressed as genotype AB, 1:0 as A, 2:1 as AAB and 2:2 as AABB.

Peak-detection QC

We consider a somatic mutation present in $m$ copies of the tumour genome, when the sample purity is $\pi$ and the segment ploidy is $p$. Note that $p$ can be computed by summing the number of copies of the minor and major alleles at the mutation locus (Figure 1). The key equations for the expected VAF of a clonal mutation and its CCF are presented in the Main Text. Here we discuss how peaks can be used to QC both tumour purity and CNA segments and, consequently, overall tumour ploidy.
From a QC perspective, if we fix $m$ and $\pi$ in the equations, we can get the expected VAF $v$ of a clonal mutation as $v = \dfrac{m\pi}{2 + (p-2)\pi}$, which means that if we know tumour purity and CNA, we expect a peak at VAF $v$, for a given value of $m$, in the data distribution (Figure 1a and 1b). For instance, for a 1:1 segment ($p = 2$), the expected VAF for a heterozygous clonal ($m = 1$) mutation is 25% for a 50%-purity tumour, and 50% for a 100%-purity tumour. Similarly, for a 2:2 genome ($p = 4$) of a tumour with 75% purity, the expected VAF for clonal mutations accrued before genome doubling, and therefore visible in two copies ($m = 2$), is ~43%, while for those accrued after genome doubling, and therefore present in a single copy ($m = 1$), we expect a ~21% VAF (Dentro, Wedge, and Van Loo 2017). CNAqc checks the data for peaks at these VAFs, with a tolerance $\epsilon > 0$. From the distance between the theoretical expectation and the estimator derived from data, we obtain an error metric for the calls.

CNAqc first performs peak detection from the input VAF with two separate methods:
1. Via a kernel density estimation with fixed bandwidth, which is used to determine a smooth density profile. Peaks are then estimated from the discretised smooth profile, using specialised R packages for peak detection and removing peaks with density below a parameterised cutoff.
2. Via a Binomial mixture from the BMix (Caravagna et al. 2020) package (https://caravagn.github.io/BMix/); a peak is associated with each Binomial probability, for all mixture components.

Peaks are matched to the expected theoretical values based on their Euclidean distance. A theoretical peak can be matched to the closest peak in the data, or to the rightmost one in the frequency spectrum. This latter strategy works only if there are no miscalled CNAs. The first strategy (closest match) is the default CNAqc choice.
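The peak logic described above can be illustrated with a minimal Python sketch. It is not CNAqc code (CNAqc is an R package): the function names, the input lists and the 3% tolerance used in the example are hypothetical, and the expected-VAF relation is assumed to be $v = m\pi / (2(1-\pi) + p\pi)$ for a clonal mutation, which reproduces the 25%, 50% and ~21% values quoted in the text. Peak detection itself (KDE or Binomial mixture) is not shown; the observed peaks are taken as given.

```python
def expected_vaf(m, purity, ploidy):
    """Expected VAF of a clonal mutation present in m copies.

    Assumed relation: v = m * pi / (2 * (1 - pi) + p * pi), with pi the
    sample purity and p the segment ploidy (major + minor allele copies).
    """
    return m * purity / (2.0 * (1.0 - purity) + ploidy * purity)


def match_peaks(expected, observed, tolerance):
    """Closest-match strategy: pair each expected peak with the nearest
    observed peak and flag it PASS if the absolute error is within tolerance.

    In one dimension the Euclidean distance is the absolute difference.
    """
    results = []
    for v in expected:
        closest = min(observed, key=lambda x: abs(x - v))
        error = closest - v
        status = "PASS" if abs(error) <= tolerance else "FAIL"
        results.append((v, closest, error, status))
    return results


# Hypothetical example: a 1:1 segment (ploidy 2) in a 50%-purity tumour.
expected = [expected_vaf(m=1, purity=0.5, ploidy=2)]
observed_peaks = [0.10, 0.26, 0.60]  # hypothetical peaks found in the data
for v, closest, error, status in match_peaks(expected, observed_peaks, tolerance=0.03):
    print(f"expected={v:.3f} matched={closest:.3f} error={error:+.3f} {status}")
```

The signed error is kept rather than its absolute value because its direction may hint at whether purity is over- or underestimated; this is a design choice of the sketch, not a statement about the package.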
For every peak a QC value (PASS or FAIL) is determined based on some tolerance $\epsilon > 0$. The overall QC status of copy states with multiple peaks is the QC of the peak with most mutations underneath. The overall QC status for a sample with many copy states is determined by summing up the QC status of individual copy states, weighting them by the number of mutations associated (majority rule).

CCF estimation

CNAqc can compute CCFs in two ways. One uses the idea of the mixture highlighted in Figure 1c; the other is simpler and works better when data resolution is low and the entropy of the mixture model would leave too many mutations unassigned.

For the mixture approach, we build a 2-component Binomial mixture from the theoretical expectations and the data. This implicitly assumes that peaks have been QCed first.
We constrain the success parameters to match the expected VAF, and use the proportion of mutations that appear underneath a peak as mixing proportions $\pi$ (here $\pi$ denotes the mixture weights, not the sample purity). Then, from the latent variables $z$ of the model we compute the probability of assigning a mutation with VAF $x_n$ to cluster $c$, $p(z_{n,k} = c \mid \theta, \pi)$. From this information we obtain the entropy $H(z)$ of $z$, which is low for values that are assignable to only one cluster. Recall in this respect that the maximum entropy distribution is the uniform one, which is when a mutation is equally likely to be in 1 or 2 copies, based on VAF. We use a simple peak detection heuristic to find the 2 change points in $H(z)$; between those values we cannot reliably assess $m$, i.e. whether the mutation is in single or double copy. For these mutations CNAqc leaves the CCF value as NA.

The alternative approach uses a simpler idea, still working on the expected theoretical VAF. Here, instead of fitting a mixture, we determine the midpoint $o$ between the two expected theoretical VAF peaks. The midpoint is computed by weighting each of the two peaks proportionally to the number of mutations that appear underneath each peak. The midpoint is a cut: values below $o$ are in single copy, values above in two. This procedure requires data with good sequencing coverage and good general quality.

When mutation multiplicities have been determined, CCF computation is trivial, and follows the formula presented in the Main Text. A QC PASS status is assigned to the CCF values for a copy state if fewer than 10% of them (or any custom threshold) are unassigned. The overall sample is given a QC status based on a majority policy.

Genome fragmentation

Some recently identified patterns of somatic CNA changes can be attributed to the presence of highly fragmented tumour genomes, termed chromothripsis and chromoplexy, or localised hypermutation patterns, termed kataegis (Cortés-Ciriano et al. 2020). While these can be identified using dedicated bioinformatics tools, CNAqc offers a simple statistical test to detect the presence of over-fragmentation in a chromosome arm, a prerequisite that could point to the presence of such patterns.

The test works at the level of each chromosome arm (1p, 1q, 2p, 2q, etc.), and uses the length of each input CNA segment to assign a “long segment” or “short segment” status. This is determined by a cut parameter $\mu$ that is set, by default, to 20% (i.e., $\mu = 0.2$). Then, a null hypothesis $H_0$ is used to compute a p-value. It is defined using a Binomial test based on $k$, the number of trials given by the total segment count in the arm, and $s$, the observed number of short segments. The Binomial distribution for $H_0$ is defined by $\mu$, and the p-value is the probability under the null of observing at least $s$ short segments, a one-tailed test for whether the observations are biased towards shorter segments. The p-value is adjusted for family-wise error rate by Bonferroni, dividing the desired $\alpha$-value by the number of tests.

This test is applied to a subset of chromosome arms with a minimum number of segments, and that “jump” in ploidy by a minimum amount (empirical default values estimated from trial data). The arm-level jump is determined as the sum of the differences between the ploidy of consecutive DNA segments.
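As an illustration of the one-tailed Binomial test described above, the following Python sketch computes the upper-tail probability and applies a Bonferroni-adjusted threshold across arms. It is a sketch under stated assumptions, not the package implementation: the function names, the `alpha = 0.05` default and the example counts are hypothetical, the short/long classification of segments is taken as already done, and the pre-filtering of arms by segment count and ploidy jump is omitted.

```python
from math import comb


def binomial_upper_tail(s, k, mu=0.2):
    """P(X >= s) for X ~ Binomial(k, mu): probability under the null of
    observing at least s short segments among k segments in an arm."""
    return sum(comb(k, i) * mu**i * (1.0 - mu) ** (k - i) for i in range(s, k + 1))


def overfragmented_arms(arm_counts, alpha=0.05, mu=0.2):
    """Return the arms whose p-value is below the Bonferroni-adjusted alpha.

    arm_counts maps an arm name to (short_segments, total_segments).
    alpha = 0.05 is an assumed significance level, not a documented default.
    """
    if not arm_counts:
        return {}
    threshold = alpha / len(arm_counts)  # Bonferroni: alpha / number of tests
    pvalues = {arm: binomial_upper_tail(s, k, mu) for arm, (s, k) in arm_counts.items()}
    return {arm: p for arm, p in pvalues.items() if p < threshold}


# Hypothetical counts per arm: (short segments, total segments).
counts = {"1p": (2, 10), "8q": (18, 20), "17p": (3, 12)}
for arm, p in overfragmented_arms(counts).items():
    print(f"{arm}: p = {p:.3g}")
```

The tail probability is summed directly with `math.comb` to keep the example free of third-party dependencies; a statistics library would normally be used for this.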
These covariates are similar to those used to infer CNA signatures from single-cell low-pass WGS (Macintyre et al. 2018).

Other features

CNAqc contains multiple functions to subset the data (e.g., select mutations that map only to certain copy states, or subset CNAs with a given total ploidy), visualise the data (e.g., plot mutational burden by tumour genome) or smooth the input CNA segments. Smoothing is an operation that can be carried out before testing for over-fragmentation. In CNAqc, smoothing merges two contiguous segments if they have exactly the same ploidy profile (i.e., the same numbers for the major and minor alleles) and if they are at most a maximum distance apart (e.g., 1 megabase). This operation does not affect the ploidy profile of the calls, but reduces the number of breakpoints that would otherwise skew the p-value of the Binomial over-fragmentation test.
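The smoothing operation can be sketched in a few lines of Python. This is only an illustration of the merging rule stated above (same major and minor allele counts, at most a maximum distance apart), not the CNAqc function: the `Segment` type, its field names and `smooth_segments` are hypothetical, and the 1 megabase default mirrors the example distance given in the text.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    major: int  # copies of the major allele
    minor: int  # copies of the minor allele


def smooth_segments(segments, max_gap=1_000_000):
    """Merge consecutive segments on the same chromosome that share the
    same major:minor state and lie at most max_gap bases apart."""
    merged = []
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start)):
        if merged:
            prev = merged[-1]
            same_state = (prev.chrom, prev.major, prev.minor) == (seg.chrom, seg.major, seg.minor)
            if same_state and seg.start - prev.end <= max_gap:
                # Extend the previous segment instead of adding a breakpoint.
                merged[-1] = prev._replace(end=max(prev.end, seg.end))
                continue
        merged.append(seg)
    return merged


# Hypothetical input: two nearby 1:1 segments followed by a 2:1 segment.
segments = [
    Segment("chr1", 1, 5_000_000, 1, 1),
    Segment("chr1", 5_200_000, 9_000_000, 1, 1),
    Segment("chr1", 9_000_001, 12_000_000, 2, 1),
]
for seg in smooth_segments(segments):
    print(seg)
```

Because only segments with identical allele counts are merged, the copy state reported at any covered position is unchanged; only the number of breakpoints decreases.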
Main Text Figures

Figure 1. a. Theoretical VAF histogram for diploid 1:1 mutations in a tumour. A clonal heterozygous mutation has 50% VAF; all mutations are observed with some Binomial sequencing noise. The clonal mutations form a peak at 100% CCF, plus other features that characterise the tumour clonal composition (e.g., the tail). The expected theoretical VAF decreases if sample purity reduces. b. The case of a 2:1 tumour genome, where we expect 2 peaks in the VAF originating from mutations present in one (orange) or two copies (purple). The multiplicity of a mutation can phase whether it happened before or after the CNA. For 2:1 we expect peaks at 66% and 33% VAF, both for clonal mutations (100% CCF). c. Computing CCFs requires caution for mutations with different multiplicities; we support 2:0, 2:1 and 2:2 copy states in CNAqc, and offer two methods to compute CCFs. The one depicted is based on the entropy of a Binomial mixture. From the expected VAF peaks we construct a mixture density and use the entropy of its latent variables to capture uncertainty in the multiplicities. At the crossing of the components we cannot easily assign multiplicities, and therefore CCFs; the entropy peaks where the uncertainty is highest, by definition. d. Heatmap expressing the relationship between copy states, mutation multiplicity and sample purity. The color reflects the expected VAF for the corresponding mutations, and can be used to QC both CNAs and purity estimates.
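As a worked illustration of panels b and c, the following Python sketch assigns multiplicities with the simpler midpoint method from the Online methods and then converts VAF to CCF. Several points are assumptions rather than facts from the paper: the CCF expression $c = v[(p-2)\pi + 2]/(m\pi)$ is taken as the inverse of the expected-VAF relation, the midpoint is implemented as a count-weighted mean of the two expected peaks (one possible reading of the weighting described in the text), and all function names and mutation counts are hypothetical.

```python
def expected_vaf(m, purity, ploidy):
    """Expected VAF of a clonal mutation present in m copies (CCF = 1)."""
    return m * purity / (2.0 + (ploidy - 2.0) * purity)


def ccf_from_vaf(vaf, m, purity, ploidy):
    """Assumed CCF relation: c = v * [(p - 2) * pi + 2] / (m * pi)."""
    return vaf * ((ploidy - 2.0) * purity + 2.0) / (m * purity)


def weighted_midpoint(v1, v2, n1, n2):
    """Cut between the two expected peaks, weighted by the number of
    mutations under each peak (assumption: a count-weighted mean)."""
    return (n1 * v1 + n2 * v2) / (n1 + n2)


def assign_multiplicity_and_ccf(vaf, purity, ploidy, n1, n2):
    """Multiplicity is 1 below the cut and 2 at or above it; the CCF
    then follows from the assumed relation."""
    v1 = expected_vaf(1, purity, ploidy)
    v2 = expected_vaf(2, purity, ploidy)
    cut = weighted_midpoint(v1, v2, n1, n2)
    m = 1 if vaf < cut else 2
    return m, ccf_from_vaf(vaf, m, purity, ploidy)


# 2:1 segment (ploidy 3) in a pure tumour: expected peaks near 33% and 66% VAF.
for vaf in (0.30, 0.35, 0.64, 0.70):
    m, c = assign_multiplicity_and_ccf(vaf, purity=1.0, ploidy=3, n1=100, n2=100)
    print(f"VAF={vaf:.2f} -> multiplicity {m}, CCF={c:.2f}")
```

Unlike the entropy-based method, this cut assigns a multiplicity to every mutation, including those near the crossing of the two peaks, which is why the text recommends it only for data of good coverage and quality.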
Figure 2. a. Genome-wide total clonal copy number segments for a PCAWG cancer sample with overall ploidy 2, and sample purity ~85%. The panel is composed of three illustrations. The bottom plot reports the copies of the major and minor alleles in each segment, and some genome areas are shaded. The central plot shows genome-wide somatic mutations with their depth of sequencing, and the top plot shows the total number of mappable mutations binned every megabase. b. Variant Allele Frequencies (VAFs) for the mutations that map to the input segments (note that these are all SNVs). c. Depth of sequencing (DP) for every SNV. d. Number of reads (NV) with the variant allele for every SNV. e. Cancer Cell Fractions (CCF) estimation for this sample, obtained from CNAqc.

Figure 3. a-d. Peak detection analysis assessing the quality of CNA segments (split by copy state), and tumour purity. The shaded gray area shows the input mutations, and the thin black profile is their kernel density estimation (KDE).
The black circles represent the peaks detected from the KDE, and the vertical dashed lines are the expected peaks, given the tumour purity. If the data peaks fall within the shaded area surrounding the vertical line, the estimates are consistent and the plot is therefore green (QC pass). For copy states with total copy number >2, multiple peaks are checked independently. In that case the overall QC status for the copy state is a linear combination of the results, weighted by the number of mutations assignable to each peak. e-h. Cancer Cell Fractions (CCF) estimation for each tumour genome, using the entropy method. Each plot shows both the CCF and the VAF from which mutation multiplicities are computed. In the rightmost panel we overlay the entropy profile computed by a 2-dimensional Binomial mixture. Areas within the red vertical dashed lines are those for which CNAqc cannot assign a confident CCF value. For copy states 1:0 and 1:1 the mutation multiplicity is fixed to 1 by definition.

Figure 4. a. Circos plot for four possible whole-genome CNA segmentations determined by Sequenza with WGS data (~80x median coverage, purity 87%). The input sample is Set7_57, one of four multi-region biopsies for colorectal cancer patient Set7. The first run is with default Sequenza parameters. With CNAqc, we slightly adjust purity estimation and obtain a final run of the tool.
We also include one run forcing overall tumour ploidy to 4 (tetraploid), and one with maximum tumour purity 60%. b. Purity and ploidy estimation for the four Sequenza runs. Arrows show the adjustment proposed by CNAqc; the default and final runs are the only ones to pass QC. c. Final run with perfect results for Set7_57: copy number segments, depth of coverage per mutation and mutation density per megabase. d. Miscalled copy-neutral LOH segment, obtained by forcing a tetraploid solution in Sequenza. For a 2:0 segment with the estimated Sequenza purity we expected peaks at ~60% and ~30% VAF, which cannot be matched. e. CNA calling with CNAqc and Sequenza for 4 WGS biopsies of the primary colorectal cancer Set7.

Figure 5. a. Summary CNAqc pass or fail barplot for $n = 1065$ top-quality PCAWG samples across distinct tumour types. Failures for peaks are with a 3% error tolerance, and CCFs with 10% of SNVs not assignable, per copy state. b. Zoom peak analysis with a scatter showing, for every tumour type, the total cases per tumour against the proportion of passes or fails; each dot size is proportional to the error measure from mismatched peaks.

Supplementary Figures

Supplementary Figure S1. PCAWG sample with low mutational burden.
Supplementary Figure S2. Sample Set7_55 (multi-region).
Supplementary Figure S3. Sample Set7_59 (multi-region).
Supplementary Figure S4. Sample Set7_62 (multi-region).
Supplementary Figure S5. Sample Set6_42 (multi-region).
Supplementary Figure S6. Sample Set6_44 (multi-region).
Supplementary Figure S7. Sample Set6_45 (multi-region).
Supplementary Figure S8. Sample Set6_46 (multi-region).
Supplementary Figure S9. Sample Set6_47 (multi-region).
Supplementary Figure S10. Sample Set6_48 (multi-region).
Supplementary Figure S11. PCAWG sample with overestimated 100% purity.
Supplementary Figure S12. PCAWG sample with true 99% purity.
Supplementary Figure S1. Example PCAWG medulloblastoma sample with low mutational burden, which passes data QC with CNAqc. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). Note that this sample has only 76 SNVs in diploid tumour regions, as we observe in whole-exome assays. b,c. Peak analysis and CCF computation for diploid SNVs.

Supplementary Figure S2. Colorectal multi-region sample Set7_55 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S3. Colorectal multi-region sample Set7_59 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S4. Colorectal multi-region sample Set7_62 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S5. Colorectal multi-region sample Set6_42 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S6. Colorectal multi-region sample Set6_44 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S7. Colorectal multi-region sample Set6_45 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S8. Colorectal multi-region sample Set6_46 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S9. Colorectal multi-region sample Set6_47 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S10. Colorectal multi-region sample Set6_48 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure 11. Example PCAWG sample with purity of 100%. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b. This sample has 75% of its SNVs in diploid tumour regions, where a small peak is detectable at the expected purity. The VAF clearly peaks at ~10%, possibly suggesting a purity of 20% or lower, rather than 100%. Further doubts about the current purity come from non-diploid regions, where all peaks are mismatched; for this sample CNAs called with a low-purity solution should be compared to the 100% purity solution. c. CCF computation for the sample. Notice that in triploid and tetraploid tumour genomes we do not find mutations present in 2 copies. If this were true, the tumour would not have acquired any SNV right before the CNA. Also, here we are not cross-checking QC results from peak
detection; for instance we could decide to use only mutations that map to PASS states (1:1, 2:2), and reject all others.
Supplementary Figure 12. Example PCAWG pancreatic adenocarcinoma with 99% purity (and 3 possible driver SNVs, 2 of them involving tumour suppressor genes in LOH regions). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b. This sample has 90% of its SNVs in diploid tumour regions, and the others in a variety of distinct CNA segments. From a peak analysis point of view, all the calls are validated. c. CCF values for this sample are also good.
10_1101-698605 ---- Comparative evaluation of full-length isoform quantification from RNA-Seq
Dimitra Sarantopoulou1,#a ¶, Thomas G. Brooks1¶, Soumyashant Nayak1, Anthonijo Mrcela1, Nicholas F. Lahens1, Gregory R. Grant1,2*
1 Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
2 Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
#a Current address: National Institute on Aging, National Institutes of Health, Baltimore, Maryland, United States of America
¶ equal contributors
* Corresponding author Email: ggrant@pennmedicine.upenn.edu (GG)
Abstract
Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and has been an area of active development since the beginning. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short. Here we use simulated benchmarking data that reflects many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome, transcriptome and pseudo alignment-based methods are included; and a simple approach is included as a baseline control. Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We determine the structural parameters with the greatest impact on quantification accuracy to be length and sequence compression complexity and not so much the number of isoforms.
The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform level DE should still be employed selectively.
bioRxiv preprint doi: https://doi.org/10.1101/698605; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords
benchmarking, isoform quantification, simulated data, pseudo-alignment, RNA-Seq, short reads
Background
Alternative splicing and isoform switching play central roles in cell function; and disruption of the splicing mechanism is associated with many diseases and drug targets (1,2). The function of a protein is ultimately determined by its full complement of functional domains. Differential splicing typically involves a reshuffling of the functional domains to construct a functionally different protein. Gene level analyses must ignore these differences. Before things like pathway enrichment analysis can be brought down to the transcript level, it will be necessary to quantify expression of full-length isoforms. For investigations specifically focused on splicing, one also has the option of working at the local splicing level (e.g., MAJIQ (7)). If, for example, full-length isoform quantification simply leads to an exon skipping event, that would have also been found by local splicing methods. Investigators must therefore carefully factor in the goals of their analysis to decide at which level features should be quantified. Another reason for estimating isoform level expression is to give better estimates of gene level expression.
Indeed, it is not clear how to achieve gene level quantification from local splicing information. For various purposes, full length isoform quantification must be more informative than local splicing information when it can be achieved, and the primary reason local splicing methods are popular right now is the relative difficulty of working with full-length isoforms. The fact that isoform quantification is a key goal for modern transcriptomic profiling is reflected in how active the community has been in developing methods and how popular those methods have been, in spite of their notoriously high false positive rates. Despite many published algorithms, in practice effective quantification of full-length isoforms from short-read RNA-Seq remains problematic and therefore has never been routine. The fundamental limitation is that individual short reads do not contain information on long-range interactions that would associate splicing events that are separated by more than the fragment length. Regardless, methods can exploit additional biological and stochastic information, like canonical splice sites, which combined with alignment information can increase accuracy (3–6). Although long sequence read technology is improving, compared to short read technology it continues to be lower throughput with a much higher base-wise error rate and is generally more expensive.
Therefore, most RNA-Seq studies are still performed with short reads and this will likely remain the case until competing technologies mature. Short reads are typically 100-150 bases long, and usually obtained from both ends of short 200-500 base fragments. Meanwhile a significant portion of RNA transcripts are over 1000 bases and many are much longer. Given the difficulty in full-length isoform quantification, many RNA-Seq studies simply quantify at the gene level, which is much easier because uniquely aligning reads are rarely ambiguous at the gene level. Indeed, unless the investigator is specifically interested in splicing, gene level analysis will likely lead to the same conclusions, since all isoforms of the same gene typically have the same pathway annotations. Meaningful unbiased benchmarking conclusions rely on independent investigations and realistic benchmarking data where the ground truth is known or well-approximated. There are in fact a few independent studies that compare the performance of transcript quantification methods using simulated data (8), real data (9), or a hybrid approach with both real and simulated data (10–12). So why did we embark on another comparative study? Angelini et al (8) and Kanitz et al (12) are five and six years old, respectively, and hence they do not reflect the recent developments in this fast-changing field. For instance, they do not include the popular pseudo-alignment-based methods kallisto (19) and Salmon (18). Angelini et al (8) take the approach of using simulated data, which is most similar to the approach employed here, however, they utilize the FLUX simulator which does not allow for many of the effects of real data we can model using the BEERS simulator (16). Also, the primary focus is on detection of isoforms as being “present/absent” in the sample and accuracy of quantification was presented as tables of quantiles. 
Their conclusion was that “all tables indicate that the problem of obtaining reliable estimates is still open.” Therefore, these methods require ongoing evaluation by the user community. Zhang et al (10) use the human universal reference sample (UHRR) and the human brain universal reference (HBRR) which are such artificial samples that it is not clear what practical guidance can be drawn from the results. In particular the UHRR is a mixture of 10 cancer cell lines. Cancer transcriptomes are notoriously scrambled and mutated, and therefore represent a very special case, particularly with regards to annotation-based quantification. Moreover, a mixture of ten such cell lines gives a sample so different from what researchers use in practice that it precludes the possibility of evaluating the methods in the context of a typical differential expression analysis, which is the main goal of most RNA-Seq studies. With the UHRR and HBRR samples only technical replicates can be generated, while all DE methods require biological replicates. Simulated data which mimic real samples is arguably more realistic than real data obtained from mixtures of 10 cancer cell lines.
In silico simulated data offer more control as the truth is known exactly, but these data invariably simplify some of the inherent complexities of real data. In 2016, Teng et al (13) published very nice guidance on quantification benchmarking. Their approach assumes one has benchmarking data where the truth is known on the level of differential expression, without assuming as known the actual quantified values. Since the goal here is to investigate quantification accuracy directly, the methods in Teng et al are not directly applicable. Other studies focus only on single-cell data (14), or on differential splicing (15). Commonly, RNA-Seq transcript level quantification is validated by PCR. However, PCR is low-throughput and is based on probes that interrogate only a small part of a given transcript; it is also sensitive to biases at the amplification step. On the other hand, in in silico simulated data the truth is known exactly. Hayer et al (11) investigated de novo transcriptome assembly, where isoform structures need to be inferred directly from the RNA-Seq data and concluded that none of the evaluated methods is accurate enough for routine use and further method development is required. The problem we investigate is considerably easier; isoform level annotation is given and reads must just be assigned to the correct isoform. Approaches for quantifying isoform expression can be divided into three main categories. The first approach uses reads mapped to the genome by an intron-aware aligner, e.g. STAR (17). The genome alignment information is then used to assign quantified values to transcripts (3–6). The second approach is similar to the first, except it is based on reads aligned directly to the transcriptome, rather than the genome (6,17). The third approach follows the concept of pseudo-alignment which prioritizes execution performance and does not involve bona fide alignment (18,19). 
In reality, all genome aligners are transcriptome-aware, and transcriptome alignments are genome-aware, so the distinction is not as cut and dried as it once was. But nonetheless, we continue to distinguish the two, with caveats. There are many published methods for quantifying full-length isoforms; however, the vast majority of studies performing isoform specific analysis have used Cufflinks, RSEM or some simple counting method following genome alignment (Fig 1) (20–22). Pseudo-aligners were introduced more recently and therefore have lower adoption but are beginning to see wider usage (23,24). Here we present a benchmarking analysis of the six most popular isoform quantification methods: kallisto, Salmon, RSEM, Cufflinks, HTSeq, and featureCounts, based on a survey of the literature (Fig 1). HTSeq and featureCounts are not recommended by their authors for full-length isoform quantification; however, they were included for the purpose of comparison and because they are used in practice.
We also include a naïve read proportioning method, based on employing the distribution of signal inferred from the unambiguous read alignments to portion out the ambiguous read alignments, similar to the method first described by Mortazavi et al (40). We generated datasets from two mouse tissues, Liver and Hippocampus, which are known to be quite different in terms of splicing, with brain generally being more complex than any other tissue. A hybrid approach is taken to obtaining benchmarking data, where real samples are emulated to generate simulated data where the true isoform abundances are known; this was done using a modified version of the BEERS simulator (16). Idealized data were generated to obtain upper bounds on the accuracy of all methods. Data were also generated with variants, sequencing errors, intron signal and non-uniform coverage, to assess how they affect performance. Since annotation is never perfect, we evaluate performance while varying annotation completeness.
Fig 1. Most popular quantification methods. Ranking of quantification methods by the number of times found in the 100 most recent RNA-Seq studies (published during March-May, 2019), which reported the quantification method used.
Usually, the aim of an RNA-Seq analysis is to inform a downstream differential expression (DE) analysis. Therefore, we also evaluate the methods on this level, using both real and simulated data.
However, it is much more challenging to produce realistic data with known ground truth at the DE level. Unlike isoform level quantification which is sample-specific, DE ground truth is established at the population level, and therefore involves much more complex benchmarking data. Our simulated samples reflect the complex joint distribution of expression across biological replicates, and thus it is meaningful to perform a DE analysis on them. This is described in more detail below but briefly, in lieu of knowing the ground truth in terms of which isoforms are differentially expressed, for each method we compare the DE analysis performed on the known true isoform quantifications of the simulated data to the DE analyses performed on the estimated counts determined using the particular method. The more different the two analyses are, the less accurate the quantification method must be in informing the DE analysis. This then allows us to compare the methods in terms of their accuracy of quantification. It is possible that a method underperforms another method at the level of quantification, but outperforms it in the DE analysis. Results Hybrid benchmarking study using both real and simulated data For the simulated data we started with 11 real RNA-Seq samples: six liver and six hippocampus samples from the Mouse Genome Project (25). Isoform expression distributions were estimated from these samples in (7) which were then used to generate simulated data for which the source isoform of every read is known. Two types of simulated datasets were generated with the BEERS simulator (16). First, idealized simulated data were generated, with no SNPs, indels, or sequencing errors, no intron signal and uniform coverage across each isoform (7). Second, simulated data were generated with polymorphisms (SNPs and indels), sequencing errors, intron signal, and empirically inferred non-uniform coverage (7). 
Relative performance on idealized data does not necessarily reflect relative performance on real data, but we do expect the accuracy of the methods on idealized data to be upper bounds on the accuracy in practice. If a bound on idealized data is below what one would tolerate in practice, then the method cannot be expected to be viable in practice. The (more) realistic data provide insight into the effect of the various factors on the method performance. The realistic data probably also gives bounds on accuracy of real data since it was designed to be no more complex than real data. For simplicity of exposition, we will refer to the data with the complexities as the “realistic” data, keeping in mind it does not reflect every property of real data, just the five properties listed above (SNPs, Indels, Sequencing Error, Intron Signal and Non-Uniform Coverage). For both the idealized and realistic simulated data, we use three liver and three hippocampus samples to evaluate isoform quantification, and six liver and five hippocampus samples to evaluate DE analysis, as in [21]. All samples were obtained from independent animals raised as biological replicates.
Comparisons between tissues were employed to assess consistency and differential expression; brain has a more complex transcriptome than other tissues (26), and thus isoform level analysis is expected to be more challenging for the algorithms. We performed a comparative analysis of seven of the most commonly used full-length isoform quantification algorithms; kallisto (19), Salmon (18), RSEM (6), Cufflinks (4), HTSeq (3), featureCounts (5) and a naïve read proportioning approach similar to the method first described by Mortazavi et al (40) (NRP; See Methods). Kallisto and Salmon are pseudo-aligners; RSEM, Cufflinks, HTSeq, and featureCounts are genome alignment-based approaches where the alignments are guided by incorporating transcriptome information, and NRP is a transcriptome alignment-based approach. These methods were evaluated at the isoform expression level using idealized and realistic simulated data, with full and incomplete annotation, and also at the differential expression level using both realistic and real data. Comparison of full-length quantification methods Idealized data The idealized data has no indels, SNP’s, or errors, includes no intron signal, and deviates from uniform coverage across each isoform only as much as may happen due to random sampling. Under such perfect conditions we expect that all methods will achieve their best performance. The data were aligned to the reference genome or transcriptome with STAR (17) and quantified with the seven methods. In Fig 2A, estimated expression is plotted against the true transcript counts, for each method in Liver. Each point represents the average of the three replicates of that tissue. A point on the diagonal indicates a perfect estimate. A point on the X-axis indicates an unexpressed transcript which was erroneously given positive expression. A point on the Y-axis indicates an expressed transcript which was erroneously given zero expression. 
Fig 2. Comparison of estimated quantification with the truth in simulated data. (A,B) Scatter plots between the inferred and true counts. Each point represents the average expression of three samples. A) idealized data B) realistic data. (C,D) Percentiles of the |logFC| (relative to true counts), for the set of expressed isoforms in sample 1 in C) idealized and D) realistic data. A point (x,y) on a graph means x% of the transcripts have |logFC|0. Specifically, a point (x,y) on the graph means x% of transcripts have |logFC|y.
Fig 8 shows the percentile plots of the |logFC|. Hippocampus sample Hip1 is shown but all samples Hip and Liv look very similar. The first thing to note is that removing the maximally expressed isoform has dramatically decreased the accuracy of all methods except for HTSeq and featureCounts.
And removing the non-expressed isoforms has marginally increased accuracy for those methods. In contrast, for HTSeq and featureCounts we observe the opposite. Removing the non-expressed isoforms has dramatically decreased accuracy and removing the highest expressed isoform has made very little difference, particularly with featureCounts. Fig 9 compares for the different methods the percentile plots for removing the maximally expressed isoform. This eliminates the isoform of the majority of the reads so should have a dramatic effect on accuracy. Here Salmon has the most difficulty and HTSeq and featureCounts are the most robust to this, followed by NRP. Here we see a significant difference between Salmon and kallisto that goes in the opposite direction of the differences seen by the other perspectives.
Fig 9. Removal of highest expressed isoforms. The annotation was modified by removing the highest expressed isoform of every gene. For each method the percentile plots are shown. Here a point (x,y) on a curve means x% of isoforms have |logFC|>y. The lower the curve, the better. Surprisingly, Salmon has the most difficulty and HTSeq the least.
Effects on differential expression
Next, we use differential expression to assess quantification accuracy. If differential expression analysis is the downstream goal for the quantified values, then it does not matter if the absolute abundances differ from the truth, as long as the DE p-values are unaffected.
To investigate this, the two tissues were compared against each other; the tissues are different enough that there is an abundance of differentially expressed genes. Six hippocampus samples and six liver samples of the realistic data were quantified, with each of the seven methods, and the resulting quantified values were used as input for DE analyses with EBSeq (41), which is optimized for isoform differential expression. The p-values generated from the true counts are compared to p-values from the inferred counts - the assumption being that the closer a DE analysis on the inferred counts is to the corresponding DE analysis on the true counts, the more effectively the method has quantified the expression, with respect to informing the DE analysis. Kallisto and Salmon are recommended to be run with Sleuth; however, since Sleuth cannot take true counts as input, the comparison would not be meaningful. Since we are comparing EBSeq (truth) to EBSeq (inferred) for all methods, it should be meaningful to compare methods to each other with this metric.
Fig 10. Method effect on differential expression analysis, using realistic data. For each method, a DE analysis with EBSeq was performed between the two tissues. (A) A point (x,y) on a curve means for the top x DE transcripts using real counts, and the top x DE transcripts using the inferred quants, have Jaccard index y. (B) A point (x,y) on a curve means there are y isoforms with q-value < x. The curves should be evaluated in relation to the truth, which is the gray curve. At varying q-value cutoffs between 0.05 and 0.2 all methods become anti-conservative. Salmon and Cufflinks track the truth closest at small cutoffs.
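As an illustration of the overlap measure in Fig 10A, the following minimal Python sketch computes the Jaccard index (size of the intersection divided by size of the union) between the top n transcripts ranked on true counts and the top n ranked on inferred counts. The function names, the dictionary inputs and the toy values are hypothetical and are not taken from the study's code; breaking p-value ties by larger |logFC| is an assumption about the direction of the tie-break.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def top_n(p_values, abs_logfc, n):
    """Ids of the n most significant transcripts.

    Ties in p-value are broken by larger |logFC| (assumed direction).
    """
    ranked = sorted(p_values, key=lambda t: (p_values[t], -abs_logfc[t]))
    return ranked[:n]


def jaccard_curve(p_true, lfc_true, p_est, lfc_est, ns):
    """(n, Jaccard index) pairs comparing the two top-n lists for each n."""
    return [
        (n, jaccard_index(top_n(p_true, lfc_true, n), top_n(p_est, lfc_est, n)))
        for n in ns
    ]


# Toy example with four hypothetical transcripts.
p_true = {"t1": 0.0, "t2": 0.0, "t3": 0.01, "t4": 0.2}
lfc_true = {"t1": 3.0, "t2": 1.0, "t3": 2.0, "t4": 0.1}
p_est = {"t1": 0.0, "t2": 0.05, "t3": 0.0, "t4": 0.3}
lfc_est = {"t1": 2.5, "t2": 0.8, "t3": 2.9, "t4": 0.2}

curve = jaccard_curve(p_true, lfc_true, p_est, lfc_est, ns=[1, 2, 3, 4])
for n, j in curve:
    print(n, round(j, 3))
```

A curve that stays close to 1 for small n would indicate that the most significant calls on inferred counts largely coincide with those on true counts.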
Comparing two developmentally divergent tissues, we expect the majority of transcripts that are expressed to be differentially expressed. Fig 10A shows the overlap with the truth for the top n most significant transcripts, as n varies from 1 to 50,000. Since EBSeq reports many zero p-values, rounded down from its limit of precision, ties were broken with the logFC. The vertical axis is the Jaccard index (29) of the top n DE transcripts determined using the real counts and the top n DE transcripts determined using the inferred counts. The Jaccard index of two subsets of a set is the size of the intersection divided by the size of the union. The higher the curve, the better. Salmon and Cufflinks perform best from this perspective, followed by RSEM. NRP and kallisto appear roughly equivalent.

In Fig 10B the number of DE transcripts is plotted as a function of the q-value cutoff (S5 Table). If a curve rises above the truth, then that method must be reporting more false positives than the q-value indicates. At varying places between 0.05 and 0.2 all methods become anti-conservative. Salmon and Cufflinks track the truth closest at small cutoffs.

These data can also be used to evaluate the DE methods themselves: EBSeq, Sleuth and DESeq2. DESeq2 is included for reference, but it was not specifically designed with transcript-level DE in mind. In DE benchmarking, it is notoriously difficult to determine a benchmark set of either differential or non-differential transcripts. However, if an isoform has zero expression in all replicates of both conditions, then it must necessarily be non-differential. A total of 43,872 isoforms have zero true expression in all replicates of both conditions. Any transcript called DE in this set must be a false positive arising from mistakes in the quantification process. This allows us to define a lower bound on the actual FDR, because the number of these null isoforms that were called DE is a lower bound on the number of false positives. This lower bound on the FDR is plotted as a function of the q-value cutoff (Fig 11A). Additionally, the actual number of null isoforms called DE is plotted as a function of the q-value cutoff in Fig 11B.

Fig 11. Method effect on differential expression analysis, using realistic data. The 43,872 isoforms with zero true expression in both Liver and Hippocampus serve as a set of null isoforms for the DE analysis. (A) gives a lower bound on the true FDR of the isoforms rejected at each q-value cutoff. Plots above the black line are anti-conservative. (B) Same as A, but shows the actual number of null isoforms determined DE as a function of the q-value. Note that only 94,929 isoforms exist in total.

Fig 11A shows that in all cases the true FDR is much greater than reported. Indeed, Fig 11B shows that even at very small q-values EBSeq and DESeq2 are reporting thousands of these false positives. At a q-value cutoff of 0.01, at least 1,000 of these null isoforms are called DE with any method. These cannot simply be the 1% false positives allowed by an FDR of 0.01, since that would then require an additional 99,000 true positives, which is more isoforms than are even annotated. Why is this happening?
When an isoform has zero true expression but another isoform of the same gene has positive expression, it is easy for reads of the expressed isoform to be misassigned to the unexpressed one. However, if none of the isoforms of a gene are expressed, it is far less likely that any of the isoforms are assigned spurious reads, since it is much less likely that any reads map anywhere to the gene’s locus. Therefore, if a gene has no expressed isoforms in Liver and has one or more expressed isoforms in Hippocampus, in addition to one or more unexpressed isoforms, then the unexpressed isoforms will tend to have zero expression in Liver and will tend to incur spurious expression in Hippocampus. Such isoforms are then easily mistaken as differential. An isoform-level DE method should account for this variability, but we see in Fig 11 that both EBSeq and Sleuth are anti-conservative. The isoform-level DE methods do, however, outperform DESeq2, which is not intended for transcript-level analysis. On the quantification methods where it is applicable, Sleuth shows the lowest false positive rate, reflecting the fact that it uses additional variance information from bootstrap samples.

Evaluation with real data

In all comparisons performed with the simulated datasets, HTSeq and featureCounts are very similar, and kallisto, Salmon, RSEM, Cufflinks, and NRP are also generally comparable. To explore whether the comparative analyses can be replicated with a real experiment, we used the real data that informed the simulations. Here we used six Hippocampus and six Liver samples.
Hierarchical clustering was performed with correlation distance on the average expression of the six samples. The results recapitulate these two groups in hippocampus (Fig 12A), while in liver Cufflinks clusters further away and alone (Fig 12B), as in the realistic simulated data (Fig 3A-B). This suggests that Cufflinks is strongly influenced by a tissue-specific effect and confirms that the simulated data successfully capture properties of the real data.

Furthermore, we compare the seven quantification approaches on how well they inform a DE analysis, using the real data. We quantified six samples from each tissue with the seven methods, followed by DE analysis between the two tissues using EBSeq. The methods cluster similarly for both realistic and real data (Figs 3,12). There is a significant difference among the seven methods in the number of DE transcripts identified at various q-value cutoffs (Fig 12D, S6 Table).

Fig 12. Method effect on DE analysis, using real data. Hierarchical clustering by correlation distance of the average expression using A) six liver samples or B) six hippocampus samples. C) Hierarchical clustering by correlation distance of the logFC of hippocampus over liver samples. For each method, we performed a DE analysis between the two tissues. D) Number of DE transcripts identified at various q-value cutoffs.

Discussion

Isoform level quantification has been an area of active development since the inception of RNA-Seq.
It got off to a rough start and progressed slowly but steadily, and we see considerable improvement over the last five years. Nevertheless, using both realistic simulated and real data, no method achieved high enough accuracy across the board that it can be recommended for general purposes. Overall, Salmon marginally outperformed the other methods by our benchmarks. It must be kept in mind, however, that the additional complexities of real data will likely affect those marginal differences in unpredictable ways. Therefore, if one is going to do full-length isoform quantification at this stage, then Salmon or RSEM could be equally effective choices. Cufflinks performs well from many perspectives, but the erratic behavior in the Liver clustering (Figs 3,12) is concerning.

Salmon, as a pseudo-aligner, has the advantage of efficiency. However, if one is performing small or medium-sized RNA-Seq studies, then genome alignments should in principle always be performed anyway so that coverage plots can be examined in a genome browser. Since there is no shortcut to that process, the efficiency advantages of Salmon and kallisto really only come into play when hundreds or thousands of samples must be processed. Since data sets with hundreds of thousands of samples are on the horizon, this is a real concern. But for most targeted RNA-Seq analyses, as is done routinely in research labs, this will factor less into the decision.

Salmon (18) is similar to kallisto, and originally was identical except for incorporating a sample-specific model of fragment GC bias to improve its quantification estimates. Our simulated data, generated by BEERS (16), do not reflect these biases, and thus this feature of Salmon could not be reasonably evaluated in this study. The only simulator currently available that models fragment GC biases is Polyester (33).
However, both Polyester and Salmon use the same underlying model for fragment GC bias (34), which may bias results in Salmon’s favor. Salmon further has options to control for read start sequence bias (such as from random hexamer priming) and positional bias (such as 5’ or 3’ bias), which were also not evaluated here. Future benchmarking studies will require datasets (both real and simulated) that capture the true sequence properties underlying non-uniform coverage in order to quantitatively assess the performance impact of incorporating a fragment bias model. This will be accounted for in BEERS2.0.

Additionally, we investigated some extreme cases of inaccuracy, in both simulated and real data, where transcripts were estimated to be highly expressed by one method and non-expressed by another. In the simulated data, we identified enriched genomic properties that drive the deviation of each method from the true counts. In the real data, we isolated one example of large quantification differences between methods. In this example, the inclusion of a single read causes kallisto and RSEM to disagree by 137 counts to 0, and the difference resolves if that read is removed. This edge case occurred because only two reads distinguished unambiguously between the two isoforms of a highly expressed gene. The transcript-level DE method Sleuth (31) uses bootstrap resampling to control for possibilities like this example.
EBSeq uses the number of sibling isoforms as a factor in its variance computation. However, our analysis indicates that while these methods outperform DESeq2, they could still be generating too many false positives. In particular, when all isoforms of a gene are unexpressed in the first condition and one isoform is expressed in the second condition, we observe many false positives on the other unexpressed isoforms of that same gene, due entirely to quantification inaccuracy.

Overall, kallisto and Salmon, as alignment-free methods, require less computational time while achieving similar or better accuracy compared to other methods, whereas RSEM and Cufflinks perform well among the alignment-dependent methods. However, our results indicate that all tested methods should be employed selectively, especially when long transcripts with many isoforms or transcripts with low sequence complexity are the candidates of interest for the study. NRP is a simple approach that is relatively robust to polymorphisms, non-uniform coverage and intron signal; however, it struggles as the number of isoforms grows. In any case, it performs as well as, and in some cases better than, more sophisticated methods, suggesting that information extraction and inference from short RNA-Seq reads is largely saturated and that future, more complex models might offer only small benefits in gene isoform quantification.

These results indicate the differing strengths of different approaches to this problem. As such, it may be possible to leverage the different methods to achieve greater overall accuracy. For example, NRP, HTSeq and featureCounts appear to do better on one-isoform genes, so it may make sense to treat those genes separately. In any case, this must continue to be an active area of research before the technology can transform transcriptomics and realize the advantages of full-length isoform quantification.
Methods

Data generation

We used the same method for generating simulated data as described in Norton et al (7). For all of the procedures described below, we used gene models from release 75 of the Ensembl GRCm38 annotation, and sequence information from the GRCm38 build of the mouse genome. We used the empirical expression levels and percent spliced in (PSI) values across all of the Mouse Genome Project (MGP) (25) liver and hippocampus samples estimated in Norton et al (7). Briefly, the samples were aligned with STAR, and gene-level counts were calculated with htseq-count. Next, Ensembl transcript models were used to identify local splicing variations (LSVs): loci with exon junctions that start at the same coordinate but end at different coordinates (or vice versa). Of the 41,133 annotated genes expressed in the MGP data, 3,055 were randomly selected to reflect the empirical PSI values for their associated transcripts. For this "empirical set" of genes we estimated PSI values separately for each sample by comparing the relative ratios of all junction-spanning reads that mapped to an LSV. These PSI values reflect the biological noise and real differential splicing (if any) between the two tissues. For each of the remaining genes, we simulated no differential splicing between tissues with the following procedure:
1) For a given gene with n spliceforms, randomly select a gene with the same number of spliceforms from the empirical set.
2) For this empirical gene, randomly select the PSI values from one MGP sample.
3) Assign these PSI values across all samples for the gene in the simulated set.
4) To add inter-sample variability, randomly add or subtract a random number (uniform between 0 and 0.025) to or from the PSI values in each sample, such that the PSI values for the gene/sample still sum to 1.

These estimated gene expression counts and PSI values, for both the empirical set and the remaining set of genes, served as input into the BEERS simulator (16). For the idealized data, we used a uniform distribution for read coverage, with no intronic signal, and no sequencing errors, substitutions, or indels (parameters: -strandspecific -error 0 -subfreq 0 -indelfreq 0 -intronfreq 0.05 -fraglength 100,250,500). For the realistic data, we used a 3' biased distribution for read coverage that was inferred empirically from previous data (35). We also added 5% intronic signal, and used a sequencing error rate of 0.5%, a substitution frequency of 0.1%, and an indel frequency of 0.01% (parameters: -strandspecific -error 0.005 -subfreq 0.001 -indelfreq 0.0001 -intronfreq 0.05 -fraglength 100,250,500). Lastly, we did not simulate novel (unannotated) splicing events in either dataset (parameter: -palt 0).

RNA-Seq analysis

The two simulated RNA-Seq datasets were aligned to both the GRCm38 build of the mouse genome and transcriptome with STAR-v2.7.6a (17).
For all transcript models we used release 75 of the Ensembl GRCm38 annotation. The breakdown of the annotation by number of spliceforms is given in S7 Fig. The raw read counts were quantified at the transcript level using the following methods: the pseudo-aligners kallisto-v0.46.2 (19) and salmon-v1.4.0 (18); the naïve read proportioning approach (NRP: http://bioinf.itmat.upenn.edu/BEERS/bp3/), based on transcriptome alignment; and the genome alignment based methods RSEM (6), Cuffdiff (Cufflinks-v2.2.1) (4,36), HTSeq-v0.12.3 (3), and featureCounts (Subread-v2.0.1) (5). EBSeq-v1.30.0 (41) was used for differential analysis, both between hippocampus and liver and between estimated and true transcript counts. All visualizations were done with R-v3.6.1 packages (37). The command line parameters used for each tool are in S7 Table.

Differential Expression Analysis

Transcript-level differential expression was assessed via three methods. DESeq2-v3.12 and EBSeq-v1.30.0 were run on the inferred quantified values from all quantification methods. In addition, Sleuth-v0.30.0 was run on the quantifications from Salmon and kallisto, using 50 bootstrap samples and the Wasabi package (https://github.com/COMBINE-lab/wasabi) to convert Salmon output to the Sleuth input format. All methods were run on the realistic simulated data, comparing the five hippocampus samples to the six liver samples, and on the real samples, comparing six hippocampus to six liver samples. For the simulated data, we also ran DESeq2 and EBSeq on the true quantified values for comparison with the inferred quantifications. EBSeq was configured to perform two-condition isoform-level DE with the recommended uncertainty groups of genes with 1, 2, or 3 or more transcripts. The maxround parameter was set to 25. Since EBSeq is a Bayesian method, we used the reported posterior probability of equivalent expression as the q-value of the transcript being DE (41).
Since EBSeq yields many transcripts with q=0, we broke ties by using the logFC from the quantified values when ranking genes by q-value.

Description of the seven quantification methods

Kallisto is a pseudo-aligner which uses a hash-based approach to assemble compatibility classes of transcripts for every read by mapping the read’s k-mers, using the transcriptome k-mer de Bruijn graphs (19). It requires few computing resources and has a fast runtime. The index was built from the transcript sequences, and transcript abundances were quantified via pseudo-alignment using the index. The count estimates in the est_counts column were used in our analyses. Fifty bootstrap runs were performed for DE analysis by Sleuth.

Salmon is a pseudo-aligner which also accounts for various biases in the data (GC content, read start sequence bias, and position-specific fragment start location bias such as a 5’ or 3’ bias) (18). Like kallisto, it has a fast runtime and low resource requirements. The index was built from the transcript sequences, with the entire genome provided as decoy sequences. The NumReads estimate was used in our analysis. Fifty bootstrap runs were performed for DE analysis by Sleuth.
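As a minimal sketch of how the two count columns named above can be extracted, the following assumes the usual tab-separated output layouts: a kallisto table with columns target_id, length, eff_length, est_counts, tpm, and a Salmon table with columns Name, Length, EffectiveLength, TPM, NumReads. The helper names and the toy rows are hypothetical and are not taken from the study.

```python
import csv
import io


def read_counts(handle, id_column, count_column):
    """Map transcript id -> estimated count from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: float(row[count_column]) for row in reader}


def read_kallisto(handle):
    # kallisto reports its count estimates in the est_counts column.
    return read_counts(handle, "target_id", "est_counts")


def read_salmon(handle):
    # Salmon reports its count estimates in the NumReads column.
    return read_counts(handle, "Name", "NumReads")


# Toy, made-up tables held in memory instead of real output files.
kallisto_text = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "txA\t1000\t850\t12.5\t3.1\n"
    "txB\t2000\t1850\t0\t0\n"
)
salmon_text = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "txA\t1000\t850.0\t3.0\t12.0\n"
    "txB\t2000\t1850.0\t0.0\t0.0\n"
)
kallisto_counts = read_kallisto(io.StringIO(kallisto_text))
salmon_counts = read_salmon(io.StringIO(salmon_text))
```

Reading both tools through one helper keeps the per-transcript dictionaries directly comparable, which is what a method-versus-method comparison needs.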
RSEM is a gene/isoform abundance tool for RNA-Seq data which uses a generative model for the RNA-Seq read sequencing process, with parameters given by the expression level of each isoform (6,38). A set of reference transcript sequences was built using the rsem-prepare-reference script, based on the GRCm38 Ensemblv75 reference genome and the corresponding transcript annotation file. The isoform abundances were then estimated using rsem-calculate-expression. For our analysis, we use the expected_count value in the isoform output file, which contains the sum (taken over all reads) of the posterior probability that each read comes from the isoform.

To prepare input for Cufflinks, HTSeq and featureCounts, the real and simulated data were aligned to a STAR genome index built with the GRCm38 Ensemblv75 transcript annotation file.

Cuffdiff2 (36) is an algorithm of the Cufflinks suite (4) which estimates expression at the transcript level and controls for variability across replicates. Because of alternative splicing in higher eukaryotes, isoforms of most genes share large numbers of exonic sequences, which leads to ambiguous mapping of reads at the transcript level. Cuffdiff2 first estimates the transcript-level fragment counts and then updates the estimate using a measure of uncertainty which captures the confidence that a given fragment is correctly assigned to the transcript that generated it (39). We provided the sorted aligned files and the appropriate annotation file to cuffdiff2 and used the isoforms.count_tracking file it generated.

For HTSeq (3), htseq-count was used to estimate isoform-level abundances from the alignments. We used the recommended default mode, which discards any ambiguously mapped reads and is hence conservative in its estimates. The HTSeq documentation suggests that one should expect sub-optimal results when it is used for transcript-level estimates and recommends performing exon-level analysis instead (using DEXSeq).
Nevertheless, we use it for transcript-level fragment count estimates in order to quantify its underperformance relative to the other methods.

featureCounts (5) is a read count program that quantifies RNA-Seq (or DNA-Seq) reads in terms of any type of genomic feature (such as gene, transcript, or exon). It is very similar to htseq-count, with the main differences being efficient memory management and low runtime.

As a baseline comparison, we considered a Naïve Read Proportioning (NRP) approach. This is essentially the method described by Mortazavi et al (40) but without normalizing by transcript length. NRP uses a transcriptome alignment (provided by STAR in this case) and, in the first pass, computes the number of reads mapping unambiguously to each transcript. To deal with ambiguous mappers, it then takes a second pass over the alignment file. If a read maps ambiguously to a set of transcripts $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ and $c_1, c_2, \ldots, c_n$ are the respective fragment counts from unambiguous mappers in the first pass, it increments the fragment count of $T_i$ by $\frac{c_i}{c_1 + \cdots + c_n}$. If all of the $c_i$ are 0, that is, none of the transcripts in $\mathcal{T}$ have any reads mapping unambiguously to them, we increment the fragment count of $T_i$ by $\frac{l_i}{l_1 + \cdots + l_n}$, where $l_i$ is the length of transcript $T_i$.
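The two-pass NRP procedure described above can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions, not the NRP code used in the study: each fragment is represented only by the set of transcript ids it aligns to, and the function name, input format, and toy data are hypothetical.

```python
def naive_read_proportioning(alignments, lengths):
    """Two-pass naive read proportioning.

    alignments: list of sets; each set holds the transcript ids that one
                fragment maps to (assumed input format).
    lengths:    dict mapping transcript id -> transcript length.
    Returns a dict mapping transcript id -> fragment count.
    """
    # First pass: count fragments that map unambiguously to one transcript.
    unambiguous = {t: 0 for t in lengths}
    for hits in alignments:
        if len(hits) == 1:
            unambiguous[next(iter(hits))] += 1

    counts = {t: float(c) for t, c in unambiguous.items()}

    # Second pass: split each ambiguous fragment among its transcripts.
    for hits in alignments:
        if len(hits) < 2:
            continue
        # Weight by first-pass counts c_i, as in c_i / (c_1 + ... + c_n).
        weights = {t: unambiguous[t] for t in hits}
        total = sum(weights.values())
        if total == 0:
            # No unambiguous evidence at all: weight by transcript length.
            weights = {t: lengths[t] for t in hits}
            total = sum(weights.values())
        for t, w in weights.items():
            counts[t] += w / total
    return counts


# Toy example with made-up transcripts and fragments.
lengths = {"T1": 1000, "T2": 3000, "T3": 500, "T4": 1500}
alignments = [{"T1"}, {"T1"}, {"T1"}, {"T2"}, {"T1", "T2"}, {"T3", "T4"}]
counts = naive_read_proportioning(alignments, lengths)
```

In the toy example the fragment shared by T1 and T2 is split 3:1 according to their unambiguous counts, while the fragment shared by T3 and T4 falls back to the length ratio 500:1500. Each fragment contributes a total of exactly one count, so the counts sum to the number of fragments.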
Statistical analysis

As a measure of the accuracy of each method, we compute the absolute value of the log2 fold-change (fold-change after adjusting numerator and denominator by a pseudocount of 1) of the estimated counts relative to the known simulated true counts. For example, if x is the true count and y is the estimated count for a particular method, we calculate the quantity $\left|\log_2 \frac{y+1}{x+1}\right|$ for each transcript. The closer the logFC is to 0, the more accurate the method is for that transcript. In order to better represent the distribution of the |logFC| values for each method, we plot (for the set of expressed isoforms) the value of |logFC| corresponding to every tenth percentile starting from 0. If the method has high accuracy, we expect the graph to be close to 0. Thus, if the graph for method A is higher than that for method B, we conclude that method B is more accurate.

Moreover, we identify the genomic properties of the data that affect the accuracy of the methods. For each method, we identified the most discordant transcripts by sorting on |logFC|. Using the Ensembl annotation and genome sequence for GRCm38, we created a database of transcript properties (such as number of isoforms, hexamer entropy, transcript length, compression complexity* (32), exon count, etc.) and their global distributions across the transcriptome.
Then for the lists of discordant transcripts, we computed the Kolmogorov-Smirnov two-sample test p-values for each transcript property, followed by Bonferroni correction for multiple testing, to identify the properties that exhibit significant deviation from the global distribution.

* Transcript sequence compression complexity is a metric that captures the amount of lossless compression of the transcript sequence. The higher the sequence complexity, the lower the compression, which implies higher transcript sequence compression complexity.

List of abbreviations

logFC: log2 fold change
DE analysis: differential expression analysis
DE transcripts: differentially expressed transcripts
NRP: naïve read proportioning approach

Declarations

Acknowledgements

We thank the High Performance Computing at Penn Medicine (PMACS HPC), funded by NIH grant 1S10OD012312, for the cluster computing support.

Availability of data and materials

All raw and processed RNA-Seq data used in this study are available at Array Express under accession number E-MTAB-599. All simulated data generated in this study are available at http://bioinf.itmat.upenn.edu/BEERS/bp3/.

Additional Files

Supplemental materials are in Sarantopoulou_FLIquant_supplemental_material.pdf

Authors’ contributions

GG, DS, and SN conceived of and designed the study. DS, TB and SN performed all computational analysis and visualization. NL produced all RNA-Seq simulated data. All authors contributed to discussions and running the algorithms. DS, TB, SN, and GG wrote the manuscript.
All authors read and approved the manuscript.

References

1. Kahles A, Lehmann K-V, Toussaint NC, Hüser M, Stark SG, Sachsenberg T, et al. Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. Cancer Cell. 2018 Aug 13;34(2):211–24.e6.
2. Cooper TA, Wan L, Dreyfuss G. RNA and disease. Cell. 2009 Feb 20;136(4):777–93.
3. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015 Jan 15;31(2):166–9.
4. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511–5.
5. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014 Apr 1;30(7):923–30.
6. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.
7. Norton SS, Vaquero-Garcia J, Lahens NF, Grant GR, Barash Y. Outlier detection for improved differential splicing quantification from RNA-Seq experiments with replicates. Bioinformatics. 2018 May 1;34(9):1488–97.
8. Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics. 2014 May 9;15:135.
9. Chandramohan R, Wu P-Y, Phan JH, Wang MD. Benchmarking RNA-Seq quantification tools. Conf Proc IEEE Eng Med Biol Soc. 2013;2013:647–50.
10. Zhang C, Zhang B, Lin L-L, Zhao S. Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics. 2017 Aug 7;18(1):583.
11. Hayer KE, Pizarro A, Lahens NF, Hogenesch JB, Grant GR. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics. 2015 Dec 15;31(24):3938–45.
12. Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol. 2015 Jul 23;16:150.
13. Teng M, Love M, Davis CA, Djebali S, Dobin A, Graveley BR, et al. A benchmark for RNA-Seq quantification pipelines. Genome Biol. 2016;17:74.
14. Westoby J, Herrera MS, Ferguson-Smith AC, Hemberg M. Simulation-based benchmarking of isoform quantification in single-cell RNA-seq. Genome Biol. 2018 Nov 7;19(1):191.
15. Merino GA, Conesa A, Fernández EA. A benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies. Brief Bioinform. 2019 Mar 22;20(2):471–81.
16. Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, et al. Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics. 2011 Sep 15;27(18):2518–28.
17. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15–21.
18. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017 Apr;14(4):417–9.
19. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016 May;34(5):525–7.
20. Lateef A, Prabhudas SK, Natarajan P. RNA sequencing and de novo assembly of Solanum trilobatum leaf transcriptome to identify putative transcripts for major metabolic pathways. Sci Rep. 2018 Oct 18;8(1):15375.
21. Hoang TV, Kumar PKR, Sutharzan S, Tsonis PA, Liang C, Robinson ML. Comparative transcriptome analysis of epithelial and fiber cells in newborn mouse lenses with RNA sequencing. Mol Vis. 2014 Nov 4;20:1491–517.
22. Wu KC, Cui JY, Liu J, Lu H, Zhong X-B, Klaassen CD. RNA-Seq provides new insights on the relative mRNA abundance of antioxidant components during mouse liver development. Free Radic Biol Med. 2019 Jan 16;134:335–42.
23.
Del-Aguila JL, Benitez BA, Li Z, Dube U, Mihindukulasuriya KA, Budde JP, et al. TREM2 brain transcript-specific studies in AD and TREM2 mutation carriers. Mol Neurodegener. 2019 May 8;14(1):18. 24. Sharma A, Das S, Kumar V. Transcriptome-wide changes in testes reveal molecular differences in photoperiod-induced seasonal reproductive life-history states in migratory songbirds. Mol Reprod Dev [Internet]. 2019 Apr 25; Available from: http://dx.doi.org/10.1002/mrd.23155 25. Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011 Sep 14;477(7364):289–94. 26. Zaghlool A, Ameur A, Cavelier L, Feuk L. Splicing in the Human Brain [Internet]. International Review of Neurobiology. 2014. p. 95–125. Available from: http://dx.doi.org/10.1016/b978-0-12-801105-8.00005-9 27. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015 Apr;12(4):357–60. 28. Nayak S, Lahens NF, Kim EJ, Ricciotti E, Paschos G, Tishkoff S, et al. ISO-Relevance Functions - A Systematic Approach to Ranking Genomic Features by Differential Effect Size [Internet]. bioRxiv. 2018 [cited 2019 May 17]. p. 381814. Available from: .CC-BY-NC-ND 4.0 International licenseunder a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which wasthis version posted February 11, 2021. 
; https://doi.org/10.1101/698605doi: bioRxiv preprint http://paperpile.com/b/KJIn71/2uHCX http://paperpile.com/b/KJIn71/tEgBs http://paperpile.com/b/KJIn71/tEgBs http://paperpile.com/b/KJIn71/0cQRD http://paperpile.com/b/KJIn71/0cQRD http://paperpile.com/b/KJIn71/0cQRD http://paperpile.com/b/KJIn71/dMkfh http://paperpile.com/b/KJIn71/dMkfh http://paperpile.com/b/KJIn71/dMkfh http://paperpile.com/b/KJIn71/Gklkp http://paperpile.com/b/KJIn71/Gklkp http://paperpile.com/b/KJIn71/Gklkp http://paperpile.com/b/KJIn71/PP9si http://paperpile.com/b/KJIn71/PP9si http://paperpile.com/b/KJIn71/PP9si http://paperpile.com/b/KJIn71/PPXjz http://paperpile.com/b/KJIn71/PPXjz http://paperpile.com/b/KJIn71/PPXjz http://paperpile.com/b/KJIn71/PPXjz http://dx.doi.org/10.1002/mrd.23155 http://paperpile.com/b/KJIn71/2y805 http://paperpile.com/b/KJIn71/2y805 http://paperpile.com/b/KJIn71/2y805 http://paperpile.com/b/KJIn71/o6rcT http://paperpile.com/b/KJIn71/o6rcT http://dx.doi.org/10.1016/b978-0-12-801105-8.00005-9 http://paperpile.com/b/KJIn71/OCNWO http://paperpile.com/b/KJIn71/OCNWO http://paperpile.com/b/KJIn71/t97hQ http://paperpile.com/b/KJIn71/t97hQ http://paperpile.com/b/KJIn71/t97hQ https://doi.org/10.1101/698605 http://creativecommons.org/licenses/by-nc-nd/4.0/ 32 https://www.biorxiv.org/content/10.1101/381814v1.abstract 29. Jaccard P. Nouvelles researches sur la distribution florale. Bulletin de la Société vaudoise des sciences naturelles. Vols. 44, 223-270. 1908. 30. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. 31. Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods. 2017 Jul;14(7):687–90. 32. Lempel A, Ziv J. On the Complexity of Finite Sequences. IEEE Trans Inf Theory. 1976 Jan;22(1):75–81. 33. Frazee AC, Jaffe AE, Langmead B, Leek JT. 
Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics. 2015 Sep 1;31(17):2778–84. 34. Love MI, Hogenesch JB, Irizarry RA. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol. 2016 Dec;34(12):1287–91. 35. Lahens NF, Kavakli IH, Zhang R, Hayer K, Black MB, Dueck H, et al. IVT-seq reveals extreme bias in RNA sequencing. Genome Biol. 2014 Jun 30;15(6):R86. 36. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013 Jan;31(1):46–53. 37. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria [Internet]. 2017; Available from: http://www.R-project.org/ 38. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010 Feb 15;26(4):493–500. 39. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011 Mar 16;12(3):R22. 40. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5: 621-628. doi:10.1038/nmeth.1226 .CC-BY-NC-ND 4.0 International licenseunder a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which wasthis version posted February 11, 2021. 
; https://doi.org/10.1101/698605doi: bioRxiv preprint https://www.biorxiv.org/content/10.1101/381814v1.abstract http://paperpile.com/b/KJIn71/w21SD http://paperpile.com/b/KJIn71/w21SD http://paperpile.com/b/KJIn71/lznVl http://paperpile.com/b/KJIn71/lznVl http://paperpile.com/b/KJIn71/2ytPs http://paperpile.com/b/KJIn71/2ytPs http://paperpile.com/b/KJIn71/tsCR0 http://paperpile.com/b/KJIn71/tsCR0 http://paperpile.com/b/KJIn71/2dRaO http://paperpile.com/b/KJIn71/2dRaO http://paperpile.com/b/KJIn71/loeyd http://paperpile.com/b/KJIn71/loeyd http://paperpile.com/b/KJIn71/loeyd http://paperpile.com/b/KJIn71/Y4x1H http://paperpile.com/b/KJIn71/Y4x1H http://paperpile.com/b/KJIn71/psgqY http://paperpile.com/b/KJIn71/psgqY http://paperpile.com/b/KJIn71/psgqY http://paperpile.com/b/KJIn71/gpyBm http://paperpile.com/b/KJIn71/gpyBm http://www.r-project.org/ http://paperpile.com/b/KJIn71/dF6xd http://paperpile.com/b/KJIn71/dF6xd http://paperpile.com/b/KJIn71/BrvzO http://paperpile.com/b/KJIn71/BrvzO https://doi.org/10.1101/698605 http://creativecommons.org/licenses/by-nc-nd/4.0/ 33 41. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, Volume 29, Issue 8, 15 April 2013, Pages 1035–1043, .CC-BY-NC-ND 4.0 International licenseunder a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which wasthis version posted February 11, 2021. ; https://doi.org/10.1101/698605doi: bioRxiv preprint https://doi.org/10.1101/698605 http://creativecommons.org/licenses/by-nc-nd/4.0/ Sarantopoulou et al, Benchmarking of FLI quantification for RNA-Seq (Supplemental material) - 1 Supplemental Figures S1 Fig. Method effect on full-length isoform quantification using simulated data. 
Average expression of three hippocampus samples, comparing each method to the truth, using A) idealized and B) realistic data. Percentiles of cumulative distribution of |logFC| using C) idealized data, D) realistic data, E-F) idealized and realistic data respectively, where we restricted to the set of genes that have at least 3 expressed isoforms.

S2 Fig. Effect of transcript length on quantification accuracy. Effect of transcript length on quantification accuracy, given by adjusted logFC of the average of the three hippocampus samples, using A) idealized and B) realistic data.

S3 Fig. Differential distribution of transcript compression complexity. For each method the foreground and background distributions are shown for transcript compression complexity. The background is over all isoforms, the foreground is over the top 2,000 discordant transcripts sorted by absolute adjusted log2FC. The foreground distribution is highly enriched for low compression complexity for all methods.

S4 Fig. The distribution of the #genes according to the #annotated isoforms. The distribution of the number of genes for different number of annotated isoforms.
10_1101-727867 ----

scAEspy: a tool for autoencoder-based analysis of single-cell RNA sequencing data

Andrea Tangherloni[∗], E-mail: andrea.tangherloni@unibg.it
Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, CB2 0AW, Cambridge, UK
Department of Haematology, University of Cambridge, CB2 0AW, Cambridge, UK
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, UK
Current address: Department of Human and Social Sciences, University of Bergamo, 24129, Bergamo, Italy

Federico Ricciuti, E-mail: f.ricciuti@campus.unimib.it
Department of Informatics, Systems and Communication, University of Milano-Bicocca, 20126, Milan, Italy

Daniela Besozzi, E-mail: daniela.besozzi@unimib.it
Department of Informatics, Systems and Communication, University of Milano-Bicocca, 20126, Milan, Italy
Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4), 20900, Milan, Italy

Pietro Liò[†],[∗], E-mail: pl219@cam.ac.uk
Department of Computer Science and Technology, University of Cambridge, CB3 0FD, Cambridge, UK

Ana Cvejic[†],[∗], E-mail: as889@cam.ac.uk
Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, CB2 0AW, Cambridge, UK
Department of Haematology, University of Cambridge, CB2 0AW, Cambridge, UK
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, UK

[∗]Corresponding author. [†]These authors contributed equally.

bioRxiv preprint doi: https://doi.org/10.1101/727867; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Background: Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches.

Results: Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data produced by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions.
Conclusions: scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multiple single-cell omics.

Keywords: Autoencoders; scRNA-Seq; Dimensionality reduction; Clustering; Batch correction; Data integration

Background

Single-cell RNA sequencing (scRNA-Seq) was named the “Method of the Year” in 2013, and it is currently used to investigate cell-to-cell heterogeneity, since it allows measuring transcriptome-wide gene expression at single-cell resolution, enabling the identification of different cell-types. scRNA-Seq data are predominantly generated in studies that aim at understanding the molecular processes driving normal development and the onset of pathologies [1, 2]. This field of research continuously poses new computational questions that have to be addressed [3].

One of the most important steps in scRNA-Seq analysis is the clustering of cells into groups that correspond to known or putatively novel cell-types, by considering the expression of common sets of signature genes.
However, this step still remains a challenging task because applying clustering approaches in high-dimensional spaces can generate misleading results, as the distance between most pairs of points is similar [4]. As a consequence, finding an effective and efficient low-dimensional representation of the data is one of the most crucial steps in the downstream analysis of scRNA-Seq data. A common workflow of downstream analysis, depicted in Figure 1, includes two dimensionality reduction steps: (1) Principal Component Analysis (PCA) [5] for an initial reduction of the dimensions based on the Highly Variable Genes (HVGs), and (2) a non-linear dimensionality reduction approach, e.g., t-distributed Stochastic Neighbour Embedding (t-SNE) [6] or Uniform Manifold Approximation and Projection (UMAP) [7, 8], applied to the PCA space for visualisation purposes (e.g., showing the labelled clusters) [9, 10]. In addition, when multiple scRNA-Seq datasets have to be combined for further analyses, the non-negligible technical batch-effects that may exist among the datasets must be taken into account [3, 9, 11–14], making the dimensionality reduction even more complicated and fundamental. Indeed, finding a salient, batch-corrected, and low-dimensional embedding space can help to better partition and distinguish the various cell-types.

Although commonly used approaches for dimensionality reduction achieved good performance when applied to scRNA-Seq data [9], novel and more robust dimensionality reduction strategies should be used to account for the sparsity, intrinsic noise, unexpected dropout, and burst effects [3, 15], as well as the low amounts of RNA that are typically present in single-cells. Ding et al. showed that low-dimensional representations of the original data learned using latent variable models preserve both the local and global neighbour structures of the original data [16].
Autoencoders (AEs) showed outstanding performance in this regard due to their ability to capture the strong non-linearities among the gene interactions existing in the high-dimensional expression space.

Autoencoders for denoising and dimensionality reduction

Deep Count AE network (DCA) was one of the first AE-based approaches proposed to denoise scRNA-Seq datasets [17] by considering the count distribution, overdispersion, and sparsity of the data. DCA relies on a negative binomial noise model, with or without zero-inflation, to capture nonlinear gene-gene dependencies. Starting from the vanilla version of the Variational AE (VAE) [18], several approaches have been proposed. Among them, single-cell Variational Inference (scVI) was the first scalable framework that allowed for a probabilistic representation and analysis of gene expression datasets [19]. scVI was built upon Deep Neural Networks (DNNs) and stochastic optimization to consider the information across similar cells and genes to approximate the distributions underlying the analysed gene expression data. This computational tool allows for coupling low-dimensional probabilistic representation of gene expression data with the downstream analysis to consider the measurement of uncertainty through a statistical model. Svensson et al. integrated a Linearly Decoded VAE (LDVAE) into scVI [20], enabling the identification of relationships among the cell representation coordinates and gene weights via a factor model.
Single-cell VAE (scVAE) was introduced to directly model the raw counts from RNA-seq data [21, 22]. More importantly, the authors proposed a Gaussian-mixture model to better learn biologically plausible groupings of scRNA-Seq data on the latent space.

Decomposition using Hierarchical AE (scDHA) is a hierarchical AE composed of two modules [23]. The first module is a non-negative kernel AE able to provide a non-negative, part-based denoised representation of the original data. During this step, the genes and the components having an insignificant contribution to the denoised representation of the data are removed. The second module is a stacked Bayesian self-learning network built upon the VAE. This specific module is used to project the denoised data into a low-dimensional space used during the downstream analysis. scDHA outperformed PCA, t-SNE, and UMAP in terms of silhouette index [24] on the tested datasets.

AEs coupled with disentanglement methods have been used to both improve the data representation and obtain better separation of the biological factors of variation in gene expression data [25]. In addition, a graph AE, consisting of graph convolutional layers, was developed to predict relationships between single-cells. This framework can be used to identify the cell-types in the dataset under analysis and discover the driver genes for the differentiation process. Wang et al. proposed a deep VAE for scRNA-Seq data named VASC [26], a deep multi-layer generative model that improves the dimensionality reduction and visualisation steps in an unsupervised manner.
Thanks to its ability to model dropout events (which can hinder various downstream analysis steps, e.g., clustering analysis, differential expression analysis, and inference of gene-to-gene relationships, by introducing a high number of zero counts in the expression matrices) and to find nonlinear hierarchical representations of the data, VASC obtained superior performance with respect to four state-of-the-art dimensionality reduction and visualisation approaches [26].

Dimensionality Reduction with Adversarial VAE (DR-A) has been recently proposed to fulfil the dimensionality reduction step from a data-driven point of view [27]. Compared to the previous approaches, DR-A exploits an adversarial VAE-based framework, which is a recent variant of generative adversarial networks. DR-A generally obtained a more accurate low-dimensional representation of scRNA-Seq data compared to state-of-the-art approaches (e.g., PCA, scVI, t-SNE, UMAP), leading to better clustering performance. Geddes et al. proposed an AE-based cluster ensemble framework to improve the clustering step [28]. As a first step, random subspace projections of the data are compressed onto a low-dimensional space by exploiting an AE, obtaining different encoded spaces. Then, an ensemble clustering approach is applied across all the encoded spaces to generate a more accurate clustering of the cells.

Autoencoders for the imputation of missing data

AutoImpute was proposed to deal with the insufficient quantities of starting RNA in the individual cells, a problem that generally leads to significant dropout events. As a consequence, the resulting gene expression matrices are sparse and contain a high number of zero counts.
AutoImpute is an AE-based imputation method that works on sparse gene expression matrices, trying to learn the inherent distribution of the input data to assign the missing values [29]. scSVA was also proposed to identify and recover dropout events [30], which are imputed by fitting a mixed model of each possible cell-type. In addition, it performs an efficient feature extraction step of the high-dimensional scRNA-Seq data, obtaining a low-dimensional embedding. In the tests shown by the authors, scSVA was able to outperform different state-of-the-art and novel approaches (e.g., PCA, t-SNE, UMAP, VASC).

Two other methods based on nonparametric AEs were proposed to address the imputation problem [31]. Learning with AuToEncoder (LATE) relies on an AE that is directly trained on a gene expression matrix with randomly generated parameters, while TRANSfer learning with LATE (TRANSLATE) takes into consideration a reference gene expression dataset to estimate the parameters that are then used by LATE on the new gene expression matrix. LATE and TRANSLATE were able to obtain outstanding performance on both real and simulated data by recovering nonlinear relationships in pairs of genes, allowing for a better identification and separation of the cell-types.

GraphSCI combines a graph convolutional network and an AE to systematically integrate gene-to-gene relationships with the gene expression data. It is the first approach that integrates gene-to-gene relationships into a deep learning framework.
GraphSCI is able to impute the dropout events by taking advantage of low-dimensional representations of similar cells and gene-gene interactions [32].

In the existing AEs, the input data are usually encoded in a specific format, making their integration into the existing scRNA-Seq analysis toolkits (e.g., Scanpy [33] and Seurat [34]) a difficult task. In addition, the existing tools are implemented in Keras[1], TensorFlow [35] or PyTorch [36], so all three libraries are required to run them. Finally, the currently available AEs cannot be directly exploited to obtain the latent space or to generate synthetic cells.

In order to overcome the described limitations, we developed scAEspy, which is a unifying, user-friendly, and standalone tool that relies only on TensorFlow and allows easy access to different AEs by setting up only two user-defined parameters. scAEspy can be used on High-Performance Computing (HPC) infrastructures to speed up its execution. It can be easily run on clusters of both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Indeed, it was designed and developed to be executed on multi- and many-core infrastructures. In addition, scAEspy gives access to the latent space, generated by the trained AE, which can be directly used to show the cells in this embedded space or as a starting point for other dimensionality reduction approaches (e.g., t-SNE and UMAP) as well as downstream analyses (e.g., batch-effect removal).

In this work, we show how scAEspy can be used to deal with the existing batch-effects among samples. Indeed, the application of batch-effect removal tools to the latent space allowed us to outperform state-of-the-art methods as well as the same batch-effect removal tools applied on the PCA space. Finally, scAEspy implements different loss functions, which are fundamental to deal with different sequencing platforms.
[1] https://github.com/fchollet/keras

Results

We tested PCA and AEs to address the integration of different datasets. Specifically, we used all the AEs implemented in our scAEspy tool: VAE [18], an AE only based on the Maximum Mean Discrepancy (MMD) distance (called here MMDAE) [37], MMDVAE, Gaussian-mixture VAE (GMVAE), and two novel Gaussian-mixture AEs that we developed, called GMMMD and GMMMDVAE, respectively. In all the performed tests, the constrained versions of the following loss functions were used: Negative Binomial (NB), Poisson, zero-inflated NB (ZINB), zero-inflated Poisson (ZIP). We used a number of Gaussian distributions equal to the number of datasets to integrate for GMVAE, GMMMD, and GMMMDVAE. In addition, we tested the following configurations of hidden layer and latent space to understand how the dimension of the AEs might potentially affect the performance: (256,64), (256,32), (256,16), (128,64), (128,32), (128,16), (64,32), and (64,16); where (H,L) represents the number of neurons composing the hidden layer (H neurons) and latent space (L neurons).

In order to deal with the possible batch-effects, we applied the following approaches, as suggested in [9, 12], since they are the most used batch-effect removal tools in the literature: Batch Balanced k-Nearest Neighbours (BBKNN) [38, 39], Harmony [40], ComBat [41–43], and the Seurat implementation of the Canonical Correlation Analysis (CCA) [14].
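The four count-based losses named above (Poisson, NB, ZIP, ZINB) can each be read as a negative log-likelihood of the observed counts. The block below is a minimal sketch of that idea and is not scAEspy's implementation: the mean and inverse-dispersion parameterisation (`mu`, `theta`), the structural-zero probability `pi`, and all function names are assumptions introduced here for illustration, and the "constrained" variants used in the tests are not reproduced.

```python
from math import exp, lgamma, log


def poisson_log_pmf(x, mu):
    """Log-probability of count x under a Poisson with mean mu > 0."""
    return x * log(mu) - mu - lgamma(x + 1)


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial.

    Assumed parameterisation: mean mu > 0 and inverse dispersion theta > 0,
    so that the variance is mu + mu**2 / theta.
    """
    return (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * log(theta / (theta + mu))
            + x * log(mu / (theta + mu)))


def zero_inflated_log_pmf(x, pi, base_log_pmf):
    """Zero-inflated version of a count model.

    pi (0 <= pi < 1) is the probability of a structural zero (dropout);
    base_log_pmf(k) is the log-probability of the underlying count model.
    """
    if x == 0:
        return log(pi + (1.0 - pi) * exp(base_log_pmf(0)))
    return log(1.0 - pi) + base_log_pmf(x)


def negative_log_likelihood(counts, log_pmf):
    """Loss value: minus the summed log-probability of all counts."""
    return -sum(log_pmf(x) for x in counts)


# Toy counts for one gene across six cells (made-up values).
counts = [0, 0, 3, 1, 0, 7]
nb = lambda k: nb_log_pmf(k, mu=2.0, theta=1.5)
loss_nb = negative_log_likelihood(counts, nb)
loss_zinb = negative_log_likelihood(
    counts, lambda k: zero_inflated_log_pmf(k, 0.3, nb))
```

In an AE, `mu`, `theta`, and `pi` would be produced per gene and per cell by the decoder rather than fixed by hand; the zero-inflated forms add one extra parameter to absorb the excess zeros typical of dropout.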
Thus, we compared vanilla PCA and AEs, PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. The proposed strategies were compared on three publicly available datasets, namely Peripheral Blood Mononuclear Cells (PBMCs), Pancreatic Islet Cells (PICs), and Mouse Cell Atlas (MCA), by using well-known clustering metrics (i.e., Adjusted Rand Index, Adjusted Mutual Information Index, Fowlkes Mallows Index, Homogeneity Score, and V-Measure). It is worth mentioning that the cell-types are generally identified manually by expert biologists, starting from an over- or under-clustering of the data, possibly followed by several steps of sub-clustering of some clusters. Here, we evaluate how well the different strategies automatically separate the cells when the number of clusters is fixed to the number of cell-types manually identified by the authors of the original papers.

Datasets

Peripheral Blood Mononuclear Cells
PBMCs from eight patients with systemic lupus erythematosus were collected and processed using the 10× Chromium Genomics platform [44]. The dataset is composed of a control group (6573 cells) and an interferon-β stimulated group (7466 cells). We considered the 8 distinct cell-types identified by the authors following a standard workflow [44]. The count matrices were downloaded from Seurat’s tutorial “Integrating stimulated vs. control PBMC datasets to learn cell-type specific responses” [2].

[2] https://satijalab.org/seurat/v3.0/immune_alignment.html

Pancreatic islet cells
PIC datasets were generated independently using four different platforms: CEL-Seq [45] (1004 cells), CEL-Seq2 [46] (2285 cells), Fluidigm C1 [47] (638 cells), and Smart-Seq2 [48] (2394 cells). For our tests, we considered the 13 different cell-types across the datasets identified in [49] by applying PCA on the scaled integrated data matrix. The count matrices were downloaded from Seurat’s tutorial “Integration and Label Transfer” [3].

Mouse cell atlas
MCA is composed of two different datasets. The former was generated by Han et al. [50] using Microwell-Seq (4239 cells), while the latter was generated by the Tabula Muris Consortium [51] using Smart-Seq2 (2715 cells). As in [12], we took into account the 11 distinct cell-types with the highest number of cells that were present in both datasets. The count matrices were downloaded from the public GitHub repository related to [12] [4].

Metrics

Adjusted Rand Index
The Rand Index (RI) is a similarity measure between the results obtained from the application of two different clustering methods. The first clustering is used as ground truth (i.e., true clusters), while the second one has to be evaluated (i.e., predicted clusters). RI is calculated by considering all pairs of samples appearing in the clusters: it counts the pairs that are assigned either to the same cluster or to different clusters in both the predicted and the true clusterings. The Adjusted RI (ARI) [52] is the “adjusted for chance” version of RI. Its values vary in the range [−1,1]: a value close to 0 means a random assignment, independently of the number of clusters, while 1 indicates that the clusters obtained with both clustering approaches are identical. Negative values are obtained if the index is less than the expected index.
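As an illustration of the ARI definition above, the following minimal sketch uses scikit-learn's `adjusted_rand_score`. The label vectors are hypothetical toy values, not data from this study, and the use of scikit-learn here is an assumption made for the example.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels: manually annotated cell-types (ground truth)
# and two candidate clusterings to be evaluated against them.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
same_partition = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same grouping, different cluster names
one_mistake = [0, 0, 1, 1, 1, 1, 2, 2, 2]     # one cell assigned to the wrong cluster

# ARI only depends on how the cells are grouped, not on the cluster names.
ari_identical = adjusted_rand_score(true_labels, same_partition)
ari_mistake = adjusted_rand_score(true_labels, one_mistake)

print(f"ARI, identical partitions: {100 * ari_identical:.2f}%")
print(f"ARI, one misassigned cell: {100 * ari_mistake:.2f}%")
```

Renaming the clusters leaves the index unchanged, which is why the predicted clusters never need to be matched to cell-type names before scoring.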
Adjusted Mutual Information Index
The Mutual Information Index (MII) [53] represents the mutual information of two random variables, which is a measure of the mutual dependence between the two variables. Specifically, it quantifies the amount of information that can be gained about one random variable by observing the other. MII is closely related to the entropy of a random variable, which quantifies the expected “amount of information” that is contained in a random variable. This index is used to measure the similarity between two labellings of the same data. Similarly to ARI, the Adjusted MII (AMII) is “adjusted for chance” and its values vary in the range [0,1].

Fowlkes Mallows Index
The Fowlkes Mallows Index (FMI) [54] measures the similarity between the clusters obtained by using two different clustering approaches. It is defined as the geometric mean of precision and recall. Assuming that the first clustering approach is the ground truth, the precision is the percentage of the results that are relevant, while the recall is the percentage of the total relevant results that are correctly assigned by the second clustering approach. The index ranges from 0 to 1.

[3] https://satijalab.org/seurat/v3.0/integration.html
[4] https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking
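To make the AMII and FMI definitions concrete, the sketch below computes both with scikit-learn and, for FMI, also recomputes the geometric mean of pairwise precision and recall by hand. The labels are hypothetical toy values, and the helper `pairwise_fmi` is introduced here only for illustration.

```python
from itertools import combinations
from math import sqrt

from sklearn.metrics import adjusted_mutual_info_score, fowlkes_mallows_score

# Hypothetical toy labels (not data from this study).
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]


def pairwise_fmi(true, pred):
    """FMI as the geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true)), 2):
        same_true = true[i] == true[j]
        same_pred = pred[i] == pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both clusterings
        elif same_pred:
            fp += 1  # pair grouped together only in the prediction
        elif same_true:
            fn += 1  # pair grouped together only in the ground truth
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)


fmi_manual = pairwise_fmi(true_labels, pred_labels)
fmi_sklearn = fowlkes_mallows_score(true_labels, pred_labels)
amii = adjusted_mutual_info_score(true_labels, pred_labels)
amii_identical = adjusted_mutual_info_score(true_labels, true_labels)

print(f"FMI (manual)  : {100 * fmi_manual:.2f}%")
print(f"FMI (sklearn) : {100 * fmi_sklearn:.2f}%")
print(f"AMII          : {100 * amii:.2f}%")
```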
Homogeneity Score
The result of the tested clustering approach satisfies the Homogeneity Score (HS) [55] if all of its clusters contain only cells that are members of a single cell-type. Its values range from 0 to 1, where 1 indicates perfectly homogeneous labelling. Notice that by switching the true cluster labels with the predicted cluster labels, the Completeness Score is obtained.

Completeness Score
The result of the tested clustering approach satisfies the Completeness Score (CS) [55] if all the cells that are members of a given cell-type are elements of the same cluster. Its values range from 0 to 1, where 1 indicates perfectly complete labelling. Notice that by switching the true cluster labels with the predicted cluster labels, the HS is obtained.

V-Measure
The V-Measure (VM) [56] is the harmonic mean of HS and CS; it is equivalent to the MII normalised by the arithmetic mean of the entropies of the two labellings.

Integration of multiple datasets obtained with the same sequencing platforms

Various scRNA-Seq platforms are currently available (e.g., droplet-based and plate-based [57–66]), and integrating the data they produce is often challenging due to the differences in biological sample batches as well as in the experimental platforms used. To test whether AEs can be effectively applied to combine multiple datasets generated using the same platform but under different experimental conditions, we used the PBMC datasets.

We merged the control and treated datasets by using vanilla PCA and AEs, PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. After the construction of the neighbourhood graphs, we performed a clustering step by using the Leiden algorithm [67]. Since 8 different cell-types were manually identified in the original paper [44], we selected the Leiden resolutions that allowed us to obtain 8 distinct clusters and calculated all the metrics described above. In what follows, the calculated values of all metrics are given in percentages.
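The relationship between HS, CS, and VM described above, together with the percentage convention, can be sketched with scikit-learn's `homogeneity_completeness_v_measure`. The labels below are a hypothetical over-clustering example, not data from this study.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical toy example: two cell-types, each split into two clusters
# (an over-clustering), so every cluster is pure but no cell-type is complete.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2, 3, 3]

hs, cs, vm = homogeneity_completeness_v_measure(true_labels, pred_labels)

# Swapping the roles of true and predicted labels swaps HS and CS.
hs_swapped, cs_swapped, _ = homogeneity_completeness_v_measure(pred_labels, true_labels)

# Report the values as percentages.
for name, value in (("HS", hs), ("CS", cs), ("VM", vm)):
    print(f"{name}: {100 * value:.2f}%")
```

In this example each cluster contains a single cell-type, so the homogeneity is maximal while the completeness is penalised, and the VM falls between the two.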
For each metric, the higher the value the better the result.

Our analysis showed that the CCA-based approach proposed in the Seurat library achieved a mean ARI of 73.49% (standard deviation ±1.52%), ComBat reached a mean ARI of 72.84% (±0.78%), vanilla PCA had a mean ARI of 68.90% (±0.86%), PCA followed by BBKNN obtained a mean ARI of 83.65% (±0.81%), and PCA followed by Harmony reached a mean ARI of 82.83% (±1.20%), as shown in Figure 3A. Among all the tested AEs, MMDAE followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) achieved the best results, with a mean ARI equal to 87.18% (±0.49%). To assess whether the results obtained by the best AE were different from a statistical point of view, we applied the Mann–Whitney U test with the Bonferroni correction [68–70]. In all the comparisons, MMDAE followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different from those achieved by the other approaches.

Regarding the AMII, CCA had a mean value of 66.46% (±0.50%), ComBat achieved a mean value of 70.95% (±0.82%), vanilla PCA obtained a mean value of 68.44% (±1.00%), PCA followed by BBKNN reached a mean value of 75.22% (±0.76%), and PCA followed by Harmony reached a mean value of 74.55% (±1.19%). MMDAE followed by Harmony had better results, with a mean value equal to 78.61% (±0.29%).
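The statistical comparison mentioned above (Mann–Whitney U test with the Bonferroni correction) can be sketched with SciPy as follows. This is only an illustrative sketch: the score lists are synthetic placeholder values invented for the example, not the runs behind the reported results, and the method names are hypothetical.

```python
from scipy.stats import mannwhitneyu

# Synthetic placeholder ARI values (one per repeated run); NOT the paper's data.
best_method = [0.871, 0.874, 0.869, 0.876, 0.872, 0.868, 0.875, 0.870, 0.873, 0.877]
other_methods = {
    "method_a": [0.735, 0.741, 0.729, 0.752, 0.738, 0.744, 0.731, 0.747, 0.736, 0.750],
    "method_b": [0.836, 0.829, 0.841, 0.833, 0.845, 0.827, 0.838, 0.831, 0.843, 0.835],
}

n_comparisons = len(other_methods)
adjusted_p_values = {}
for name, scores in other_methods.items():
    # Two-sided test of whether the two samples come from the same distribution.
    result = mannwhitneyu(best_method, scores, alternative="two-sided")
    # Bonferroni correction: multiply by the number of comparisons, cap at 1.
    adjusted_p_values[name] = min(1.0, result.pvalue * n_comparisons)

for name, p_value in adjusted_p_values.items():
    print(f"best vs {name}: Bonferroni-adjusted p = {p_value:.2e}")
```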
MMDAE followed by Harmony also outperformed the other strategies in terms of FMI, HS, CS, and VM (see Additional file 2 and Figure 4).

We also compared the results obtained by the best AE for each of the tested dimensions (H,L) in terms of ARI (Figure 3B). GMMMD followed by Harmony (using the NB loss function) obtained the best results for the dimension (64,16), GMMMD followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128,16) and (256,64), and GMMMD followed by BBKNN (using the NB loss function) achieved the best results for the dimension (128,32). MMDAE followed by Harmony (using the NB loss function) reached the best results for the dimensions (64,32), (256,16), and (256,32), while MMDAE followed by Harmony (using the Poisson loss function) obtained the best result for the dimension (128,64). Notice that we used two Gaussian distributions because we merged two different datasets.

To visually assess the quality of the separation of the manually annotated cell-types and of the found clusters, we plotted them in the UMAP space generated from the MMDAE followed by Harmony space (Figures 3C and D). Finally, we also plotted the two samples in the same UMAP space to visually assess the quality of the alignment between the two samples themselves (Figure 5A). This plot confirms that the batch-effects were completely removed.

Our analysis showed that clustering the neighbourhood graph generated from AE spaces allowed for a better identification of the existing cell-types when compared to the other approaches, thus confirming the ARI results.

Integration of multiple datasets obtained with different sequencing platforms

Combining datasets from different studies and scRNA-Seq platforms can be a powerful approach to obtain complete information about the biological system under investigation.
However, when datasets generated with different platforms are combined, the high variability in the gene expression matrices can obscure the existing biological relationships. For example, the gene expression values are much higher in data acquired with plate-based methods (i.e., up to millions) than in those acquired with droplet-based methods (i.e., a few thousand). Thus, combining gene expression data that span several orders of magnitude is a difficult task that cannot be tackled by using linear approaches like PCA. To examine how well AEs perform in resolving this task, we combined four PIC datasets acquired with the CEL-Seq [59], CEL-Seq2 [60], Fluidigm C1 [66], and Smart-Seq2 [65] protocols.

We integrated the datasets by using vanilla PCA and AEs, PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. Since 13 cell-types were manually annotated for the PIC datasets in the original paper [49], we clustered the neighbourhood graphs using the Leiden algorithm, considering only the resolutions that allowed us to obtain 13 distinct clusters. We then calculated the ARI, AMII, FMI, HS, CS, and VM metrics. The calculated values of all metrics are given in percentages; for each metric, the higher the value the better the result.
CCA had a very low mean ARI, i.e., 5.45% (±0.22%), ComBat obtained a mean ARI of 76.20% (±3.06%), vanilla PCA achieved a mean ARI of 61.38% (±0.09%), PCA followed by BBKNN reached a mean ARI of 71.49% (±0.58%), and PCA followed by Harmony obtained a mean ARI of 94.00% (±0.36%), see Figure 6A. GMMMD followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) outperformed the other AEs, achieving a mean ARI equal to 94.23% (±0.12%). In all the comparisons, except for the one against PCA followed by Harmony, GMMMD followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different from those obtained by the other approaches.

Similar results were achieved for the AMII metric: CCA reached a mean value equal to 16.57% (±0.33%), ComBat obtained a mean value of 76.11% (±1.14%), vanilla PCA reached a mean value of 71.70% (±0.11%), PCA followed by BBKNN achieved a mean value of 77.55% (±0.65%), and PCA followed by Harmony a mean value of 91.17% (±0.47%), while GMMMD followed by Harmony reached a mean value equal to 89.37% (±0.02%). Considering the other measures, both PCA and GMMMD followed by Harmony obtained very similar results, outperforming the other strategies (see Additional file 3 and Figure 7).

Considering the best AE for each of the tested dimensions (H,L) in terms of ARI (see Figure 6B), GMMMD followed by Harmony (using the NB loss function) was the best choice for the dimensions (128,64) and (256,32), while it obtained the best results for the dimension (64,32) when the Poisson loss function was used. GMVAE followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128,32) and (256,16). MMDAE followed by Harmony achieved the best results for the dimensions (256,64) and (64,16), exploiting the NB loss function and the Poisson loss function, respectively.
Finally, VAE followed by Harmony obtained the best results with the Poisson loss function for the dimension (128,16). Note that we exploited four Gaussian distributions because we merged four different datasets.

The quality of the separation of the manually annotated cell-types and of the found clusters can be visually evaluated in Figures 6C and D. Finally, we visualised the cells (coloured by platform) using the UMAP space generated from the GMMMD followed by Harmony space (Figure 5B) to confirm that the batch-effects among the samples sequenced with different platforms were correctly removed.

Taken together, our analysis shows that GMMMD followed by Harmony can efficiently identify the “shared” cell-types across the different platforms due to its ability to deal with the high variability in the gene expression matrices. We would like to highlight that PCA followed by Harmony was capable of achieving good results because the original clusters were obtained by applying a similar pipeline [49].

As a final test, we combined two MCA datasets acquired with the Microwell-Seq [50] and Smart-Seq2 [65] protocols. We integrated the datasets in the same way as in the other two tests. We clustered the neighbourhood graphs using the Leiden algorithm, considering only the resolutions that allowed us to obtain 11 distinct clusters, because 11 distinct cell-types were manually annotated for the MCA datasets [50]. We then calculated all the metrics.
In this case, MMDVAE followed by Harmony (using the Poisson loss function, 256 neurons for the hidden layer, and 64 neurons for the latent space) outperformed the other AEs as well as the other strategies, obtaining a mean ARI equal to 79.50% (±0.02%), as shown in Figure 8A. ComBat achieved the worst mean ARI, i.e., 54.13% (±4.22%), CCA reached a mean ARI of 57.62% (±0.75%), vanilla PCA obtained a mean ARI of 67.29% (±11.55%), PCA followed by BBKNN had a similar mean ARI, that is, 67.73% (±3.98%), and PCA followed by Harmony achieved a mean ARI of 66.08% (±0.11%). MMDVAE followed by Harmony had a p-value lower than 0.0001 in all the tested comparisons. Considering the other metrics, MMDVAE followed by Harmony generally obtained better results than the other strategies (see Additional file 3 and Figure 9).

Comparing the best AE for each of the tested dimensions (H,L) in terms of ARI, the vanilla GMMMD with the NB loss function obtained the best results for the dimension (128,16), while GMMMD followed by Harmony reached the best results for the dimensions (256,32) and (64,16), exploiting the Poisson loss function and the NB loss function, respectively. MMDAE followed by BBKNN (using the ZIP loss function) achieved the best results for the dimensions (256,16) and (128,64), exploiting the NB loss function and Poisson loss function, respectively. MMDVAE followed by Harmony was the best choice for the dimensions (128,16) and (256,64) when coupled with the NB loss function and the Poisson loss function, respectively. Finally, VAE followed by Harmony with the Poisson loss function obtained the best results for the dimension (64,32). As for the integration of the PBMC datasets, we used two Gaussian distributions because we merged two different datasets.
Figures 8C and D show the UMAP generated from the MMDVAE followed by Harmony space, coloured by the manually annotated cell-types and by the found clusters, respectively, while Figure 5C depicts the cells coloured by platform in the same UMAP space, confirming that the batch-effects between the two samples were correctly removed.

In this case, the achieved results show that MMDVAE followed by Harmony was able to better identify the “shared” cell-types across the different platforms.

Discussion

Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data [16]. Among the different non-linear approaches, AEs showed outstanding performance, outperforming other approaches like UMAP and t-SNE. Several AE-based methods have been developed so far, but integrating them with the common single-cell toolkits is a difficult task because they usually require input data encoded in a specific format. In addition, three different machine learning libraries are required to use them (i.e., Keras, TensorFlow, and PyTorch).

Here, we proposed scAEspy, a unifying and user-friendly tool that allows the user to use the most recent and promising AEs (i.e., VAE, MMDAE, MMDVAE, and GMVAE). We also designed and developed GMMMD and GMMMDVAE, two novel AEs that combine MMDAE and MMDVAE with GMVAE to exploit more than one Gaussian distribution. We introduced a learnable prior distribution in the latent space to model the dimensionality of the subpopulations of cells composing the data or to combine multiple samples.

We integrated AEs with both Harmony and BBKNN to remove the existing batch-effects among different datasets. Our results showed that exploiting the latent space to remove the existing batch-effects allows for a better identification of the cell subpopulations. As a batch-effect removal tool, Harmony achieved better results than BBKNN in the majority of the cases. When different droplet-based data have to be combined, our GMMMD and the MMDAE, coupled with the constrained NB and Poisson loss functions, obtained the best results compared to all the other AEs. When multiple datasets generated by using different scRNA-Seq platforms have to be combined and analysed, both our GMMMD and MMDVAE, mainly together with the NB and Poisson loss functions, outperformed the other strategies. However, GMVAE and the simple VAE also obtained outstanding performance, highlighting that the Kullback-Leibler divergence function can become fundamental to handle data spanning several orders of magnitude, especially the high values (up to millions) introduced by plate-based methods. Using more than one Gaussian distribution allows for a better integration of the datasets and a better separation of the cell-types when more than two datasets have to be integrated, as clearly shown by the results reached on the PIC datasets.

Considering the achieved results on the identification of the clusters, scAEspy can be used as the basis of methods that aim to automatically identify the cell-types composing the scRNA-Seq datasets under analysis [71]. As a matter of fact, scAEspy coupled with BBKNN was successfully applied to integrate 15 different foetal human samples, enabling the identification of rare blood progenitor cells [72].
Conclusions

In this study, we proposed an AE-based and user-friendly tool, named scAEspy, which allows for using the most recent and promising AEs to analyse scRNA-Seq data. The user can select the desired AE by setting only two user-defined parameters. Once the selected AE has been trained, it can be used to generate synthetic cells to increase the amount of data available for further downstream analyses (e.g., training classifiers). In scAEspy, the latent space is easily accessible and thus allows the user to perform different analyses, such as the correction of possible batch-effects in a reduced non-linear space or the inference of differentiation trajectories. In the latter case, the latent space can be utilised to generate the “pseudotime”, which measures the transcriptional changes that a cell undergoes during a dynamic process. Thanks to its modularity, scAEspy can be extended to accommodate new AEs, so that the user will always be able to utilise the latest cutting-edge AEs [73], which can improve the downstream analysis of scRNA-Seq data.

It is worth noting that scAEspy can be used on HPC infrastructures, based on both CPUs and GPUs, to speed up the computations. This is a crucial point when datasets composed of hundreds of thousands of cells are analysed. In such cases, the required running time increases drastically, so relying on HPC infrastructures is the best way to reduce an otherwise prohibitive running time.
Future improvements

As an improvement, prior biological knowledge about genes from ontologies can be incorporated into scAEspy. Ontologies can introduce useful information into machine learning systems that are used to solve biological problems. They allow for integrating data from different omics (e.g., genomics, transcriptomics, proteomics, and metabolomics) as structured representations of semantic knowledge, which are commonly used for the representation of biological concepts. This approach has been successfully applied to predict clinical targets from high-dimensional low-sample data [74]. Specifically, ontology embeddings are able to capture the semantic similarities among the genes, which can be exploited to sparsify the network connections. In addition, the Gene Ontology (GO) [75] can be exploited to interpret the features extracted from the latent spaces generated by the AEs, providing an explanation of the learned representations of the gene expression data. As a possible example, g:Profiler [76], focusing on GO terms, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome, can be used on the learned embeddings to investigate the joint effects of different gene sets within specific biological pathways. This approach can help the interpretability and explainability of the embeddings learned by the AEs.

Integration of multi-omics data

Since AEs showed outstanding performance in the integration of multi-omics cancer data [73], we plan to extend scAEspy to analyse other single-cell omics. For instance, AEs can be applied to analyse scATAC-seq data, where the identification of the cell-types is even more difficult due to technical challenges [11, 77]. scAEspy could be effectively applied to analyse disparate types of single-cell data from different points of view. The latent representations of different or combined single-cell omics can be used for further and more in-depth analyses.
For instance, the application of other machine learning techniques (e.g., deep neural networks) to the latent representations could facilitate the identification of interesting patterns in gene expression or methylation data, as well as relationships among genomic variants. In that regard, scAEspy can be the starting point to build a more comprehensive toolkit designed to integrate multiple single-cell omics, as an integration and extension of the work proposed in [73].

Methods

We developed scAEspy so that it can be easily integrated into both Scanpy and Seurat pipelines, as it works directly on a gene expression matrix (see Figure 2). We integrated into a single tool the latest and most powerful AEs designed to resolve the problems underlying scRNA-Seq data (e.g., sparsity, intrinsic noise, dropout events [3]). Specifically, scAEspy comprises six AEs, based on the VAE [18] and InfoVAE [37] architectures. The following AEs are included in scAEspy: VAE, MMDAE, MMDVAE, GMVAE, and two novel Gaussian-mixture AEs that we developed, called GMMMD and GMMMDVAE. GMMMD is a modification of the MMDAE where more than one Gaussian distribution is used to model different modes and only the MMD function is used as divergence function. GMMMDVAE is a combination of MMDVAE and GMVAE where both the MMD function [78] and the Kullback-Leibler divergence function [79] are used. scAEspy allows the user to exploit these six different AEs by setting two user-defined parameters, α and λ, which balance the MMD and the Kullback-Leibler divergence functions. We designed and developed GMMMD and GMMMDVAE starting from InfoVAE [37] and scVAE [21]. In addition, a learnable mixture distribution was used for the prior distribution in the latent space, and the marginal conditional distribution was also defined to be a learnable mixture distribution with the same number of components as the prior distribution. Finally, the user can also select one of the following loss functions: NB, constrained NB, Poisson, constrained Poisson, ZINB, constrained ZINB, ZIP, constrained ZIP, and Mean Square Error (MSE).

The tested batch-effect removal tools

Originally proposed to deal with batch-effects in microarray gene expression data [41], ComBat has been successfully applied to analyse scRNA-Seq data [80]. Briefly, the gene expression matrix is first standardised so that all genes have similar means and variances. Then, starting from the standardised matrix, standard distributions are fitted using a Bayesian approach to estimate the existing batch-effects in the data. Finally, the original expression matrix is corrected using the computed batch-effect estimators. In our tests, we used the default parameter settings provided by the Scanpy function combat. We then applied PCA on the space obtained by the top k (here, we set k = 1000) HVGs calculated by using the function provided by Scanpy (v.1.4.5.1), where the top HVGs are selected separately within each batch and then merged to avoid the selection of batch-specific genes. We calculated the first 50 components and applied the so-called “elbow method” to select the number of components for the downstream analysis. We used the first 18, 13, and 18 components for the PBMC, PIC, and MCA datasets, respectively.
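The “elbow method” is not specified further in the text, so the sketch below shows just one possible way to automate it: compute the first 50 principal components with scikit-learn and pick the point of the explained-variance curve that lies farthest below the straight line joining its first and last points. Both the heuristic and the synthetic matrix are assumptions made for illustration; they are not the procedure or the data used in this study.

```python
import numpy as np
from sklearn.decomposition import PCA


def elbow_position(explained_variance):
    """Return the 1-based position of the elbow of a decreasing curve.

    Heuristic (an assumption, not the paper's procedure): rescale both axes
    to [0, 1] and take the point farthest below the line joining the first
    and the last points of the curve.
    """
    values = np.asarray(explained_variance, dtype=float)
    x = np.linspace(0.0, 1.0, len(values))
    y = (values - values.min()) / (values.max() - values.min())
    # After rescaling, the line runs from (0, 1) to (1, 0), i.e. x + y = 1.
    gap_below_line = 1.0 - x - y
    return int(np.argmax(gap_below_line)) + 1


# Synthetic stand-in for a scaled expression matrix (cells x genes):
# five strong latent factors plus unit-variance noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
loadings = rng.normal(size=(5, 200))
matrix = 3.0 * latent @ loadings + rng.normal(size=(500, 200))

pca = PCA(n_components=50)
pca.fit(matrix)
n_selected = elbow_position(pca.explained_variance_)
print(f"Elbow found at component {n_selected} of {pca.n_components_}")
```

In practice the elbow is often chosen by eye from the variance plot; an automated rule such as this one is only a convenience and may disagree with a manual choice.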
After that, we calculated the neighbourhood graph by using the default parameter settings proposed in Scanpy. We clustered the obtained neighbourhood graphs with the Leiden algorithm by selecting the values of the resolution parameter such that the number of clusters was equal to the number of manually annotated clusters. Finally, we calculated all the metrics for each found resolution.

As another batch-effect removal tool, we used the CCA-based approach proposed in the Seurat package (v.2.3.4) [14]. We applied the RunCCA and MultiCCA Seurat functions to integrate two batches and more than two batches, respectively. Firstly, we normalised and log-transformed the counts. Then, we calculated the top 1000 HVGs by using the function provided by Scanpy (v.1.4.5.1). We also scaled the log-transformed data to zero mean and unit variance. In both the RunCCA and MultiCCA Seurat functions, as a first step, the CCA components (here, we exploited the first 20) are used to compute the linear combinations of the genes with the maximum correlation between the batches. Dynamic time warping (AlignSubspace Seurat function), which accounts for population density changes, is then used to align the calculated vectors and obtain a single low-dimensional subspace where the batch-effects are corrected. We calculated the neighbourhood graph, using the default parameter settings proposed in Scanpy, starting from the aligned low-dimensional subspace. We clustered the built neighbourhood graphs with the Leiden algorithm as explained before. Finally, we calculated all the metrics for each found resolution.
We also applied Harmony [40] to remove the batch-effects. Starting from a reduced space (e.g., PCA space or latent space), Harmony exploits an iterative clustering-based procedure to remove the multiple-dataset-specific batch-effects. In each iteration, the following 4 steps are applied: (i) the cells are grouped into multiple-dataset clusters by exploiting a variant of the soft k-means clustering, which is a fast and flexible method developed to cluster single-cell data; (ii) a centroid is calculated for each cluster and for each specific dataset; (iii) using the calculated centroids, a correction factor is derived for each dataset; (iv) the correction factors are then used to correct each cell with a cell-specific factor.

As a further batch-effect removal tool, we applied BBKNN [39]. Polanski et al. [38, 39] showed that BBKNN has comparable or better performance in removing batch-effects with respect to the CCA-based approach proposed in the Seurat package, Scanorama [81], and mnnCorrect [82]. In addition, BBKNN is a lightweight graph alignment method that requires minimal changes to the classical workflow. Indeed, it computes the k-nearest neighbours in a reduced space (e.g., PCA or latent space), where the nearest neighbours are identified in a batch-balanced manner using a user-defined distance (in our tests, we used the Euclidean distance). The neighbour information is transformed into connectivities to build a graph where all cells across batches are linked together. We used both Harmony and BBKNN to correct the PCA and AE spaces.
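The batch-balanced neighbour search described above can be illustrated with a simplified sketch built on scikit-learn's `NearestNeighbors`. This is not the BBKNN implementation: the function name, the parameter `k_per_batch`, and the random embedding are hypothetical, and the subsequent conversion of neighbours into graph connectivities is omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def batch_balanced_neighbours(embedding, batches, k_per_batch=3):
    """For every cell, find its k nearest cells within each batch separately.

    Returns an array of shape (n_cells, k_per_batch * n_batches) holding
    global cell indices; the columns are grouped by batch in sorted order.
    A cell's neighbours in its own batch include the cell itself.
    """
    embedding = np.asarray(embedding)
    batches = np.asarray(batches)
    neighbour_blocks = []
    for batch in np.unique(batches):
        batch_idx = np.flatnonzero(batches == batch)
        search = NearestNeighbors(n_neighbors=k_per_batch, metric="euclidean")
        search.fit(embedding[batch_idx])
        # Query all cells against this batch only; indices are local to the batch.
        _, local_idx = search.kneighbors(embedding)
        neighbour_blocks.append(batch_idx[local_idx])
    return np.hstack(neighbour_blocks)


# Hypothetical reduced space (e.g., PCA or latent space) for 40 cells in 2 batches.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(40, 10))
batches = np.array(["a"] * 25 + ["b"] * 15)

neighbours = batch_balanced_neighbours(embedding, batches, k_per_batch=3)
print(neighbours.shape)
```

Because every cell receives the same number of neighbours from each batch, cells from different batches are always linked in the resulting graph, which is the property the batch-balanced approach relies on.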
As a final step, we calculated the UMAP spaces starting from the built neighbourhood graphs and using the default parameter settings proposed in Scanpy, except for the initialisation of the low-dimensional embedding (i.e., in the umap function, init_pos was set to random and random_state to 10).

The proposed pipeline

We modified the workflow shown in Figure 1 by replacing PCA with AEs (Figure 2). We merged the gene expression matrices of E different samples (E = 2, E = 4, and E = 2 for the PBMC, PIC, and MCA datasets, respectively). We applied both PCA and AEs on the space obtained by the top 1000 HVGs calculated by using the latest implementation of the Scanpy function.

For PCA, we first normalised and log-transformed the counts; then we applied a classic standardisation, that is, the distribution of the expression of each gene was scaled to zero mean and unit variance. We calculated the first 50 components; after that, we used the “elbow method” to select the first 12, 14, and 19 components for the PBMC, PIC, and MCA datasets, respectively.

Regarding AEs, we used the original counts, since AEs were shown to achieve better results when applied to the raw counts [21]. Indeed, using the counts allows for exploiting discrete probability distributions, such as the Poisson and NB distributions, which obtained the best results in our tests. In all the tests presented here, we used a single hidden layer. In addition, we used 100 epochs, sigmoid activation functions, and a batch size of 100 samples (i.e., cells). In all tests, we used the Adam optimizer [83].

After that, we applied three different strategies (Figure 2): (i) we calculated the neighbourhood graph in both PCA and AE spaces by using the default parameter settings proposed in Scanpy. Then, we clustered the obtained neighbourhood graphs with the Leiden algorithm as described before.
Finally, we calculated all the metrics for each found resolution. (ii) we performed a similar analysis where we first corrected the PCA and AE spaces using Harmony [40] with the default parameter settings proposed in https://github.com/slowkow/harmonypy. (iii) we performed the same analysis described in (i) by replacing the neighbourhood graphs with those generated using BBKNN, using the default parameter settings.

The generalised formulation of scAEspy
In this work, we used the notation proposed in [37] to extend MMDVAE with multiple Gaussian distributions as well as to introduce a learnable prior distribution in the latent space. The idea behind the introduction of learnable coefficients is that they might be suitable to model the diversity among the subpopulations of cells composing the data or to combine multiple samples or datasets. We consider p*(x) as the unknown probability in the input space over which the optimisation problem is formulated, and z as the latent representation of x with |z| ≤ |x|. The encoder is identified by a function e_φ : x ↦ z, while the decoder by a function d_θ : z ↦ x. Recall that in VAEs, the input x is not mapped into a single point in the latent space, but it is represented by a probability distribution over the latent space. q(z) can be any possible distribution in the latent space and y ∈ {1, ..., K} is a categorical random variable, where K corresponds to the number of desired Gaussian distributions. As a general strict divergence function, we considered the MMD(·) divergence function [78]. The ELBO term proposed in this work, which is the measure maximised during the training of AEs, is:

$$
\mathrm{ELBO} = \mathbb{E}[\log d(x \mid z, y)] - (\alpha + \lambda - 1)\,\mathrm{MMD}(p_e(z) \,\|\, q(z)) - (1 - \alpha)\,\mathbb{E}[\mathrm{KL}(p_e(z, y \mid x) \,\|\, q(z, y))] \qquad (1)
$$

where KL(·) is the Kullback-Leibler divergence [79] between two distributions. All the mathematical details required to derive the generalised formula shown in Equation 1 can be found in the Additional file 1. Equation 1 allows the user to easily exploit VAE, MMDAE, MMDVAE, GMVAE, GMMMD, and GMMMDVAE (see Table 1).

Table 1 Setting of α, λ, and K to obtain the desired AE.
AE          α    λ    K
VAE         0    1    1
MMDAE       1    1    1
MMDVAE      0    2    1
GMVAE       0    1    > 1
GMMMD       1    1    > 1
GMMMDVAE    0    2    > 1

Availability and requirements
scAEspy is written in the Python programming language (v.3.6.5) and it relies on TensorFlow (v.1.12.0), an open-source and widely used machine learning library [35]. scAEspy requires the following Python libraries: NumPy, SciKit-Learn, Matplotlib, and Seaborn. scAEspy’s open-source code is available on GitLab: https://gitlab.com/cvejic-group/scaespy under the GPL-3 license. The repository contains all the scripts, code and Jupyter Notebooks used to obtain the results shown in the paper. In the provided Jupyter Notebooks, we show how easy it is to integrate scAEspy and Scanpy, and how the data can be visualised and explored by using both scAEspy and Scanpy’s functions.
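As a worked check of how Equation 1 reduces to each model in Table 1, the short Python sketch below computes the weights that multiply the MMD and KL terms for every (α, λ) setting. The names `TABLE_1`, `divergence_weights`, and `objective` are hypothetical and are not part of scAEspy; the numeric values passed to `objective` are made up purely to show the arithmetic.

```python
# (alpha, lambda, more than one Gaussian?) settings copied from Table 1.
TABLE_1 = {
    "VAE": (0, 1, False),
    "MMDAE": (1, 1, False),
    "MMDVAE": (0, 2, False),
    "GMVAE": (0, 1, True),
    "GMMMD": (1, 1, True),
    "GMMMDVAE": (0, 2, True),
}


def divergence_weights(alpha, lam):
    """Return (MMD weight, KL weight) as they appear in Equation 1."""
    return alpha + lam - 1, 1 - alpha


def objective(reconstruction, mmd, kl, alpha, lam):
    """Combine already-computed terms following Equation 1 (illustrative only)."""
    mmd_weight, kl_weight = divergence_weights(alpha, lam)
    return reconstruction - mmd_weight * mmd - kl_weight * kl


weights = {name: divergence_weights(a, l) for name, (a, l, _) in TABLE_1.items()}
for name, (mmd_weight, kl_weight) in weights.items():
    print(f"{name:9s} MMD weight = {mmd_weight}, KL weight = {kl_weight}")

# Made-up term values, only to show how the weights enter the objective.
example = objective(reconstruction=-10.0, mmd=2.0, kl=3.0, alpha=0, lam=2)
```

With these settings the VAE and GMVAE rows keep only the KL term, the MMDAE and GMMMD rows keep only the MMD term, and the MMDVAE and GMMMDVAE rows keep both, each with unit weight.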
We also provide a detailed description of scAEspy’s parameters so that it can be used by both novice and expert researchers for downstream analyses.

Competing interests
The authors declare that they have no competing interests.

Author’s contributions
AT conceived the project. AT and FR developed the software. AC, DB, and PL supervised the project and helped to interpret and present the results. AT performed all the tests and analysed the results. AT wrote the manuscript. AC, DB, and PL edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements
This research was supported by Cancer Research UK grant number C45041/A14953 (AC and AT), European Research Council project 677501 – ZF_Blood (AC) and a core support grant from the Wellcome Trust and MRC to the Wellcome Trust – Medical Research Council Cambridge Stem Cell Institute. We thank Dr. Leonardo Rundo (Department of Radiology, University of Cambridge) for their critical comments.

Author details
1Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK. 2Department of Haematology, University of Cambridge, Cambridge, UK. 3Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. 4Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy. 5Department of Computer Science and Technology, University of Cambridge, Cambridge, UK. 6Current address: Department of Human and Social Sciences, University of Bergamo, Bergamo, Italy. 7Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4), Milan, Italy.

References
1. Gladka, M.M., Molenaar, B., De Ruiter, H., Van Der Elst, S., Tsui, H., Versteeg, D., Lacraz, G.P., Huibers, M.M., Van Oudenaarden, A., Van Rooij, E.: Single-cell sequencing of the healthy and diseased heart reveals cytoskeleton-associated protein 4 as a new modulator of fibroblasts activation. Circulation 138(2), 166–180 (2018). doi:10.1161/CIRCULATIONAHA.117.030742 2.
Keren-Shaul, H., Spinrad, A., Weiner, A., Matcovitch-Natan, O., Dvir-Szternfeld, R., Ulland, T.K., David, E., Baruch, K., Lara-Astaiso, D., Toth, B., et al.: A unique microglia type associated with restricting development of alzheimer’s disease. Cell 169(7), 1276–1290 (2017). doi:10.1016/j.cell.2017.05.018 3. Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D.J., Hicks, S.C., Robinson, M.D., Vallejos, C.A., Campbell, K.R., Beerenwinkel, N., Mahfouz, A., et al.: Eleven grand challenges in single-cell data science. Genome Biol. 21(1), 1–35 (2020). doi:10.1186/s13059-020-1926-6 4. Steinbach, M., Ertöz, L., Kumar, V.: The challenges of clustering high dimensional data. In: New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition, pp. 273–309. Springer, Berlin, Heidelberg (2004). doi:10.1007/978-3-662-08968-2_16 5. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom Intell Lab Syst. 2(1-3), 37–52 (1987). doi:10.1016/0169-7439(87)80084-9 6. Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. J Mach Learn Res. 9(Nov), 2579–2605 (2008) 7. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) 8. Becht, E., McInnes, L., Healy, J., Dutertre, C.-A., Kwok, I.W., Ng, L.G., Ginhoux, F., Newell, E.W.: Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 37(1), 38 (2019). doi:10.1038/nbt.4314 9. Luecken, M.D., Theis, F.J.: Current best practices in single-cell rna-seq analysis: a tutorial. Mol. Syst. Biol. 15(6), 8746 (2019). doi:10.15252/msb.20188746 10. Hwang, B., Lee, J.H., Bang, D.: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 50(8), 96 (2018). doi:10.1038/s12276-018-0071-8 11. 
Luecken, M.D., Buttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M.F., Strobl, D.C., Zappia, L., Dugas, M., Colomé-Tatché, M., et al.: Benchmarking atlas-level data integration in single-cell genomics. BioRxiv (2020). doi:10.1101/2020.05.22.111161 12. Tran, H.T.N., Ang, K.S., Chevrier, M., Zhang, X., Lee, N.Y.S., Goh, M., Chen, J.: A benchmark of batch-effect correction methods for single-cell rna sequencing data. Genome Biol. 21(1), 1–32 (2020). doi:10.1186/s13059-019-1850-9 13. Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11(10), 733–739 (2010). doi:10.1038/nrg2825 14. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., Satija, R.: Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 36(5), 411 (2018). doi:10.1038/nbt.4096 15.
Bacher, R., Kendziorski, C.: Design and computational analysis of single-cell rna-sequencing experiments. Genome Biol. 17(1), 63 (2016). doi:10.1186/s13059-016-0927-y 16. Ding, J., Condon, A., Shah, S.P.: Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 9(1), 2002 (2018). doi:10.1038/s41467-018-04368-5 17. Eraslan, G., Simon, L.M., Mircea, M., Mueller, N.S., Theis, F.J.: Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun. 10(1), 390 (2019). doi:10.1038/s41467-018-07931-2 18. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 19. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N.: Deep generative modeling for single-cell transcriptomics. Nat Methods 15(12), 1053 (2018). doi:10.1038/s41592-018-0229-2 20. Svensson, V., Gayoso, A., Yosef, N., Pachter, L.: Interpretable factor models of single-cell rna-seq via variational autoencoders. Bioinformatics 36(11), 3418–3421 (2020). doi:10.1093/bioinformatics/btaa169 21. Grønbech, C.H., Vording, M.F., Timshel, P.N., Sønderby, C.K., Pers, T.H., Winther, O.: scVAE: Variational auto-encoders for single-cell gene expression data. bioRxiv, 318295 (2018). doi:10.1101/318295 22. Grønbech, C.H., Vording, M.F., Timshel, P.N., Sønderby, C.K., Pers, T.H., Winther, O.: scVAE: Variational auto-encoders for single-cell gene expression data. Bioinformatics (2020). doi:10.1093/bioinformatics/btaa293 23. Tran, D., Nguyen, H., Tran, B., Nguyen, T.: Fast and precise single-cell data analysis using hierarchical autoencoder. bioRxiv, 799817 (2019). doi:10.1101/799817 24. Rousseeuw, J.: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1989) 25. Bica, I., Andrés-Terré, H., Cvejic, A., Liò, P.: Unsupervised generative and graph representation learning for modelling cell differentiation. Sci. Rep. 10(1), 1–13 (2020). doi:10.1038/s41598-020-66166-8 26. 
Wang, D., Gu, J.: VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genom Proteom Bioinf. 16(5), 320–331 (2018). doi:10.1016/j.gpb.2018.08.003 27. Lin, E., Mukherjee, S., Kannan, S.: A deep adversarial variational autoencoder model for dimensionality reduction in single-cell rna sequencing analysis. BMC bioinformatics 21(1), 1–11 (2020). doi:10.1186/s12859-020-3401-5 28. Geddes, T.A., Kim, T., Nan, L., Burchfield, J.G., Yang, J.Y., Tao, D., Yang, P.: Autoencoder-based cluster ensembles for single-cell rna-seq data analysis. BMC bioinformatics 20(19), 660 (2019). doi:10.1186/s12859-019-3179-5 29. Talwar, D., Mongia, A., Sengupta, D., Majumdar, A.: AutoImpute: Autoencoder based imputation of single-cell RNA-seq data. Sci. Rep. 8(1), 16329 (2018). doi:10.1038/s41598-018-34688-x 30. Sun, S., Liu, Y., Shang, X.: Deep generative autoencoder for low-dimensional embeding extraction from single-cell rnaseq data. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1365–1372 (2019). doi:10.1109/BIBM47256.2019.8983289. IEEE 31. Badsha, M.B., Li, R., Liu, B., Li, Y.I., Xian, M., Banovich, N.E., Fu, A.Q.: Imputation of single-cell gene expression with an autoencoder neural network. Quant. Biol., 1–17 (2020). doi:10.1007/s40484-019-0192-7 32. Rao, J., Zhou, X., Lu, Y., Zhao, H., Yang, Y.: Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. bioRxiv (2020). doi:10.1101/2020.02.05.935296 33. Wolf, F.A., Angerer, P., Theis, F.J.: SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19(1), 15 (2018). doi:10.1186/s13059-017-1382-0 34. Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., Regev, A.: Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 33(5), 495 (2015). doi:10.1038/nbt.3192 35. 
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: Proceedings of the Symposium on Operating Systems Design and Implementation, pp. 265–283 (2016) 36. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: Proceedings of the Conference on Advances in Neural Information Processing Systems (2017) 37. Zhao, S., Song, J., Ermon, S.: InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262 (2017) 38. Park, J.-E., Polański, K., Meyer, K., Teichmann, S.A.: Fast batch alignment of single cell transcriptomes unifies multiple mouse cell atlases into an integrated landscape. bioRxiv, 397042 (2018) 39. Polański, K., Young, M.D., Miao, Z., Meyer, K.B., Teichmann, S.A., Park, J.-E.: BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36(3), 964–965 (2020). doi:10.1093/bioinformatics/btz625 40. Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., Baglaenko, Y., Brenner, M., Loh, P.-r., Raychaudhuri, S.: Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019). doi:10.1038/s41592-019-0619-0 41. Johnson, W.E., Li, C., Rabinovic, A.: Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8(1), 118–127 (2007). doi:10.1093/biostatistics/kxj037 42. Leek, J.T., Johnson, W.E., Parker, H.S., Jaffe, A.E., Storey, J.D.: The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28(6), 882–883 (2012). doi:10.1093/bioinformatics/bts034 43. Pedersen, B.: Python implementation of ComBat. GitHub (2012) 44.
Kang, H.M., Subramaniam, M., Targ, S., Nguyen, M., Maliskova, L., McCarthy, E., Wan, E., Wong, S., Byrnes, L., Lanata, C.M., et al.: Multiplexed droplet single-cell rna-sequencing using natural genetic variation. Nat Biotechnol. 36(1), 89 (2018). doi:10.1038/nbt.4042 45. Grün, D., Muraro, M.J., Boisset, J.-C., Wiebrands, K., Lyubimova, A., Dharmadhikari, G., van den Born, M., van Es, J., Jansen, E., Clevers, H., et al.: De novo prediction of stem cell identity using single-cell transcriptome data.
Cell stem cell 19(2), 266–277 (2016). doi:10.1016/j.stem.2016.05.010 46. Muraro, M.J., Dharmadhikari, G., Grün, D., Groen, N., Dielen, T., Jansen, E., van Gurp, L., Engelse, M.A., Carlotti, F., de Koning, E.J., et al.: A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3(4), 385–394 (2016). doi:10.1016/j.cels.2016.09.002 47. Lawlor, N., George, J., Bolisetty, M., Kursawe, R., Sun, L., Sivakamasundari, V., Kycia, I., Robson, P., Stitzel, M.L.: Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res. 27(2), 208–222 (2017). doi:10.1101/gr.212720.116 48. Segerstolpe, Å., Palasantza, A., Eliasson, P., Andersson, E.-M., Andréasson, A.-C., Sun, X., Picelli, S., Sabirsh, A., Clausen, M., Bjursell, M.K., et al.: Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24(4), 593–607 (2016). doi:10.1016/j.cmet.2016.08.020 49. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W.M., Hao, Y., Stoeckius, M., Smibert, P., Satija, R.: Comprehensive integration of single-cell data. Cell (2019). doi:10.1016/j.cell.2019.05.031 50. Han, X., Wang, R., Zhou, Y., Fei, L., Sun, H., Lai, S., Saadatpour, A., Zhou, Z., Chen, H., Ye, F., et al.: Mapping the mouse cell atlas by microwell-seq. Cell 172(5), 1091–1107 (2018). doi:10.1016/j.cell.2018.02.001 51. Consortium, T.M., et al.: Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature 562, 367–372 (2018). doi:10.1038/s41586-018-0590-4 52. Hubert, L., Arabie, P.: Comparing partitions. J Classif. 2(1), 193–218 (1985). doi:10.1007/BF01908075 53. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 3(Dec), 583–617 (2002) 54. Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J Am Stat Assoc. 78(383), 553–569 (1983) 55. 
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 11(Oct), 2837–2854 (2010) 56. Rosenberg, A., Hirschberg, J.: V-measure: A conditional entropy-based external cluster evaluation measure. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410–420 (2007) 57. Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al.: Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5), 1202–1214 (2015). doi:10.1016/j.cell.2015.05.002 58. Klein, A.M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D.A., Kirschner, M.W.: Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5), 1187–1201 (2015). doi:10.1016/j.cell.2015.04.044 59. Hashimshony, T., Wagner, F., Sher, N., Yanai, I.: Cel-seq: single-cell rna-seq by multiplexed linear amplification. Cell Rep. 2(3), 666–673 (2012). doi:10.1016/j.celrep.2012.08.003 60. Hashimshony, T., Senderovich, N., Avital, G., Klochendler, A., de Leeuw, Y., Anavy, L., Gennert, D., Li, S., Livak, K.J., Rozenblatt-Rosen, O., et al.: Cel-seq2: sensitive highly-multiplexed single-cell rna-seq. Genome Biol. 17(1), 77 (2016). doi:10.1186/s13059-016-0938-8 61. Zheng, G.X., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., et al.: Massively parallel digital transcriptional profiling of single cells. Nat Commun. 8, 14049 (2017). doi:10.1038/ncomms14049 62. Gierahn, T.M., Wadsworth II, M.H., Hughes, T.K., Bryson, B.D., Butler, A., Satija, R., Fortune, S., Love, J.C., Shalek, A.K.: Seq-well: portable, low-cost rna sequencing of single cells at high throughput. 
Nat Methods 14(4), 395 (2017). doi:10.1038/nmeth.4179 63. Islam, S., Kjällquist, U., Moliner, A., Zajac, P., Fan, J.-B., Lönnerberg, P., Linnarsson, S.: Characterization of the single-cell transcriptional landscape by highly multiplex rna-seq. Genome Res. 21(7), 1160–1167 (2011). doi:10.1101/gr.110882.110 64. Ramsköld, D., Luo, S., Wang, Y.-C., Li, R., Deng, Q., Faridani, O.R., Daniels, G.A., Khrebtukova, I., Loring, J.F., Laurent, L.C., et al.: Full-length mrna-seq from single-cell levels of rna and individual circulating tumor cells. Nat Biotechnol. 30(8), 777 (2012). doi:10.1038/nbt.2282 65. Picelli, S., Faridani, O.R., Björklund, Å.K., Winberg, G., Sagasser, S., Sandberg, R.: Full-length rna-seq from single cells using smart-seq2. Nat Protoc. 9(1), 171 (2014). doi:10.1038/nprot.2014.006 66. Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N., Jung, S., Tanay, A., et al.: Massively parallel single-cell rna-seq for marker-free decomposition of tissues into cell types. Science 343(6172), 776–779 (2014). doi:10.1126/science.1247651 67. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 9 (2019). doi:10.1038/s41598-019-41695-z 68. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. of Math. Stat., 50–60 (1947) 69. Wilcoxon, F.: Individual comparisons by ranking methods. In: Breakthroughs in Statistics, pp. 196–202. Springer, New York, NY (1992). doi:10.1007/978-1-4612-4380-9_16 70. Dunn, O.J.: Multiple comparisons among means. J. Am. Stat. Assoc. 56(293), 52–64 (1961) 71. Ma, F., Pellegrini, M.: ACTINN: Automated identification of cell types in single cell RNA sequencing. Bioinformatics (2019). doi:10.1093/bioinformatics/btz592 72. 
Ranzoni, A.M., Tangherloni, A., Berest, I., Riva, S.G., Myers, B., Strzelecka, P.M., Xu, J., Panada, E., Mohorianu, I., Zaugg, J.B., et al.: Integrative single-cell rna-seq and atac-seq analysis of human foetal liver and bone marrow haematopoiesis. BioRxiv (2020). doi:10.1101/2020.05.06.080259 73. Simidjievski, N., Bodnar, C., Tariq, I., Scherer, P., Andres Terre, H., Shams, Z., Jamnik, M., Liò, P.: Variational autoencoders for cancer data integration: design principles and computational practice. Front. Genet. 10, 1205 (2019). doi:10.3389/fgene.2019.01205 74.
Trębacz, M., Shams, Z., Jamnik, M., Scherer, P., Simidjievski, N., Terre, H.A., Liò, P.: Using ontology embeddings for structural inductive bias in gene expression data analysis. arXiv preprint arXiv:2011.10998 (2020) 75. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000) 76. Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., Vilo, J.: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucl. Acids Res. 47(W1), 191–198 (2019). doi:10.1093/nar/gkz369 77. Chen, X., Miragaia, R.J., Natarajan, K.N., Teichmann, S.A.: A rapid and robust method for single cell chromatin accessibility profiling. Nat Commun. 9(1), 5345 (2018). doi:10.1038/s41467-018-07771-0 78. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample-problem. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 513–520 (2007) 79. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann Math Statist. 22(1), 79–86 (1951). doi:10.1214/aoms/1177729694 80. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., Vert, J.-P.: A general and flexible method for signal extraction from single-cell rna-seq data. Nat. Commun. 9(1), 1–17 (2018). doi:10.1038/s41467-017-02554-5 81. Hie, B.L., Bryson, B., Berger, B.: Panoramic stitching of heterogeneous single-cell transcriptomic data. BioRxiv, 371179 (2018) 82. Haghverdi, L., Lun, A.T., Morgan, M.D., Marioni, J.C.: Batch effects in single-cell rna-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 36(5), 421 (2018). doi:10.1038/nbt.4091 83. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980 (2014)

Additional Files
Additional file 1: Mathematical formulation of the proposed autoencoders. We provide the mathematical derivation of GMMMD and GMMMDVAE as well as the generalised formulation that we derived by following the notation proposed in [37].
Additional file 2: Excel file of the metrics calculated for the PBMC datasets. Each tab is related to a tested approach and shows the calculated metrics and used method.
Additional file 3: Excel file of the metrics calculated for the PIC datasets. Each tab is related to a tested approach and shows the calculated metrics and used method.
Additional file 4: Excel file of the metrics calculated for the MCA datasets. Each tab is related to a tested approach and shows the calculated metrics and used method.

Figures
Figure 1 A common workflow for the downstream analysis of scRNA-Seq data.
The workflow includes the following seven steps: (i) quality control to remove low-quality cells that may add technical noise, which could obscure the real biological signals; (ii) normalisation and log-transformation; (iii) identification of the HVGs to reduce the dimensionality of the dataset by including only the most informative genes; (iv) standardisation of each gene to zero mean and unit variance; (v) dimensionality reduction generally obtained by applying PCA; (vi) clustering of the cells starting from the low-dimensional representation of the data that are used to annotate the obtained clusters (i.e., identification of known and putatively novel cell-types); (vii) data visualisation on the low-dimensional space generated by applying a non-linear approach (e.g., t-SNE or UMAP) on the reduced space calculated in step (v).

Figure 2 The proposed workflow to integrate different samples. Given E different samples, their gene expression matrices are merged.
Then, the top k HVGs are selected by considering the different samples. Specifically, they are selected within each sample separately and then merged to avoid the selection of batch-specific genes. scAEspy is used to reduce the HVG space (k dimensions), and the obtained latent space can be (i) used to calculate a t-SNE space, (ii) corrected by Harmony, and (iii) used to infer an uncorrected neighbourhood graph. The latent space corrected by Harmony is then used to build a neighbourhood graph, which is clustered by using the Leiden algorithm and used to calculate a UMAP space. Otherwise, BBKNN is applied to rebuild an uncorrected neighbourhood graph by taking into account the possible batch-effects. The neighbourhood graph corrected by BBKNN is then clustered by using the Leiden algorithm and used to calculate a UMAP space. In order to assign the correct label to the obtained clusters, the marker genes are calculated by using the Mann–Whitney U test. Finally, the annotated clusters can be visualised in both t-SNE and UMAP space.
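The per-sample HVG selection mentioned in the caption of Figure 2 can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Scanpy procedure used in the paper: genes are ranked by the variance of their log-transformed counts, a gene is flagged when it falls in the top k of a sample, and the merged list prefers genes flagged in many samples. The function name `merge_hvgs` and the toy matrices are hypothetical.

```python
import numpy as np


def merge_hvgs(samples, k):
    """Pick k genes that are highly variable within the individual samples.

    samples : list of (n_cells_i, n_genes) count matrices sharing one gene order.
    Returns the sorted column indices of the selected genes.
    Simplified stand-in for a batch-aware HVG selection.
    """
    n_genes = samples[0].shape[1]
    times_selected = np.zeros(n_genes, dtype=int)
    rank_sum = np.zeros(n_genes)
    for counts in samples:
        variance = np.log1p(counts).var(axis=0)
        # Rank 0 is the most variable gene of this sample.
        ranks = np.argsort(np.argsort(-variance))
        times_selected += ranks < k
        rank_sum += ranks
    # Primary key: flagged in many samples; secondary key: low summed rank.
    order = np.lexsort((rank_sum, -times_selected))
    return np.sort(order[:k])


# Toy counts (assumption): 4 cells x 4 genes per sample.
sample_a = np.array([[0, 5, 0, 1], [10, 5, 8, 1], [0, 5, 0, 1], [10, 5, 8, 1]])
sample_b = np.array([[0, 5, 0, 0], [9, 5, 7, 3], [0, 5, 0, 0], [9, 5, 7, 3]])
selected = merge_hvgs([sample_a, sample_b], k=2)
```

Selecting within each sample first means that a gene whose variability comes only from the difference between samples is not favoured, which is the motivation given in the caption.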
[Figure 3 image: (A, B) boxplots with ARI on the y-axis and the compared strategies on the x-axis; (C, D) UMAP1 vs. UMAP2 scatter plots. Legend in (C): B cells, CD14+ Monocytes, CD4 T cells, CD8 T cells, Dendritic cells, FCGR3A+ Monocytes, Megakaryocytes, NK cells. Legend in (D): clusters 0 to 7.]

Figure 3 Results obtained on the PBMC datasets. (A) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, MMDAE followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA followed by Harmony on the PBMC datasets. (B) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (C) UMAP visualisation of the cell-types manually annotated in the original paper. (D) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by MMDAE followed by Harmony. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).
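Since the clustering resolution is chosen as the one giving the best ARI against the manual annotation, a small worked example may help. The sketch below computes the Adjusted Rand Index from its standard pair-counting definition and picks the best of several candidate clusterings. The function names and the toy labels are hypothetical, and the resolution values are placeholders rather than settings taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings (pair-counting form)."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are a single cluster or all singletons
    return (sum_cells - expected) / (max_index - expected)


def best_resolution(labels_true, clusterings):
    """Return the key (e.g. a resolution) whose clustering maximises the ARI."""
    return max(clusterings,
               key=lambda r: adjusted_rand_index(labels_true, clusterings[r]))


annotation = [0, 0, 1, 1]
candidates = {
    0.5: [0, 0, 0, 0],  # under-clustered
    1.0: [1, 1, 0, 0],  # same partition, different label names
    2.0: [0, 1, 2, 3],  # over-clustered
}
print(best_resolution(annotation, candidates))
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

The ARI depends only on the partition, not on the label names, so a clustering that matches the annotation up to a relabelling scores 1. The boxplots report values on a 0 to 100 scale, so the value returned here would be multiplied by 100 for comparison.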
[Figure 4 image: boxplots in panels (A) to (J); y-axes AMII (A, B), FMS (C, D), HS (E, F), CS (G, H) and VM (I, J); x-axes list the compared strategies (left column) and the best AE for each tested dimension (right column).]

Figure 4 Boxplots showing the values of the metrics calculated on the PBMC datasets using CCA, ComBat, PCA, MMDAE followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA followed by Harmony, as well as using the best AE for each of the tested dimensions (H, L). (A) AMII achieved by the different strategies. (B) AMII achieved by the best AE for each of the tested dimensions. (C) FMS achieved by the different strategies. (D) FMS achieved by the best AE for each of the tested dimensions. (E) HS achieved by the different strategies. (F) HS achieved by the best AE for each of the tested dimensions. (G) CS achieved by the different strategies. (H) CS achieved by the best AE for each of the tested dimensions. (I) VM achieved by the different strategies. (J) VM achieved by the best AE for each of the tested dimensions. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).
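The star annotations on the boxplots encode p-value ranges. As a minimal sketch, assuming the conventional five-level scheme with cut-offs at 0.0001, 0.001, 0.01 and 0.05, the mapping from a p-value to its annotation can be written as a simple lookup. The function name is hypothetical.

```python
def significance_label(p_value):
    """Map a p-value to the star annotation used on the boxplots (assumed cut-offs)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Cut-offs are checked from the most to the least significant level.
    for cutoff, label in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p_value <= cutoff:
            return label
    return "ns"


for p in (0.00005, 0.0005, 0.005, 0.03, 0.2):
    print(p, significance_label(p))
```

Each upper bound is inclusive, so a p-value exactly equal to a cut-off receives the more significant label.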
[Figure 5 image: three UMAP1 vs. UMAP2 scatter plots. Legend in (A): Control cells, Treated cells. Legend in (B): CEL-Seq, CEL-Seq2, Fluidigm C1, Smart-Seq2. Legend in (C): Microwell-Seq, Smart-Seq2.]

Figure 5 UMAP visualisation showing the sample alignment performed by Harmony in the latent space obtained by MMDAE with dimension (256, 32) for the PBMC datasets (A), by GMMMD with dimension (256, 32) for the PIC datasets (B), and by MMDVAE with dimension (256, 64) for the MCA datasets (C).

[Figure 6 image: (A, B) boxplots with ARI on the y-axis and the compared strategies on the x-axis; (C, D) UMAP1 vs. UMAP2 scatter plots. Legend in (C): Acinar cells, Activated stellate, Alpha cells, Beta cells, Delta cells, Ductal cells, Endothelial cells, Epsilon cells, Gamma cells, Macrophages, Mast cells, Quiescent stellate, Schwann cells. Legend in (D): clusters 0 to 12.]

Figure 6 Results obtained on the PIC datasets. (A) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, GMMMD followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA followed by Harmony on the PIC datasets. (B) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (C) UMAP visualisation of the cell-types manually annotated in the original paper.
(D) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by GMMMD followed by Harmony. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).

[Figure 7 image: boxplots in panels (A) to (J); y-axes AMII (A, B), FMS (C, D), HS (E, F), CS (G, H) and VM (I, J); x-axes list the compared strategies (left column) and the best AE for each tested dimension (right column).]

Figure 7 Boxplots showing the values of the metrics calculated on the PIC datasets using CCA, ComBat, PCA, GMMMD followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA followed by Harmony, as well as using the best AE for each of the tested dimensions (H, L). (A) AMII achieved by the different strategies. (B) AMII achieved by the best AE for each of the tested dimensions. (C) FMS achieved by the different strategies.
(D) FMS achieved by the best AE for each of the tested dimensions. (E) HS achieved by the different strategies. (F) HS achieved by the best AE for each of the tested dimensions. (G) CS achieved by the different strategies. (H) CS achieved by the best AE for each of the tested dimensions. (I) VM achieved by the different strategies. (J) VM achieved by the best AE for each of the tested dimensions. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).

[Figure 8 image: (A, B) boxplots with ARI on the y-axis and the compared strategies on the x-axis; (C, D) UMAP1 vs. UMAP2 scatter plots. Legend in (C): B cells, Dendritic cells, Endothelial cells, Epithelial cells, Macrophages, Monocytes, NK cells, Neutrophils, Smooth-muscle cells, Stromal cells, T cells. Legend in (D): clusters 0 to 10.]

Figure 8 Results obtained on the MCA datasets. (A) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, MMDVAE followed by Harmony with dimension (256, 64), PCA followed by BBKNN, and PCA followed by Harmony. (B) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (C) UMAP visualisation of the cell-types manually annotated in the original paper.
(D) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by MMDVAE followed by Harmony. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).

[Figure 9 image: boxplots in panels (A) to (J); y-axes AMII (A, B), FMS (C, D), HS (E, F), CS (G, H) and VM (I, J); x-axes list the compared strategies (left column) and the best AE for each tested dimension (right column).]

Figure 9 Boxplots showing the values of the metrics calculated on the MCA datasets using CCA, ComBat, PCA, MMDVAE followed by Harmony with dimension (256, 64), PCA followed by BBKNN, and PCA followed by Harmony, as well as using the best AE for each of the tested dimensions (H, L). (A) AMII achieved by the different strategies. (B) AMII achieved by the best AE for each of the tested dimensions. (C) FMS achieved by the different strategies. (D) FMS achieved by the best AE for each of the tested dimensions. (E) HS achieved by the different strategies.
(F) HS achieved by the best AE for each of the tested dimensions. (G) CS achieved by the different strategies. (H) CS achieved by the best AE for each of the tested dimensions. (I) VM achieved by the different strategies. (J) VM achieved by the best AE for each of the tested dimensions. p-value ≤ 0.0001 (****); 0.0001 < p-value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value ≤ 0.05 (*); p-value > 0.05 (ns).

Abstract